Hybrid Memory Cube (HMC), in production by Micron Technology, is a new DRAM component that has multiple advantages over current parts, including higher bandwidth, lower energy, an abstract and more pin-efficient interface, and other benefits. The memory technology can be used as a base for even further improvements, including scaling memory to multiple terabytes of capacity and terabyte-per-second bandwidths per processor, and resilience such that even large supercomputers with hundreds of petabytes of memory will have reliable memory systems. Future systems, from desktops up, will have memory systems of multiple levels, including DRAM and non-volatile (NAND?) components that are both first-level memory capabilities, along with DRAM or SRAM scratch memory, such that total data motion is greatly reduced. The result can be improved system performance and reduced system power.
INTRODUCTION
What has been called the 'memory wall' [1] has been a problem for many years (more than a couple of decades for some large science applications). The ratio of memory size to CPU performance has been continually decreasing and the same is true for the ratio of memory bandwidth to CPU performance.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
MEMSYS '15, October 05-08, 2015, Washington DC, DC, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3604-8/15/10…$15.00 DOI: http://dx.doi.org/10.1145/2818950.2818960
To improve memory bandwidth and packaging density, High Bandwidth Memory (HBM), an upcoming JEDEC standard, and HMC from Micron Technology/Hybrid Memory Cube Consortium are coming to market. However, at least initially, HBM or HMC is introduced as another level of memory, so that a small amount (say 2 to 16 Gbytes) of those very high bandwidth parts is mounted close to each CPU, with the majority of memory being DDR4 of much larger total size, but with the size, bandwidth, and power limitations of that technology. The result is a memory system that is harder to use but still does not have the sizes or bandwidths that are desirable and needed for a great many important applications.
Another concern with current DRAM implementations is resilience. Large systems can easily have many millions of memory die. While each DRAM die is extremely reliable, the large number of parts can result in systems that have memory errors multiple times per day. The parts are vulnerable to soft errors caused by stray neutrons, secondary cosmic ray particles, electrical noise, etc. Individual DRAM cells fail, and whole rows or columns of memory banks fail.
Both HBM and HMC significantly reduce memory power by having a memory request go to a single part rather than one or two memory modules of 9 or 18 parts each. As a result, a failure can have a greater effect than in current systems where Chipkill or single-bit error correction/double-bit error detection (SECDED) is used: all the referenced data is lost rather than a small portion, whereas in current systems the remaining parts enable recovery from the failure. Both new memory types offer improved error detection and correction, but something like a row failure in one of the new memory parts is unrecoverable, whereas Chipkill in current memory systems can fix such errors.
What is needed is a memory technology that offers low power with high bandwidth, flexibility in usage, easy scalability, and the opportunity for high reliability/failure resilience. While lower-end systems will want to leave out any of the extra costs involved with error correction, it is important to have the flexibility to add these improvements for higher-end systems.
In contrast to HBM, HMC provides the basic capability to accomplish all of those goals, and also provides the opportunity to be extended so that terabytes of memory at terabyte-per-second bandwidths can be connected to a CPU chip with a very reasonable number of wires and pads. Of course memory components can fail, but upgraded HMC can be integrated into a memory system that shields users from these failures. What is proposed is a memory architecture and structure that can keep running in spite of almost any failure; even complete memory modules can die and the system keeps going. In addition to the possibility of greatly increased resilience, HMC technology offers other capabilities that improve system and application performance, some of which will be noted here.
HMC memory, even as proposed here, is not the ideal memory system, particularly for high-end and exascale systems. There is a short set of suggestions with respect to a complete memory system at the end of this paper.
For those who are not familiar with HMC, here is a high level view of the component architecture:
• 4 or 8 DRAM die mounted on a CMOS Logic Base die.
• The Base has multiple independent memory controllers, each controlling the portion of the die stack directly above the respective portion of the Base. A controller and its controlled memory together are called a Vault, as that memory has only a single access. Each vault controller can make accesses to the multiple banks within its vault out of order, and does this to optimize sustained memory bandwidth.
• A vault has 8 memory banks if 4 high and 16 banks if 8 high.
• Multiple I/O memory links that can access any of the vaults. Each link is full duplex.
And some characteristics of the next generation of cubes:
• 4 Gbytes (4 die stack) or 8 Gbytes (8 die stack)
• 32 Vaults, each with 8 memory banks (4 die) or 16 (8 die). Thus a single cube has 256 or 512 memory banks.
• Up to 480 Gbytes/sec total memory bandwidth in multiple IO links; each IO link has up to 120 Gbytes/sec bandwidth (up to 60 Gbytes/sec in each direction, full duplex).
There are many features described in the HMC Specification 2.0 document that are not covered here, such as built-in error correction, self diagnostics (BIST), variable length requests of 16 to 256 bytes, in-field repair, and more.
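The bank counts and link figures quoted in the list above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and the link count is inferred by assuming the total bandwidth is simply the per-link figure times the number of links.

```python
# Figures quoted in the text for the next generation of cubes.
VAULTS_PER_CUBE = 32
BANKS_PER_VAULT = {4: 8, 8: 16}   # die-stack height -> banks per vault
LINK_GBYTES_PER_SEC = 120         # per link, both directions combined
TOTAL_GBYTES_PER_SEC = 480        # per cube, all links


def banks_per_cube(stack_height: int) -> int:
    """Total banks in one cube for a 4-high or 8-high die stack."""
    return VAULTS_PER_CUBE * BANKS_PER_VAULT[stack_height]


# Assumption: total bandwidth = per-link bandwidth * number of links.
implied_links = TOTAL_GBYTES_PER_SEC // LINK_GBYTES_PER_SEC

print(banks_per_cube(4), banks_per_cube(8), implied_links)
```

Under that assumption the figures are consistent with a four-link part, which matches the later remark that four links is likely the most common case.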

Upgrading HMC for scalability and resilience
A major feature of HMC parts is that they can be chained. Each part has multiple channel links and logic in the part can determine that a request packet is not for that part and then send the request on another port. The latency for receiving a request, determining that it needs to be sent on a chaining port and sending on the request can be fairly short, at least relative to normal memory access time, such that multiple parts can be put into a single chain with a small to moderate effect on memory access time for requests to parts at the end of the chain.
The proposal here is a chain of nine parts. See Figure 2.
The chain is accessed at both ends; the middle HMC part stores error correction data so that a failure in any of the other eight data parts can be recovered. Each byte in the ECC part holds the 'sum' (XOR) of the bytes with the same byte address in all the data parts in the chain.
That each end of the chain is active means that the bandwidth of the chain is doubled; it also means that if a part in the chain fails such that it does not communicate correctly, all the parts in the chain except the failing part can still be accessed. Even if the chain is broken at some point, all data can still be accessed by the recovery operations described below.
When a write operation is made to one of the data parts, logic in the referenced cube reads the current data at the referenced address, XORs that data with the data to be written, and sends that difference as an update to the ECC part further down the chain. At the same time that the ECC update is being done, the write operation is completed, replacing the data at the referenced address. As is standard practice in all DRAMs, at each memory bank the read and write operations are part of the same memory function sequence, not two references.
The ECC part gets the data and does a Read/Modify/Write operation that XORs the difference data from the data cube into the current contents of the referenced address. [As Boolean data is being operated on, an XOR operation is both addition and subtraction.] The difference data update thus means that the ECC part always holds the current sum of all the data in the data parts.
Memory is initialized at the start of operations so that things start in the correct state.
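The write path just described can be modeled in a few lines. The following is a minimal software sketch, not the hardware design: the class name `CubeChain`, the toy 16-byte capacity, and the all-zero initialization are assumptions made for illustration. It shows that sending only the difference (old XOR new) to the ECC part keeps the ECC contents equal to the XOR of all data parts.

```python
NUM_DATA_CUBES = 8   # a 4 + 1 + 4 chain: eight data parts plus one ECC part
CUBE_BYTES = 16      # toy capacity, chosen only to keep the example small


class CubeChain:
    """Toy model of eight data cubes protected by one XOR (ECC) cube."""

    def __init__(self):
        # Assumed initialization: all zeros, so the XOR invariant holds at start.
        self.data = [bytearray(CUBE_BYTES) for _ in range(NUM_DATA_CUBES)]
        self.ecc = bytearray(CUBE_BYTES)

    def write(self, cube, addr, new_bytes):
        end = addr + len(new_bytes)
        # The data cube reads the old contents and forms the difference.
        old_bytes = self.data[cube][addr:end]
        diff = bytes(o ^ n for o, n in zip(old_bytes, new_bytes))
        # The write itself completes in the data cube.
        self.data[cube][addr:end] = new_bytes
        # The ECC cube does a Read/Modify/Write with the difference.
        for offset, d in enumerate(diff):
            self.ecc[addr + offset] ^= d

    def ecc_is_consistent(self):
        """True if every ECC byte equals the XOR of the data bytes at that address."""
        for addr in range(CUBE_BYTES):
            total = 0
            for cube in self.data:
                total ^= cube[addr]
            if total != self.ecc[addr]:
                return False
        return True


chain = CubeChain()
chain.write(2, 0, b"\x5a\xa5")
chain.write(6, 0, b"\x0f\xf0")
chain.write(2, 0, b"\xff\x00")   # overwrite an address that already holds data
print(chain.ecc_is_consistent())
```

Note that the ECC part never needs to see the other seven data parts on a write; the difference alone is enough to keep the sum current.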
Each reference request is taken from the chain as the addressed cube accepts it. This means that request traffic 'tapers' further down the chain. The result is that the ECC update requests have space in the chain to reach the ECC cube without causing additional conflicts in the chain. The result for Write requests is that the chain sees uniform bandwidth down the chain.
Error recovery, where data is uncorrectable, is a version of RAID-4, but with a difference.
In normal operation, each part uses the error detection/correction capability built into all HMC parts. If a correctable error is found within a part it is corrected directly at that point, before being returned to the same port that made the request. If a Read operation is uncorrectable an error response is returned; the recovery operation is undertaken at the next higher level.
In doing a recovery, a recovery request is sent down each port/end of the chain, also identifying the place where the chain is broken (the error response includes the number/position of the failing cube). Each cube accepts the request and also passes the request down the chain. Each cube references data at the address whose reference failed. The parts at the end of each recovery reference return their read data directly. As the data is returned back up the 'sub-chains,' the data that each respective cube holds is XORed into (equivalently, subtracted from) the returning sums. Since the ECC part holds the XOR/sum of all the data cubes at the referenced address, and the failing cube's data is the only term missing from the two returned sums, XORing those sums together yields the recovered contents of the failed reference.
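The recovery sequence can be sketched as follows, using one byte per part at the failing address. The function names and the list layout (positions 0 to 8 along the chain, with the ECC part at position 4) are assumptions for illustration only; the real operation is performed by logic in the cubes as responses travel back toward each port.

```python
from functools import reduce
from operator import xor

ECC_POS = 4  # middle of a 4 + 1 + 4 chain


def build_chain(data_bytes):
    """Lay out eight data bytes along the chain with their XOR in the middle."""
    ecc = reduce(xor, data_bytes, 0)
    return data_bytes[:ECC_POS] + [ecc] + data_bytes[ECC_POS:]


def subchain_sum(parts, positions):
    """Accumulate the XOR as the response travels back up one sub-chain.

    `positions` runs from the cube nearest the break back toward the port.
    """
    acc = 0
    for pos in positions:
        acc ^= parts[pos]
    return acc


def recover(parts, failed_pos):
    """Rebuild the byte held at `failed_pos` without ever reading that part."""
    left = range(failed_pos - 1, -1, -1)        # back toward the first port
    right = range(failed_pos + 1, len(parts))   # back toward the second port
    # The single final XOR combines the two returned sub-chain sums.
    return subchain_sum(parts, left) ^ subchain_sum(parts, right)


data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
chain = build_chain(data)
recovered = [recover(chain, pos) for pos in range(len(chain))]
print(recovered == chain)
```

The same function also rebuilds the ECC byte itself if the ECC part is the one that failed, since the XOR of all eight data bytes is exactly what the ECC part stores.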
RAID-4 is used here, rather than the more common RAID-5 because the data is accessed serially along the chain and data recovery is done along the chain so that only a single final XOR is needed. The ECC bottleneck that has made RAID-4 little used is not present here. This recovery capability is called CubeKill.
Note that a recovery operation takes roughly the same amount of time as that of a single memory reference: send the recovery requests which make all references nearly in parallel, return the recovery packets accumulating the recovery data as the requests are being returned. A recovery reference is thus about the same as a single reference to the ECC cube with respect to access latency and bandwidth.
It is certainly true that a recovery operation takes more power than a normal Read operation, but system operation can continue with no loss of performance; a running job will not know that recovery operations are being done. The power for a single recovery reference will be something like eight times the power of a normal reference. However, because those references will generally be a small part of the references to the cube chain, the power increase will generally be in the range of 15%, and that increase is needed only while recovery operations are being done, which will be a very small fraction of the time.
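To see how an eightfold cost per recovery reference can translate into an increase of roughly 15%, consider a simple linear model. The model is an assumption made here for illustration, not a result from the paper: a fraction f of references are recovery references costing eight units each, and the rest cost one unit.

```python
RECOVERY_COST = 8.0  # power of a recovery reference relative to a normal one (from the text)


def relative_power(recovery_fraction):
    """Average reference power, relative to a chain with no recovery references."""
    return (1.0 - recovery_fraction) * 1.0 + recovery_fraction * RECOVERY_COST


def implied_recovery_fraction(power_increase):
    """Fraction of references that must be recoveries to give `power_increase`."""
    return power_increase / (RECOVERY_COST - 1.0)


fraction = implied_recovery_fraction(0.15)
print(f"{fraction:.3%} of references, relative power {relative_power(fraction):.2f}")
```

Under this model, a 15% increase corresponds to only about 2% of the references to the chain being recovery references.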
CHAINED CUBE MEMORY MODULES
A chain of HMC parts pretty much begs to be placed into memory modules. Many different module configurations can be considered. The number of module ports can vary, as can the number of chains. The number of parts in each chain can vary. A couple of possibilities are shown in Figure 4.
The external links are shown paired so that chains of modules can be implemented in the same manner as the cube chains, except that the Module Manager looks to see if the reference is for its module or is to be passed down the module chain.
Figure 4: Possible Memory Modules
There are multiple benefits for this module design in which the memory parts are implemented as chains of cubes:
• Each cube can have very low power I/O links, as connection distances are a few millimeters. The Module Manager can have higher power IO capability to enable, for example, ease of placement on motherboards that then require longer connection lengths.
• The manager controls the recovery operations needed where cubes or the chain itself has failures that involve the recovery operation described above. In addition, if modules are chained (and it is expected to be the case for large systems) then the Manager does the reference to the proper data cube in order to provide the data summing needed for recovery of the failed operation.
It is also fairly easy for the manager to actually enable writing to a failed cube. This is done by having the manager do a read recovery of the data at the cube and address to be written, XOR that data with the data to be written and then send the ECC cube that update. This enables correct read recovery of the data being written. A similar set of operations-read recovery and writes to the ECC cube-enable data from a failed cube to be moved from the failed cube to the ECC cube such that the recovery operations are no longer needed, though the ECC cube is then renamed and ECC updates for the other data cubes discontinued.
• The module manager enables all modules to have the same interface functions and protocol, even in modules that have different numbers of chains and other differences.
• The same structure can enable modules to be done with NV/NAND or other type of part while keeping the same interface to the rest of the system. (Yes, there would be latency differences among other differences depending on the technology, but starting from the abstract interface that is how HMC parts interface, those differences can be made fairly invisible.)
• The module manager can have the intelligence to do additional functions like database operations, data moves including gather/scatter, searches, and other functions that will improve system capabilities and performance at the same time that system power is being reduced, because operational low-level management is done in the memory system rather than requiring lots of data movement between memory and a controlling CPU with significant cost in power.
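The manager's handling of a write to a failed cube, described in the list above, can be sketched with one byte per cube. The function names are hypothetical and the model ignores all link and timing detail; it only shows that a read recovery followed by an ECC update leaves the new value recoverable, even though the failed cube is never touched.

```python
from functools import reduce
from operator import xor


def read_recovery(data, ecc, failed):
    """XOR the ECC byte with every surviving data byte to rebuild the failed one."""
    survivors = (value for cube, value in enumerate(data) if cube != failed)
    return reduce(xor, survivors, ecc)


def write_to_failed_cube(data, ecc, failed, new_value):
    """Return the updated ECC byte for a write aimed at the failed cube."""
    old_value = read_recovery(data, ecc, failed)
    # Same difference update as a normal write, but only the ECC cube changes.
    return ecc ^ old_value ^ new_value


original = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]
ecc = reduce(xor, original, 0)

failed = 5
data = list(original)
data[failed] = None          # the failed cube's contents are never read

before = read_recovery(data, ecc, failed)
ecc = write_to_failed_cube(data, ecc, failed, 0xAB)
after = read_recovery(data, ecc, failed)
print(hex(before), hex(after))
```

Moving a failed cube's data into the ECC cube follows the same pattern: the manager performs a read recovery for each address and stores the recovered value in the ECC cube, after which ECC updates for the remaining data cubes stop.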
A chain of modules would look like the chaining figure above, except that each entity is a module rather than a single cube. The result is that whole modules can fail and system operation continues though at a higher power level as a read to a failing module is turned into a reference to each module in that chain. And it is also reasonable to have the multiple recovery operations be done such that data from the failing module is 'moved' to the ECC module so that normal operational power levels are resumed by 'renaming' the ECC module as being the failed module. The capability to keep running in the presence of failing modules in a chain is called ModuleKill.
A major benefit for chains of modules with Module Manager parts as the module's interface is that the latency for memory references does not grow with the total number of parts in a chain. For a chain of nine modules (so 4 + 1 + 4, as for the chains within an individual module, totaling 64 or 128 data cubes) that is accessed at each end, the longest path is 4 Managers and 4 cubes.
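The path length and cube counts quoted above follow from the 4 + 1 + 4 layout. The sketch below assumes that a request always enters a chain from the end nearer to its target, both at the module level and inside a module; the function names are illustrative.

```python
CHAIN_LEN = 9   # 4 + 1 + 4
ECC_POS = 4     # the middle position holds the ECC part or ECC module
DATA_POSITIONS = [p for p in range(CHAIN_LEN) if p != ECC_POS]


def hops_to(position):
    """Parts traversed to reach `position` from the nearer end, inclusive."""
    return min(position, CHAIN_LEN - 1 - position) + 1


def data_cubes(cube_chains_per_module):
    """Data cubes in one module chain: data modules * chains * data cubes per chain."""
    return len(DATA_POSITIONS) * cube_chains_per_module * len(DATA_POSITIONS)


longest_hops = max(hops_to(p) for p in DATA_POSITIONS)
# Worst case for a data reference: Managers passed plus cubes passed inside the module.
longest_path = (longest_hops, longest_hops)

print(longest_path, data_cubes(1), data_cubes(2))
```

For comparison, a single flat chain holding the same 64 data cubes would put the farthest data part dozens of hops from either end.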
If each module has eight data parts and there are eight data modules in a chain, there are 64 data parts, and it takes fewer connections between a CPU and a chain than to a single HBM part, while the bandwidth is higher. If modules are built with two chains, there are 128 data parts, but the interface to a CPU does not change at all from the single chain case, and the latency and bandwidths are the same as for the single chain.
RESILIENCY SUMMARY
• Most failures will be individual cubes. Those failures are recovered at the module chain level, so a system can have multiple failures and keep going without impact on users.
• If a whole module fails (or a link in the module chain), then each module in a module chain ('ModSet') is referenced, which increases power in the Set, but system operation continues.
Logic can be added such that data from failing cubes or modules can be reconstructed in the respective ECC cube or module. This lowers power in the presence of a failure at the cost of wanting faster maintenance because of the loss of full recovery capability.
BUILDING A FULL MEMORY SYSTEM
If eight chains of modules are put together and connected to a single CPU node then the memory size is likely to be 4 Tbytes given expected memory size per die, and could be twice that if modules of reasonable size can be done with two chains. Memory bandwidth should be on the order of 2 Tbytes per second. The total pad count may support doubling these numbers. And, of course, there are also power levels to consider, but the starting point for HMC is a lot better than any other current solution, will continue to improve, and also offers places where additional engineering work can offer significant further improvements.
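The capacity and bandwidth estimates can be reproduced under a few stated assumptions, none of which are fixed by the text: 8 Gbyte cubes (the 8-die stack mentioned earlier), eight data modules per module chain, eight data cubes per cube chain, and one link of up to 120 Gbytes/sec at each end of every module chain.

```python
CUBE_GBYTES = 8                  # assumed: 8-die stack
DATA_CUBES_PER_CUBE_CHAIN = 8    # 4 + 1 + 4, ECC cube not counted
DATA_MODULES_PER_CHAIN = 8       # 4 + 1 + 4, ECC module not counted
MODULE_CHAINS_PER_NODE = 8
LINK_GBYTES_PER_SEC = 120        # assumed: one link per chain end, at its peak rate


def node_capacity_tbytes(cube_chains_per_module=1):
    """Usable (data) capacity attached to one CPU node, in Tbytes."""
    cubes = (MODULE_CHAINS_PER_NODE * DATA_MODULES_PER_CHAIN
             * cube_chains_per_module * DATA_CUBES_PER_CUBE_CHAIN)
    return cubes * CUBE_GBYTES / 1024


def node_bandwidth_gbytes_per_sec():
    """Peak bandwidth if both ends of every module chain are driven."""
    ends_per_chain = 2
    return MODULE_CHAINS_PER_NODE * ends_per_chain * LINK_GBYTES_PER_SEC


print(node_capacity_tbytes(1), node_capacity_tbytes(2), node_bandwidth_gbytes_per_sec())
```

With these assumptions the totals come out at 4 Tbytes (8 Tbytes with two cube chains per module) and a little under 2 Tbytes per second, in line with the estimates in the text.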
The SerDes interconnect in HMC IO links is much more power and energy efficient than current DDR technology and will continue to improve. A full HMC part is designed to have multiple references in multiple links all running at the same time. The chaining that is proposed here changes that so that, except for the ECC part in the middle of a chain, there are never more than half of the number of connected links active with respect to making memory references. If a chain of cubes is implemented with a single chaining link, then the maximum power seen by a single data cube is one-quarter of a cube's max power if the part has four links (likely the most common case).
In addition, references are distributed along the chain. The result of that is that, for a single chain of 4+1+4 as shown above, the 8 data parts see single references at each reference port. If the ratio of reads to writes is 2-to-1, then 6 references, 3 at each request port (2 reads and 1 write), end up making a total of 8 references to the 9 parts as a write request is turned into two references. This means that most parts will be fairly idle most of the time. If each port in a chain is receiving fairly constant and near random references with 2-to-1 Read/Write ratios, a reference has about a 30% chance of making a particular cube busy. Reference 'hotspots' raise power to the referenced parts, but also leave the other parts even less active.
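The count of eight references can be checked directly: each read touches one data part, while each write touches a data part and also produces an ECC update. The helper name below is illustrative.

```python
def references_generated(reads, writes):
    """Cube references produced by a mix of requests on one chain.

    A read references one data cube; a write references one data cube
    and additionally updates the ECC cube.
    """
    return reads + 2 * writes


# Two request ports, each issuing 2 reads and 1 write (a 2-to-1 ratio).
ports = 2
reads = 2 * ports
writes = 1 * ports
print(reads + writes, references_generated(reads, writes))
```

Six incoming requests therefore become eight references spread over the nine parts of the chain.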
There is a power cost to move requests and responses up and down the chain and through the logic in each cube through which the request passes. This is a very good reason to keep pressure on reducing power in the SerDes interface, and that is happening. There is also something at the end of this paper that may point to a non-obvious way to further lower power.
The large HMC memory as described here will be expensive. It is also true that the memory provides features, including the opportunity to grow easily into the future, that at the same time lower system costs, as the memory provides benefits over and above the simple cost-per-bit numbers often used in setting expectations for the cost of the memory portion of a system. If the memory enables running jobs that are four times larger than they can otherwise be, in half the run time of smaller jobs in an inadequate system, and with a power level that is half of any other solution, how much is that worth?
Back to Multiple Memory Levels?!
As just stated, a petabyte memory system will be expensive if implemented completely using the memory modules suggested above. A way to push things further, so that even larger memory capacities are possible with fewer components and further reduced power levels, is to use NAND as another layer of memory available to each node in a system. Doing this is not free, because it will require additional software complexity to manage and will very likely cause additional work for application writers to control optimum data placement.
While the additional non-volatile (NV)/NAND memory is slower than the DRAM layer, there are multiple benefits that result from the additional capacity and the persistence capability of this additional memory. Enabling local check-pointing is only a first pass at the multiple ways that the added memory can be used. A simple example is the ability to prefetch and dump data that is needed as input and output. Also, if the intelligence of a memory system is increased as suggested above, it will be possible to keep the huge matrices needed for high-end science in the NV memory and then move data to DRAM for use, with operations like matrix transpose done as part of the Move to support the different reference directions needed for good cache performance.
An ideal memory system, at least for the next few generations of large systems should have three levels:
• Stacked SRAM on top of each CPU. This keeps local latency low and should support scratch memory, which is needed for further performance improvements and even better power reductions as a result of reduced traffic to and from main memory. (It is possible to make fairly low latency DRAMs serve that purpose, but it may not be economical to do so because of the likelihood of fairly low volumes.) Discussion of the possibilities of large local memories is welcome.
• Chained-Cube DRAM memory modules per the above.
• Non-Volatile memory. The memory should be packaged the same as the DRAM, including having the same physical and protocol interface. Likely want something like 4 or 8 times the size of the DRAM memory.
Implementing multiple memory levels is certainly not trivial on the hardware side, and raises multiple software aspects that will need upgrades over current capabilities. But it is also true that the result should provide enough improvement in system and application performance, as well as reduced system size and power, to fully justify the effort. The ability to have memory systems and memory performance on a scale matching the problems we are coming to grips with will be more than excellent.
SerDes Trick?
SerDes interfaces are becoming increasingly popular because they offer high bandwidths with fewer wires and pads. However, most current SerDes implementations keep the interface active at all times to maintain clock synchronization. Clock information is encoded in each lane of a data stream so even all-zero data sees lots of data transitions so that clock information can be encoded and then recovered in those transitions. This means significant power is needed even if no user or system data is being sent through the link.
Even a memory system that is busy on average sees fairly long periods where a good portion of the memory is not busy at all, so it is desirable to reduce the power levels of un-busy links while being able to quickly bring a power-reduced link back to full performance. Currently, if clock sync is lost (for example, if a link is powered down to reduce power and then brought back up when the link's bandwidth is again needed), the sync 'training' takes a fairly long while, such that in most cases a link that is not 'earning its way' for a period of time is simply kept active to avoid the activation delay. The resulting increase in average power is a powerful incentive to find ways to reduce those levels.
There is a circuit trick that I believe enables inactive SerDes links to have a very low power state when there is no data being sent, but to then become active again at full bandwidth, as needed, with little to no delay. If interested in the capability see me: Dave Resnick. Understand that there will be some engineering development needed to make the capability real again, and what needs to be done is essentially analog. With respect to the 'make the capability real again' remark in the previous sentence: the idea was demonstrated many years ago (1987) after being published in a book (F.D. Waldhauer, Feedback, Wiley 1982) and has apparently been lost. I would like to try to resurrect this capability, as the power benefit should be significant.
REFERENCES
