As high-performance computing (HPC) moves into the exascale era, computer scientists and engineers must find innovative ways of transferring and processing unprecedented amounts of data. As the scale and complexity of the applications running on these machines increases, the cost of their interactions and data exchanges (in terms of latency, energy, runtime, etc.) can increase exponentially. In order to address I/O coordination and communication issues, computing vendors are developing an intermediate layer between compute nodes and the parallel file system composed of different types of memory (NVRAM, DRAM, SSD). These large scale memory appliances are being called 'burst buffers.' In this paper, we envision advanced memory at various levels of HPC hardware and derive potential use cases for how to take advantage of it. We then present the challenges and issues that arise when utilizing burst buffers in next-generation supercomputers and map the challenges to the use cases. Lastly, we discuss the emerging state-of-the-art burst buffer solutions that are expected to become available by the end of the year in new HPC systems and which use cases these implementations may satisfy.
Introduction
Advanced levels of memory hierarchy, also called "burst buffers", have the capacity to enhance and accelerate scientific applications on high-performance computing infrastructures. However, in order to utilize burst buffers at different levels throughout the highperformance computing system, there are a number of challenges that must be addressed.
In this document, we highlight specific areas of concern for utilizing advanced memory hierarchies and illustrate the focal points of burst buffers as they apply to specific use cases. This document is not meant to propose solutions to these challenges, but rather to point out where they exist in the system.
In order to understand the High-Performance Computing (HPC) system with burst buffer capabilities, we formed an Abstract Machine Model (AMM). We based our initial model on the existing architecture of a traditional HPC resource, and then added memory appliances at several layers in the model, e.g., Node-Local, Board-Local, Intermediate Storage, and I/O Subsystem, as seen in Figure 2 . We use the word "burst buffer" in this document in order to have a general, consistent way of referring to memory at different levels of the system. It is important to note that different names may be given to these memory appliances in the future, which should only affect the naming convention expressed herein, but not the main ideas conveyed.
In addition to the AMM, we present a Memory Hierarchy structure in Figure 1 . In this figure, memory is assumed to be available in larger quantities as you move away from the node main memory (at the top), but the cost of reading/writing to these farther away structures is much higher than reading and writing directly from and to the node memory. As part of our investigation, we have identified 3 primary areas of the HPC system that the utilization of multi-tiered burst buffers will have an impact on: (i) Applications, (ii) Programming System / Runtime Environment, and (iii) the underlying Operating System. It is important that an application or operating systems programmer is aware of the cost and overheads associated with utilizing additional memory appliances.
The rest of the document is organized as follows: Section 2 provides the motivating use cases for advanced memory hierarchies. Section 3 illustrates specific challenges and considerations that arise in offering a multi-tired burst buffer approach to support multiple applications and multiple users on an HPC system. Finally, in Section 4, we create a table that applies the general challenges identified in Section 3 to the specific use cases identified in Section 2. The specific application of the general challenges to use-case-specific scenarios provides an opportunity to identify further areas of research that must be explored in order to realize advanced memory hierarchies into future architectures. 
Burst Buffer Use Cases
The following explains different use cases which can be enhanced by the use of burst buffers. The list of use cases is loosely based on slides from Lawrence Livermore National Laboratory [4] and Sandia National Laboratories [6] . This section aims only to discuss the data flow for each scenario, without mentioning specific locations where the data is stored. In further sections, we will discuss how and where burst buffers can be utilized, as well as what considerations apply specifically to the given use case.
1. Checkpoint-Restart: Applications periodically write 'checkpoints' of their data, i.e., snapshots of data structure or program progress. If a failure occurs (either node or data error), the applications are able to restart using a known 'good' checkpoint.
2. In-Situ/In-Transit Analytics/Visualization: Data coming from simulations can be processed while it remains in the system. Also useful for visualization in the case where a node shares both a CPU and a GPU. Critical results or animations may be drained to parallel file system if they will be required at the end of the workflow.
3. Accelerated Reads (aka Prefetching ): Stage data to 'closest' memory in advance of when it will be read.
Out-of-Core:
The main memory on a node is not enough for the application that is currently running. As a result, advanced levels of memory hierarchy may be utilized, however, with the understanding that the 'main memory' is still the fastest and 'best' option.
5.
Staged Writes (for Post-Processing): Files that already exist in the parallel file system (PFS) need to be processed on a compute node or a GPU at a later time than the time they were processed.
6. Application/Workflow Coupling/Exchange: Scientific workflows with coupled application components must share, exchange, modify, and communicate data to one another during runtime.
Burst Buffer Challenges and System-Level Considerations
In this section, we introduce generalized 'burst buffer'-specific issues that arise when considering advanced memory capable HPC systems. (d) Application/Workflow Connectivity: How does a workflow or application obtain access to a given memory allocation? Is it exposed as a block of a filesystem? Does it follow shared-memory models?
(e) Data Privacy: How do we ensure that data that was once stored on burst buffer resources is no longer available once an application/workflow has deallocated the resource (e.g., encrypted storage of data)?
Priority (PR)
(a) Allocation Priority: Can we define policies for the shared access between burst buffers? For example, do we provide the ability for a node or an application to access another node's on-node burst buffer? Does application running on compute node have full priority to on-node burst buffer? Do use cases have different priorities to specific levels of memory based on their importance to overall application/workflow? Can an already allocated memory space be overtaken by a higher priority user/application? If so, how does system adjust? Can two different users share an allocation on a single compute node? For example, if one application is shutting down and using half of it, can pre-staging for the next application occur in the other half of the burst buffer? How do we keep track of what usage mode (use case) we are using in order to enforce priority? What component enforces the priority?
(b) Higher Priority as Billable Feature: Should we allow machine users willing to pay more a greater priority over certain memory appliances, possibly closer to their computation?
(c) Eviction: What happens to data when a higher priority use case or allocation takes over? Can system get rid of unused data, or should all data be evicted to another memory location (possibly further away)? in a burst buffer after application or workflow shut down? Should it just be immediately deleted since the workflow is over and one can assume if it was not consumed, it is not relevant? Or does it need to be flushed to a permanent storage solution, such as parallel file system? Who is responsible for transferring the data to the parallel file system? How is it stored -does all data in memory get written to a specific directory in the user's home directory? (c) Garbage Collection: How often should system search burst buffers for dirty bits/memory and clear them? Does system need to keep track of all levels (and locations) of memory that are touched by each application or workflow? After workflow shut down (and any flush to permanent file system), does the system move systematically through each level of memory to clean up data associated with the application or workflow? Is this triggered at application shutdown or periodically or when storage is close to filling up?
Consistency (CON)
(a) Multiple Read/Write Requests to Burst Buffer : Does each burst buffer need multi-threaded controller code in order to handle large magnitudes of read/write requests? (b) Burst Buffer Guarantees: What consistency policies will user be guaranteed using burst buffers? Do these policies differ at different burst buffer levels? How will we ensure burst buffer constraints are not violated? How do we ensure the correctness and validity of transactions between applications, burst buffers, and file system? 6. Coordination (C) (a) Data Movement: How will data be transferred throughout the different layers of memory? Can we quantify the overheads for transferring between levels? What system will be responsible for transporting data using RDMA? (b) Access Coordination: What facilities will be provided by burst buffers for coordination of access (e.g., shared mem/token exchange)? What facilities will be provided by higher software layers (e.g., locking)? (c) Coordination of Burst Buffer : How to coordinate/compose multiple burst buffers belonging to same application? How to expose the composition of burst buffers to user? How to decide which physical burst buffer to store data in when multiple burst buffers exist?
4 Table   Please see the following pages for a table applying 3) PR: Allocation Priority: If some workflow components are already running and writing to BB, priority to BB should be given to the rest of the applications comprising the workflow. 4) CON: Locking for global and local data is very important in a coupling scenario, to preserve versioning information, avoid overwriting data, and prevent tasks from processing the same regions 5) PST: Data Lifetime: Data comprising a workflow must persist until all components of the workflow intending to use that data have completed. This means data cannot be erased after one component of workflow completes. 6) PST: Flush: If workflow ends and data is still in BB, how to determine if this data is relevant to be stored to PFS or if it should be deleted? 7) PST: Garbage collection: Triggered after flush of data, must clean any BBs that workflow accessed 8) CON: Multiple Read/Write Requests to BB / Will requests need to queue if maximum threads are reached? 9) C: Data Movement: Movement from different levels of memory must be coordinate based on the workflow access patterns 10) C: Coordination of Burst Buffer: How can multiple levels of burst buffers be exposed to workflow? How does OS decide best physical location or memory layer to store data in the presence of multiple BBs?
In this section, we discuss the burst buffer systems developed by two top High-Performance Computing vendors, Cray and Data Direct Networks (DDN). Both have similar hardware implementations in that the extra memory appliance they have created is comprised of solid-state drives (SSD) and acts as an intermediate layer between the compute nodes and the underlying file system. In these first-generation implementations, burst buffers are exposed to the end user as mountable filesystems, or, alternatively, I/O is autonomically managed by the underlying controller or operating system. As burst buffers become more widely adopted, it may be beneficial to expose more features to the end user and/or software developers in order to fully utilize their capabilities in the use cases mentioned in Section 2. Cray DataWarp [1] utilizes flash SSD I/O blades with the Aires high-speed interconnect network. In DataWarp, the burst buffer can be used as a global storage cache for the parallel file system. In this usage mode, its focus is on I/O acceleration and optimization across the machine. According to Cray, DataWarp users can allocate the type and amount of data storage they need, in addition to defining the I/O movement on a per job, per process, per rank, or per node scale. Using the machine queuing system, such as SLURM [5] , users can make a persistent reservation of burst buffers, i.e., a separate reservation from a job reservation -the two are independent of one another in this usage mode. In the production release of DataWarp, data striping across multiple SSD nodes is possible. Storage is dynamically allocated, meaning that a user has the option to interact with their burst buffer reservation as if it were a 'scratch' file system (e.g., looks the same as a mountable filesystem), or can set up a burst capability which is local to the compute nodes for faster Checkpoint-Restart (e.g., acts more like a cache). It could also be utilized by the underlying operating system to capture 'bursty' application behavior in order to optimize the usage of the parallel file system (PFS) and minimize network traffic.
Of the use cases discussed in Section 2, we have identified the following areas where we envision DataWarp being utilized in scientific application workflows. The first such use case is Checkpoint-Restart. In checkpoint-restart, the Intermediate layer (burst buffer) could hold checkpoint data and only bleed the larger checkpoints to the PFS when it is most efficient and does not overwhelm the filesystem or network. Additionally, keeping the checkpoint data in the SSD rather than writing to PFS makes potential restarts faster, especially since the allocated SSD storage can persist even when a job fails. We also envision DataWarp as supporting out-of-core use cases, since it can dynamically allocate more memory in order to accommodate 'bursty' applications. It can also utilize such extra memory to ensure peak performance of the PFS. Additionally, utilizing some type of 'controller' program, the SSD could potentially also be used to pre-fetch data from PFS to SSD before a job starts up. From the data that has been released so far, it is unclear if the underlying operating system will support prefetching itself. However, if not, a controller program or user service could run on a compute node or in user space that initiates data transition from PFS to SSD after reserving a persistent DataWarp area but prior to a job running.
Data Direct Networks has also released its own burst buffer solution, called the Infinite Memory Engine (IME) [2] . In contrast to DataWarp, DDN IME can be added to existing machines. It is also implemented as an intermediate array of memory between compute nodes and the filesystem. When this intermediate memory is SSD-based, DDN recommends a successful burst buffer solution would have 2 to 3 SSDs per compute node. The utilization of IME is independent of the high-speed interconnects in the machine. This means that DDN can connect to InfiniBand, Cray Aires, or other networks. In addition, IME accommodates wide ranges of compute node vendors and storage vendors, making it as modular as possible so users can select what they need for their specific existing systems. IME utilizes its own controller to manage the coordination and communication with the SSDs. Similar to DataWarp, the memory is exposed again in a 'hands-off' manner to the user; it is built upon the principle that applications need not be modified in order to achieve I/O acceleration using IME and can be mounted with a filesystem view to the application. Additionally, IME automatically detects when the I/O network is flooded and captures some of the traffic in order to optimize the utilization of the high-speed interconnects. If necessary, it communicates with the PFS in order to store persistent information to files. Lastly, IME enables post-processing and visualization/analysis in multiple scientific applications by allowing the different coupled components to manipulate common datasets in real-time. Although this solution has not been fully deployed, DDN has released information that using IME the unmodified scientific application S3D [3] (a well-known model for turbulent flow) was run with and without IME services and experienced 3 orders of magnitude I/O acceleration and 2 orders of magnitude PFS acceleration when utilizing IME [2] .
Similarly to DataWarp, IME would be useful in the Checkpoint-Restart use case (for the same reasons as mentioned above). Because of its ability to self-manage high network traffic, it would also be suitable for out-of-core applications and workflows. It's touted ability to facilitate fast post-processing via analysis, manipulation, and workflow processing would also be extremely valuable. If intermediate data is kept in the IME storage appliance, post-processing applications would be able to read this information more quickly and only write to PFS the end results that the domain scientist is interested in. Lastly, with these current specifications, DDN IME could support different workflow/application coupling techniques, with support for ensembles, visualization, and rapid exchange of possibly common data structures.
Conclusion
In this paper, we have proposed several use cases for the next-generation memory appliances, often called 'burst buffers,' that are emerging in hardware. In the current state of the art offerings, this burst buffer is considered only an Intermediate Storage between compute nodes and PFS. However, we have discussed how similar storage appliances, mixing various memory options (e.g., DRAM, NVRAM, SSD, etc.), would be useful at several different areas of the HPC architecture. We then illustrate possible use cases for this extended memory architecture, as well as point out issues that may arise when supporting such use cases. Lastly, we included a discussion of current state of the art solutions. It is important to note that this discussion was based upon the currently available information. Some information may still be proprietary, and other vendors may release competitive memory appliance solutions. It is also important to note that the ability of current solutions (i.e., DataWarp and IME) to support some of the use cases discussed in Section 2 does not necessarily mean that this Intermediate layer can be considered a complete solution. Certainly, more complex storage schemes and data scheduling can be achieved with the memory views described in Figure 1 and Figure 2 . As 'burst buffers' become more widely adopted in the supercomputing community, we can build a more complete picture of the direct interaction between the hardware and our scientific workflow use cases.
