Abstract-To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
I. INTRODUCTION
Thanks to Prof. Bruce Jacob of the University of Maryland for input on NVRAM trends and to Mike Levenhagen for their participation in some of our discussions.

Significant changes to node and system architectures will be required to reach exascale performance goals. Because software refactoring naturally lags hardware architectural change, the idea of co-design has been gaining traction as a way to reduce this lag and to simultaneously design new architectures and the software algorithms and applications that will utilize them. We propose a co-design methodology that uses Abstract Machine Models (AMMs) to help software developers prepare for next-generation architectures. AMMs will provide software developers with sufficient detail
of architectural features to enable them to begin tailoring their codes for these new high performance machines by reasoning about data structure placement in the memory system and the location where computational kernels may be run. While more accurate models will permit greater optimization to specific hardware, a more general approach will address the more pressing issue of initial application and algorithm re-development that will be required for future computing systems. Once initial application data structures and algorithms have been formulated, further refinements on the models can be used as a vehicle to optimize the application. These models offer the following benefits to the research community:

Community engagement: Abstract machine models are an important means of communicating to application developers the nature of future computing systems so they can reason about how to restructure their codes and algorithms for those machines.

Design space exploration: The AMM is the formalization of a particular parametric model for a class of machines that expresses the design space. The purpose of each of the presented models is to abstractly represent many vendor-specific hardware solutions, allowing the application developer to target multiple instantiations of the architecture with a single, high-level logical view of the machine.

System software development: A high-level representation of the machine enables design of automated methods (runtime or compile time) to efficiently map an application and algorithm onto the underlying machine architecture.

In the face of a large number of hardware design constraints, industry has proposed solutions that cover a wide design space. These solutions range from systems with many ultra-low power processor cores executing at extreme scale to achieve high aggregate performance throughput to large, powerful processors that require smaller scales but provide performance at much higher per-node power consumption.
Each of these designs blends a set of novel technologies to address these design constraints. To assist software developers in reasoning about the disparate ideas and solutions, we define an overarching AMM (Section II-A) designed to capture many of the proposed ideas in a single, unified view. Then we present a family of abstracted models in Section II-B through Section II-F that reflects the range of more specific architectural directions being pursued by contemporary CPU designers. Each of these models is presented in sufficient detail to support many uses, from application developers becoming familiar with the initial redesign of their applications, to system software developers and system architects who are exploring the potential capabilities of future computing devices.
In addition to presenting abstract machine models (Section II), in Section III we discuss memory architectures and how changes in process technology, bandwidth, latency, coherence, and per-core memory size affect projected future memory designs and their impact on application development. Section IV discusses programming impacts that generally pertain to all of the abstract machine models presented. Finally, we conclude in Section VI.
II. ABSTRACT MACHINE MODELS FOR ALGORITHM DESIGN
To reason effectively about application and algorithm development on future exascale-class compute nodes, we develop several AMMs that represent only the subset of machine attributes that will be important for code performance. By simplifying the myriad complex choices required to target a real machine, an AMM can be used by application developers as a model in which to frame their algorithms [13] independent of specific hardware parameters [12]. To keep the models concise, we take one of three actions for each hardware design choice:

Ignore it: If ignoring the design choice has no significant consequences for power consumption, computational performance, or application reasoning, we eliminate the feature from the model (e.g., specific single-instruction, multiple-data (SIMD) vector widths). Such details are provided in the proxy architecture annotation of the model.

Abstract it: If the specific details of the hardware design choice are understood well enough to provide an automated mechanism to optimize a layout or schedule, an abstracted form of the choice is made available (for example, register allocation has been successfully virtualized by modern compilers).

Expose it: If there is no clear mechanism to automate decisions but there is a compelling need to include a hardware choice, we explicitly expose it in our abstract machine model, deferring the decision of how best to utilize the hardware to the system software and application developer (e.g., multiple types of memory require specific data structure placement by either the OS or the application).

For the models that follow, it is important to note that we do not fully describe the coherency aspects of the memory subsystems. These systems will likely differ from current coherency schemes, and coherence may be software-based or hardware-supported.
Similarly, we expect network interfaces to be integrated into the processor, but we leave the specifics of the network topology to further evolution of our models since, in current practice, very few algorithms are designed or optimized to a specific network topology.
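The three actions above can be read as a small decision procedure. The following is a minimal sketch, assuming each design choice can be summarized by two yes/no judgments; the `DesignChoice` type, its `significant` and `automatable` flags, and the flag values assigned to each example are illustrative assumptions, not part of the models themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    ABSTRACT = "abstract"
    EXPOSE = "expose"


@dataclass(frozen=True)
class DesignChoice:
    name: str
    significant: bool  # matters for power, performance, or application reasoning
    automatable: bool  # a compiler or runtime can manage it without developer input


def classify(choice: DesignChoice) -> Action:
    """Apply the three rules in order: ignore, then abstract, then expose."""
    if not choice.significant:
        return Action.IGNORE
    if choice.automatable:
        return Action.ABSTRACT
    return Action.EXPOSE


# Hypothetical flag assignments mirroring the three examples in the text.
choices = [
    DesignChoice("SIMD vector width", significant=False, automatable=True),
    DesignChoice("register allocation", significant=True, automatable=True),
    DesignChoice("multiple memory types", significant=True, automatable=False),
]
decisions = {c.name: classify(c) for c in choices}

for name, action in decisions.items():
    print(f"{name}: {action.value}")
```

In practice the two judgments are matters of degree rather than booleans; the sketch only makes the order of the tests explicit.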
A. Overarching Abstract Machine Model
We begin with a single model that highlights the anticipated key hardware architectural features that may support exascale computing. We give a short description of this model, then present the most plausible set of realizations of the single model that are viable candidates for future supercomputing architectures and show how application programmers might use these models for algorithm development. Figure 1 shows an AMM of our projected exascale node architecture. It is likely that future exascale machines will feature heterogeneous nodes composed of a collection of more than a single type of processing element. Fat cores that are found in many contemporary desktop and server processors are characterized by deep pipelines, a small number of hardware threads, a multi-level memory hierarchy, hardware to support instruction-level parallelism and other architectural features that prioritize serial performance and tolerate expensive memory accesses. The alternative type of core that we expect to see in future processors is a thin core that features a less complex design that uses less power and physical die space. By utilizing a much higher count of the thinner cores a processor will provide high performance if a greater degree of parallelism (e.g., thread-level) is available in the algorithm being executed.
Application programmers will need to consider the uses of each class of core. A fat core will provide the highest performance and energy efficiency for algorithms/code characterized by high ILP or complex branching schemes leading to thread divergence, while a thin core will provide the highest aggregate processor performance and energy efficiency where parallelism (e.g., thread-level) can be exploited, branching is minimized and memory access patterns are coalesced. The need for more memory capacity and bandwidth is pushing node architectures to provide larger memories on or integrated into CPU packages. This memory can be formulated as a cache or, alternatively, can be a new level of the memory system architecture. Scratchpad memories (SPMs) have been shown to be more energy-efficient, have faster access time, and take up less area than traditional hardware cache [17] . On-chip SPMs will become more prevalent and programmers will be able to configure the on-chip memory as cache and/or scratchpad memory, allowing initial legacy runs of an application to utilize a cache-only configuration while application variants using SPM are developed.
A fundamental difference from today's node/system architecture will be the loss of conventional approaches to processor-wide hardware cache coherence. Conventional cache coherence scales poorly and consumes significant power [17], [11], [16]. Kaxiras and Keramidas [11] provide quantitative evidence that cache coherency creates substantial additional on-chip traffic and suggest forms of hierarchical or dynamic directories to reduce it, but such directory-based alternatives still have limited scalability [22].
Hybrid protocols that combine self-invalidation of cache with flexible cache partitions are more promising and have shown large improvements over hardware cache coherency at 64 cores and above [4].
Many of the vendors in FastForward, DOE's industry-led, node-level advanced technology development program (http://www.exascaleinitiative.org), have adopted various forms of hierarchical coherence models (islands, coherence domains, or coherence regions, depending on the vendor implementation), which are reflected in our AMM diagram under the unified moniker of "coherence domains". It is likely that the fat cores will retain the automatically managed memories now familiar to developers, but thin cores may be grouped into several coherence domains that enable localized automatic management, with the programmer responsible for explicitly moving data between incoherent domains. In the extreme, there may be almost no automatic management of memory for these thin cores, leaving the full burden on the developer. Besides programming difficulties, regional coherence leads to varying access latencies. Thus, to achieve good performance, some layer of the software stack must be aware of the topology.
Current multicore processors are connected in a relatively simple all-to-all or ring network. As core counts increase, these networks do not scale and will give rise to more complex network topologies. Unlike the current on-chip networks, these topologies will stress the importance of locality and force the programmer to be aware of where data is located on-chip to achieve optimal performance.
The network interface controller (NIC) is the gateway from the node to the system level network, and the NIC architecture can have a large impact on the efficiency with which communication models can be implemented. A custom NIC that integrates the network controller onto the chip to reduce power is expected to also increase messaging throughput and communication performance [3] , [20] . Applications that send small and frequent messages are likely to benefit from such integration.
One important aspect of the AMM that is not directly reflected in Figure 1 is the potential for non-uniform execution rates across the many billions of computing elements in an exascale system. Performance heterogeneity will be manifested from chip level all the way up to system-level. This aspect is important because contemporary parallel computing infrastructures are largely optimized for bulk-synchronous execution models that implicitly assume that every processing element is identical and operates at the same performance. Even for systems with homogeneous computation on homogeneous cores, new fine-grained power management makes homogeneous cores look heterogeneous [18], [5]. Non-uniformities in process technology and near-threshold-voltage for ultra-low-power logic also create non-uniform operating characteristics for cores on a chip multiprocessor [10]. Consequently, we can no longer depend on homogeneity, which presents an existential challenge to bulk-synchronous execution models.
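To make the cost of non-uniform execution rates concrete, the sketch below models a single bulk-synchronous step in which every core must wait at a barrier for the slowest one. The execution rates and work sizes are hypothetical values chosen only for illustration, and the model ignores communication and barrier overhead.

```python
def bsp_step_time(work_per_core, rates):
    """Time for one bulk-synchronous step: all cores wait for the slowest."""
    return max(w / r for w, r in zip(work_per_core, rates))


def parallel_efficiency(work_per_core, rates):
    """Fraction of the available core time spent on useful work in one step."""
    step = bsp_step_time(work_per_core, rates)
    busy = sum(w / r for w, r in zip(work_per_core, rates))
    return busy / (step * len(rates))


# Hypothetical: four nominally identical cores, one throttled to 80% speed.
rates = [1.0, 1.0, 1.0, 0.8]

# Equal work per core, as a bulk-synchronous code would usually assign it.
equal_split = [100.0] * 4

# Work assigned in proportion to each core's rate (assumes rates are known).
total = sum(equal_split)
proportional = [total * r / sum(rates) for r in rates]

print("equal split efficiency:", round(parallel_efficiency(equal_split, rates), 3))
print("proportional efficiency:", round(parallel_efficiency(proportional, rates), 3))
```

With the equal split, the three full-speed cores idle while the throttled core finishes, so efficiency drops to about 0.85; the proportional split recovers it, but only if the rates are known and stable, which is exactly what fine-grained power management makes difficult.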
Abstract Model Instantiations
Given the overarching model, we now highlight and expand upon key elements that will make a difference in application performance.

B. Homogeneous Many-Core Processor Model
In a homogeneous many-core node (Figure 2), a series of processor cores are connected via an on-chip network. Each core is symmetric in its performance capabilities and has an identical instruction set architecture (ISA). The cores share a single memory address space and may have small, fast, local caches that operate with full coherency. They may also implement simultaneous multithreading (SMT), out-of-order instruction execution, or SIMD short-vector units. Per-core clock and voltage domains will enable an application developer or system runtime to individually set performance or energy consumption limits on a per-core basis. Like a system-area interconnect, the on-chip network may "taper" and vary depending on the core pair and network topology. Depending on the programming model, communication may be explicit or largely implicit (e.g., coherency traffic).
Programming Considerations: This model closely resembles the arrangement of contemporary multicore processors, so continuity of programming models from traditional multi-core processors is important. It is likely that current algorithms and conventional programming models (e.g., those supported by MPI and OpenMP) will map directly, with little initial change. However, the application or algorithm developer can use this model to investigate how disruptive performance variability will be for their current approach. Specifically, MPI library developers may investigate how intranode shared memory communication mechanisms will adjust to higher variability in on-chip network locality. Or, a domain scientist may use this model to consider how to account for increased per-core performance heterogeneity when operating within future power and energy constraints. In addition, the homogeneous many-core CPU needs massive parallelism to fully use the hardware threads. When parallelism is lacking, application developers can either look for finer-grained parallelism or adopt the emerging task parallelism approach [2].

C. Multicore CPU with Discrete Accelerators Model
This model couples a homogeneous multi-core processor (Figure 3) with a series of discrete accelerators. The processor contains a set of homogeneous cores with symmetric processor capabilities that are connected with an on-chip network. Each core may optionally utilize multi-threading capabilities, on-core caches, and per-core power/frequency scaling. Each discrete accelerator is located in a separate device and features an accelerator processor that usually has a different ISA than the multi-core CPU and may be thought of as a throughput-oriented core with vector processing capabilities. The accelerator has a local, high performance memory, which is physically separate from the main processor memory subsystem.
Note that this approach exists today (OLCF Titan, LANL Roadrunner) but is expected to become obsolete, superseded by the integrated multi-core CPU and accelerators model (Section II-D).
Programming Considerations: This model also has numerous existing instantiations in systems with conventional multicore processors coupled with GPGPUs or other accelerators; many application developers are already in the process of adapting their algorithms to support this model. Much work has been done to design accelerator-aware programming models that can efficiently map those algorithms to this arrangement of multiple ISAs, different core-performance characteristics, and disjoint memory spaces. However, in the exascale time frame we see this model being replaced by the integrated multicore model (Section II-D). This has limited the adoption of this AMM by many multi-physics application teams.

D. Integrated CPU and Accelerators Model
An integrated processor and accelerator model (Figure 4) combines potentially many latency-optimized CPU cores with many accelerators in a single physical die, allowing for potential optimization through accelerator offloading. The important differentiating aspect of this model is a shared, single coherent memory address space that is accessed through shared on-chip memory controllers. While this integration will greatly simplify programming, latency-optimized processors and accelerators will compete for memory bandwidth, and the accelerators will still have a different ISA than the CPU cores.
Programming Considerations: The programmer is faced with the same challenges associated with potential per-multicore performance variability, increasing on-chip locality concerns, and targeting multiple ISAs. The unification of the multi-core and accelerator memory spaces can significantly reduce the overhead and latency associated with explicitly managing and transferring data between the two and will provide the programmer with the opportunity to use efficient in-memory synchronization constructs. But it will also introduce increased contention and bandwidth demands on the single memory subsystem that the algorithm and runtime system designers will have to account for.
E. Heterogeneous Multicore Model
This model features potentially many different classes of processor cores integrated into a single die. All processor cores are connected via an on-chip network and share a single, coherent address space operated by a set of shared memory controllers. The cores will likely have a common ISA, performance capabilities, and design, with the core designers selecting a blend of multi-threading, on-chip cache structures, short SIMD vector operations, and out-of-order/in-order execution. Thus, application performance on this architecture model will require exploiting different types and levels of parallelism. Figure 5 provides an image of this design with two classes of processor cores.
Programming Considerations: The main difference between the current model and the integrated multi-core CPU and accelerator model (Section II-D) is one of programming concerns: in the heterogeneous multi-core model each processing element is an independent processor core that can support complex branching and independent threaded, process or task-based execution. In the integrated multi-core with accelerator model, the accelerators will pay a higher performance cost for heavily divergent branching conditions and will require algorithms to be written for them using data-parallel techniques. While the distinction may appear subtle, these differences in basic hardware design will lead to significant variation in application performance and energy consumption profiles depending on the types of algorithm being executed, motivating the construction of two separate AMMs.
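The divergence penalty that separates the two models can be illustrated with a deliberately simplified cost model. The sketch assumes an accelerator that runs a group of data elements in lockstep and therefore executes every branch path that any element in the group takes, whereas independent cores each execute only their own path. The function names and cycle counts are hypothetical and do not describe any particular device.

```python
def lockstep_cost(takes_branch, cost_if, cost_else):
    """Cycles for one lockstep group: each path present in the group runs once."""
    cost = 0
    if any(takes_branch):
        cost += cost_if
    if not all(takes_branch):
        cost += cost_else
    return cost


def independent_cost(takes_branch, cost_if, cost_else):
    """Cycles when each element runs on its own core: the slowest path taken."""
    return max(cost_if if taken else cost_else for taken in takes_branch)


# Hypothetical path lengths, in cycles.
COST_IF, COST_ELSE = 10, 30

uniform = [True, True, True, True]
divergent = [True, False, True, False]

for label, pattern in (("uniform", uniform), ("divergent", divergent)):
    print(
        label,
        "lockstep:", lockstep_cost(pattern, COST_IF, COST_ELSE),
        "independent:", independent_cost(pattern, COST_IF, COST_ELSE),
    )
```

Under these assumptions the two execution styles cost the same when every element takes the same path, and the lockstep group pays for both paths once the elements diverge, which is why data-parallel formulations that minimize divergence suit the accelerator model.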
F. Performance-Flexible Multicore-Accelerator-Memory (MAM) Model
The MAM model is one conceptual design for a future processor based on new approaches to achieve a higher percentage of peak performance, including support of a greater variety of specialized function units. The processor features many heterogeneous, highly multi-threaded CPU cores, a hierarchical internal network and an internal NIC with multiple network connections, allowing multiple memory channels and extremely high connectivity to local and remote processor memories throughout the system. Each thread in a core is given a portion of private (by default), local on-chip memory (Mem) that can be shared with other threads. This memory will likely be hybrid, implementing both scratchpad and cache. When this memory is shared between threads, only a single copy of any shared data is kept, eliminating the need for hardware-managed coherence, but requiring some support for synchronization.
The MAM model also implements at least two different kinds of high-performance accelerators: vector units and move units (data movement engines) that are integral parts of each core. The vector acceleration units can execute arbitrary length vector instructions, unlike conventional cores that execute short SIMD instructions. These units are pipelined to support multiple vector instructions that execute simultaneously with performance largely independent of vector length. Data movement engines support operations such as multi-dimensional matrix transpose. Both the vector and move accelerators can perform their functions independent of the CPU thread processes or can be directly controlled by and interact directly with executing threads. By providing separate memory blocks, vector units have low latency memory access and execution independent of the processor cores.
Thread execution in each core is organized into blocks of time. An executing thread can have a single clock in an execution block or it can have multiple clocks. This enables the number of threads in execution and the execution power of threads to vary depending on application requirement. Thread scheduling will enable threads to be transitioned (at low cost) in and out of execution based on the latency of operations in order to support high execution throughput and hide latency.
Programming Considerations: This model provides many new opportunities to the programmer that are not present in the other AMMs. It will support intense parallelism, but it will also allow the user to balance the trade-offs between using fewer, more powerful threads or more, less capable threads of execution. The developer will also have improved inter-thread communication and synchronization methods; e.g., thread-level data and private local data can be easily shared, and local memory with scratch and cache capabilities can serve as sources and destinations for multiple threads. In addition, high-level vectorization is provided and can use many single-pipe vector units or fewer multi-pipe units, as needed by the application. Programmers can also use the processing-in-memory units that handle general data movement operations: transfers to and from main memory, gather/scatter, matrix transpose, sparse matrix construction, etc. Both the vector and move units can run independently of, and thus cooperatively with, threads running in the same core. Finally, application performance will benefit from high internal network and memory bandwidth for local data sharing and from access to memory on remote nodes that does not require interruption of the remote process. This may change fundamental algorithm design by eliminating performance bottlenecks inherent in other system models.
III. MEMORY SYSTEM
A. Memory Drivers
As the DDR-4 standard pushes engineering limits, JEDEC (the standards body that supports the development and implementation of the DDR memory standards) will not provide a DDR-5 standard [14] , [8] . This termination forces system vendors to explore alternative technologies. A potential alternative for beyond DDR-4 is to provide a hybrid memory system that integrates multiple types of memory components with different sizes, bandwidths, and access methods. There are also efforts underway to use some very different DRAM parts to build an integrated memory subsystem. These memory designs have much lower power and energy requirements compared to the current memory technology.
For example, consider a node that has two types of memory components. This hypothetical node would contain a fairly small number of 3D stacked memory parts that are mounted on top of or in the same carrier as the CPU chip. The bandwidth of these integrated memory parts will likely be in the low hundreds of gigabytes per second each, which is much higher than that of any current DDR-4 memory parts or memory modules. However, this increased bandwidth comes with lower capacity. Therefore, this high-bandwidth memory alone will be unable to support realistic applications and must be backed up with a higher capacity, lower bandwidth memory. This type of memory will likely be similar to DDR-4 and will provide the majority of the node memory capacity. These trade-offs are diagrammed in Figure 7. Such a two-level structure raises issues for the system software stack, including compilers, libraries, and OS support. Note that a node need not be restricted to two levels of memory and may have three or more levels, with each level composed of a different memory technology, such as NVRAM, DRAM, 3D stacked memory, or other memory technologies. As a result, future application analysis must account for the complexities created by these multi-level memory systems. Despite the increased complexity, the performance benefits of such a system should greatly outweigh the additional programming burden brought by multi-level memory. For instance, the amount of data movement will be reduced both for cache memory and scratch space, resulting in reduced energy consumption and greater performance.
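The capacity versus bandwidth trade-off can be explored with a simple back-of-the-envelope model. The sketch below estimates the effective bandwidth an application observes when a given fraction of its memory traffic is served by the high-bandwidth pool and the rest by the high-capacity pool. It assumes the two kinds of traffic are serviced one after the other rather than concurrently, and the bandwidth figures of 200 GB/s and 50 GB/s are hypothetical placeholders, not projections.

```python
def effective_bandwidth(fast_fraction, fast_gbs, slow_gbs):
    """Effective GB/s when fast_fraction of the traffic hits the fast pool.

    Assumes accesses to the two pools are serialized, so the time to move one
    gigabyte is the sum of the time spent in each pool.
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    time_per_gb = fast_fraction / fast_gbs + (1.0 - fast_fraction) / slow_gbs
    return 1.0 / time_per_gb


# Hypothetical pool bandwidths in GB/s.
FAST_GBS = 200.0
SLOW_GBS = 50.0

for fraction in (0.0, 0.5, 0.9, 1.0):
    bandwidth = effective_bandwidth(fraction, FAST_GBS, SLOW_GBS)
    print(f"{fraction:.0%} of traffic in fast memory: {bandwidth:.1f} GB/s")
```

Under this model the slower pool dominates quickly: even with half of the traffic served from the fast pool, the effective bandwidth stays well below the midpoint of the two rates, which is why placing the hot working set in the high-bandwidth memory matters.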
Further advantages can be demonstrated if, for example, NVRAM, which has a lower energy cost per bit accessed compared to DRAM, is used as part of main memory, allowing for an increase in total memory capacity while decreasing total energy consumption. Co-design interactions are needed to find the best ways to allow the application and system software developers to make use of these changed and expanded capabilities.
B. Future Memory Abstractions
For each of the node architectures presented in Section II-A, the layout of memory within the node is an orthogonal choice. It is possible for architectures to mix and match arrangements for compute and memory components. Due to the explosive growth in the expected number of threads in an exascale machine, the total amount of memory per socket can be in the range of one terabyte or more per node. As explained in Section III-A, it is expected that the memory system will be made up of multiple levels of memory that will trade capacity for bandwidth. This concept is familiar to application and algorithm developers who have optimized problem sizes to fit in local caches. These new memory levels will present additional challenges to developers as they select correct working set sizes for their applications. As an initial attempt to characterize likely memory sub-systems, we propose three pools of memory:

High-bandwidth memory: A fast, relatively small-capacity, high-bandwidth memory technology based on new memory standards such as JEDEC's high bandwidth memory (HBM) standard [7] or WideIO [6], or Micron's hybrid memory cube (HMC) technology [19].

Standard DRAM: A larger capacity pool of slower DDR DRAM memory.

Non-volatile memory: A very large but slower pool of non-volatile memory.

As shown in Figure 8, we propose two principal approaches to integrate these memory pools.

1) Physical Address Partitioned Memory System: In a physical address partitioned memory system, the entire physical memory address space is split into discrete ranges of addresses for each pool of memory (Figure 8a). This allows an operating system or runtime to decide on the location of a memory allocation by mapping the request to a specific physical address, either through a virtual memory map or through the generation of a pointer to a physical location. This system, therefore, allows for a series of specialized memory allocation routines to be provided to applications.
An application developer can specifically request the class of memory at allocation time. We envision that an application developer will be able to request a specific policy should an allocation fail due to memory pool exhaustion. Possible policies include allocation failure, resulting in an exception, or a dynamic shift in allocation target to a slower memory pool. While the processor cores in this node may possess inclusive caches, there will be no hardware support for utilizing the faster memory pools for caching slower pools. This lack of hardware support may be overcome if an application developer or system runtime explicitly implements this caching behavior.
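A specialized allocation routine of this kind might look like the following minimal sketch. It models the three pools as contiguous physical address ranges with a simple bump allocator, and lets the caller choose between the two exhaustion policies described above. All names (`Pool`, `OnExhaustion`, `PartitionedMemory`, `PoolExhausted`) and the capacities are hypothetical; this is not an existing OS or runtime interface, and it omits deallocation.

```python
from enum import Enum


class Pool(Enum):
    # Ordered from fastest to slowest.
    HIGH_BANDWIDTH = 0
    DRAM = 1
    NONVOLATILE = 2


class OnExhaustion(Enum):
    FAIL = "fail"            # raise an exception
    FALL_BACK = "fall_back"  # retry in the next slower pool


class PoolExhausted(MemoryError):
    pass


class PartitionedMemory:
    """Bump allocator over one contiguous physical address range per pool."""

    def __init__(self, capacities):
        self._next = {}
        self._limit = {}
        address = 0
        for pool in Pool:  # lay the ranges out back to back, fastest first
            self._next[pool] = address
            address += capacities[pool]
            self._limit[pool] = address

    def allocate(self, size, pool, policy=OnExhaustion.FAIL):
        """Return (pool used, start address) or raise PoolExhausted."""
        candidates = [p for p in Pool if p.value >= pool.value]
        if policy is OnExhaustion.FAIL:
            candidates = candidates[:1]
        for candidate in candidates:
            start = self._next[candidate]
            if start + size <= self._limit[candidate]:
                self._next[candidate] = start + size
                return candidate, start
        raise PoolExhausted(f"cannot place {size} units starting at {pool.name}")


# Hypothetical capacities in arbitrary units.
memory = PartitionedMemory(
    {Pool.HIGH_BANDWIDTH: 16, Pool.DRAM: 64, Pool.NONVOLATILE: 256}
)
first = memory.allocate(12, Pool.HIGH_BANDWIDTH)
second = memory.allocate(8, Pool.HIGH_BANDWIDTH, OnExhaustion.FALL_BACK)
print(first, second)
```

Here the second request no longer fits in the high-bandwidth range and is redirected to the DRAM range; with the `FAIL` policy the same request would raise `PoolExhausted` instead, leaving the decision to the application.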
2) Multi-Level Cached Memory System: An alternative memory model layout also has multiple pools present in the node, but they are arranged to behave as large caches for slower levels of the memory hierarchy (Figure 8b). For instance, a high-bandwidth memory pool is used as a caching mechanism for slower DDR or slower non-volatile memory. This would require hardware caching mechanisms to be added to the memory system and, in some cases, may permit an application developer or system runtime to select the cache replacement policy employed in the system. It is expected that this system will possess hardware support for the caching behavior between memory levels; however, a system lacking this hardware support could implement a system runtime that monitors memory accesses to implement an equivalent behavior in system software.
C. 3-D Stacked Memory Systems, Processing in Memory (PIM), and Processing Near Memory (PNM)
An emerging technology in the memory hierarchy is 3D-stacked memory (Table I). These stacks of memory will have a logic layer at the base to handle read and write requests to the stack. Not only will there be multiple memory dies in a single memory component, but in some versions these memory dies will be mounted directly on CPU chips, resulting in greater density with a reduced energy footprint. Additionally, processor-in-memory (PIM) functionality may emerge in conjunction with these stacked memory architectures. PIM capabilities offer acceleration to many memory operations, such as atomics, gather-scatter, pointer chasing, search, and other memory-bandwidth-intensive operations. These accelerators can execute faster and more efficiently than general-purpose hardware and can be part of the AMMs in Sections II-C or II-D. An SoC design flow creates an opportunity for many types of acceleration functions to improve application performance. These accelerators can make data movement more efficient by avoiding unnecessary copies, or by hiding or eliminating overhead in the memory system or a system's interconnect network. However, how best to expose these operations to the programmer is still an active area of research.
Some 3D memory technologies, such as the HMC, allow memory parts or modules to be "chained" in different topologies. In contrast, DDR connects a small number of memory parts to a processor. Similarly, there are other standards for high-performance memory on the horizon, such as HBM, which also connects only a single memory part to a single processor. "Chained" memory systems differ from DDR and HBM in their ability to support a very high memory capacity per node. The limits on per-node memory capacity when using chained systems would be dominated by dollar cost and total allowable power consumption. While the relative dollar cost of stacked memory is still unknown, it is expected to be higher (in $ per bit) than that of DDR. In contrast, the power cost, measured in joules per accessed memory bit, is expected to be significantly lower for 3D-stacked memory than for DDR.
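As an illustration of the software-managed alternative mentioned for the multi-level cached memory system, the following minimal sketch models a fast memory pool acting as a cache in front of a slower pool. The page granularity, the least-recently-used (LRU) replacement policy, and the names `SoftwareManagedCache` and `hit_rate` are assumptions made for illustration; they are not part of the AMMs.

```python
from collections import OrderedDict


class SoftwareManagedCache:
    """Fast memory pool used as an LRU cache in front of a slower pool.

    Hypothetical model: page granularity and the LRU policy are assumptions.
    """

    def __init__(self, fast_capacity_pages):
        self.fast_capacity_pages = fast_capacity_pages
        self.fast_pool = OrderedDict()  # page id -> True, oldest entry first
        self.hits = 0
        self.misses = 0

    def access(self, page):
        """Record one access and return the pool that served it."""
        if page in self.fast_pool:
            self.hits += 1
            self.fast_pool.move_to_end(page)  # mark as most recently used
            return "fast"
        self.misses += 1
        if len(self.fast_pool) >= self.fast_capacity_pages:
            self.fast_pool.popitem(last=False)  # evict least recently used page
        self.fast_pool[page] = True  # bring the page into the fast pool
        return "slow"


def hit_rate(trace, fast_capacity_pages):
    """Fraction of accesses in `trace` served from the fast pool."""
    cache = SoftwareManagedCache(fast_capacity_pages)
    for page in trace:
        cache.access(page)
    return cache.hits / len(trace)
```

Replaying the same access trace with different fast-pool capacities gives a rough sense of how sensitive an access pattern is to the size of the fast pool; a cyclic trace that slightly exceeds the capacity, for example, misses on every access under LRU.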

IV. CROSS-CUTTING PROGRAMMING CONSIDERATIONS
The AMMs described in Section II-A bring about a shift in performance and programming constraints that will require careful consideration in the development of future exascale-class applications, particularly for those demanding an evolutionary path for porting their code [1]. Because of the trade-offs these designs make to enable exascale, new methods and optimization goals will be necessary, such as enabling the use of incoherent caches and scratchpads, minimizing data movement, and maximizing data reuse for each byte loaded from memory, rather than focusing only on raw compute performance (FLOP/s). These architectural shifts must be accompanied by changes in programming models that are more adept at preserving data locality, minimizing data movement, and expressing massive parallelism.
A. Data Coherency, Movement, and Layout
As architectures move away from large, globally coherent caches, programmers will no longer be able to simply virtualize data movement across all the threads in a socket through the cache system. In the best case, regional coherence domains will automatically manage data movement between a subset of the cores; thus, programmers will need to start explicitly managing data movement between coherency domains on the chip. Between these domains there may be a relaxed consistency model, but the burden will still fall on developers to efficiently and manually share data on-chip between these domains. Moreover, configuring the amount of hardware-controlled cache versus software-controlled scratchpad will become an important tuning parameter. Current GPUs allow a split of 1/4, 1/2, or 3/4 of the on-chip storage between hardware- and software-managed memory, but we expect that future systems will allow for even more flexible configuration in the split between scratchpad and cache.
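To make the cache-versus-scratchpad tuning parameter concrete, the sketch below picks the smallest scratchpad share that still holds a kernel's explicitly managed working set and leaves the remainder as hardware-managed cache. It assumes the 1/4, 1/2, and 3/4 fractions refer to the scratchpad share; the function name `choose_split` and the 64 KiB storage size used in the usage note are hypothetical.

```python
from fractions import Fraction

# Assumed scratchpad shares of the on-chip storage (the rest is cache).
SCRATCHPAD_FRACTIONS = (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4))


def choose_split(on_chip_bytes, working_set_bytes, fractions=SCRATCHPAD_FRACTIONS):
    """Return (scratchpad_bytes, cache_bytes) for the smallest scratchpad
    share that fits the working set, or None if no share is large enough."""
    for frac in sorted(fractions):
        scratchpad = int(on_chip_bytes * frac)
        if scratchpad >= working_set_bytes:
            return scratchpad, on_chip_bytes - scratchpad
    return None
```

For a hypothetical 64 KiB of on-chip storage and a 20 KiB tile, `choose_split(64 * 1024, 20 * 1024)` selects the 1/2 split. A `None` result signals that the tile must be made smaller before it can be staged in the scratchpad.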
Today's machines already exhibit non-uniform memory access performance for various parts of memory within the node, and such issues will only get worse as on-chip parallelism increases. With core counts expected in the thousands, the issue will become even more detrimental to performance, since the differences in data movement costs between the various execution and memory component areas will become magnified. The importance of locality in new architectures that utilize explicitly managed memory systems will drive the community to develop more data-centric programming models. These advanced programming models can help abstract the data management complexities away from the program description, provide greater control over choices for expert programmers, and provide a more natural expression of those choices.
B. Performance Heterogeneity
Current parallel programming paradigms such as the bulk-synchronous parallel (BSP) [21] model implicitly assume that the hardware is homogeneous in performance. During each phase of computation, each worker thread is assigned an equal amount of work, after which the threads wait at a barrier for all of them to complete. If all threads complete their work at the same time, this computational paradigm is extremely efficient; however, if some threads take longer than others to complete their portion of the work, large amounts of time could be wasted by many threads waiting at the barrier for the slowest worker. As our AMMs demonstrate, the cores on a chip are expected to become more heterogeneous, requiring adaptation by the programmer, the programming model, and/or the system runtime to maintain a high level of performance.
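The cost of waiting at the barrier can be quantified with a small worked example. The sketch below assumes one BSP phase in which every thread's completion time is known; the function names and the sample times in the usage note are illustrative only.

```python
def barrier_idle_time(work_times):
    """Total time threads spend waiting at the barrier in one BSP phase."""
    phase_time = max(work_times)  # the phase ends when the slowest thread does
    return sum(phase_time - t for t in work_times)


def bsp_phase_efficiency(work_times):
    """Useful thread-time divided by total thread-time for one BSP phase."""
    phase_time = max(work_times)
    return sum(work_times) / (phase_time * len(work_times))
```

With four threads that each finish in 1.0 time units the efficiency is 1.0. If one of them takes 2.0 units instead, the other three wait a combined 3.0 units and the efficiency drops to 0.625.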
The use of multiple types of cores on a chip is demonstrated in the accelerator and heterogeneous multicore AMMs described in Section II-A. Highly parallel tasks with a regular memory access pattern will run more efficiently on throughput-optimized thin cores, while tasks with low levels of parallelism and significant branching will run more efficiently on latency-optimized fat cores. The programming model and runtime must be aware of this distinction and help the programmer select the right set of resources for each type of task so that the hardware is used efficiently.
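One way a runtime might act on this distinction is sketched below as a simple placement heuristic. The `Task` fields, the function `pick_core_type`, and the parallelism threshold of 64 are hypothetical; a real runtime would likely rely on measured behavior rather than static flags.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description supplied by the programmer or a profiler."""
    name: str
    parallelism: int        # number of independent work items
    regular_access: bool    # True for regular (for example, strided) memory access
    branch_heavy: bool      # True if control flow diverges significantly


def pick_core_type(task, min_parallelism_for_thin=64):
    """Return "thin" for highly parallel, regular, non-branching tasks;
    otherwise return "fat". The threshold is an assumed tuning knob."""
    if (task.parallelism >= min_parallelism_for_thin
            and task.regular_access
            and not task.branch_heavy):
        return "thin"
    return "fat"
```

Under these assumptions, a stencil sweep with thousands of independent, regularly accessed points would be placed on thin cores, while a branch-heavy tree traversal with little parallelism would be placed on fat cores.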
Future machines with higher degrees of performance heterogeneity as represented by our AMMs will require advanced runtime systems to dynamically adapt to changes in hardware performance. Current co-design opportunities for addressing these challenges include extending existing languages and parallel APIs (such as C++11 and OpenMP) and developing new, alternative languages and task-parallel runtimes (such as Charm++, HPX, and Legion, among many others). For each approach, the solution will need to provide high performance and adaptability with a low burden to application programmers.
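The benefit of dynamic adaptation can be illustrated by comparing a static, equal-count partition with a greedy work queue on cores of unequal speed. This is a simplified sketch that ignores scheduling overhead and data movement; the function names and the relative-speed model are assumptions and do not describe any of the runtimes named above.

```python
import heapq


def static_makespan(task_costs, core_speeds):
    """Completion time when tasks are dealt out round-robin, one core at a time,
    regardless of how fast each core is."""
    n = len(core_speeds)
    finish = [0.0] * n
    for i, cost in enumerate(task_costs):
        finish[i % n] += cost / core_speeds[i % n]
    return max(finish)


def dynamic_makespan(task_costs, core_speeds):
    """Completion time when the core that becomes free first takes the next task."""
    heap = [(0.0, i) for i in range(len(core_speeds))]  # (time free, core id)
    heapq.heapify(heap)
    for cost in task_costs:
        free_at, i = heapq.heappop(heap)
        heapq.heappush(heap, (free_at + cost / core_speeds[i], i))
    return max(t for t, _ in heap)
```

For eight unit-cost tasks on four cores, one of which runs at half speed (`core_speeds = [1.0, 1.0, 1.0, 0.5]`), the static partition finishes at time 4.0 because the slow core receives two tasks, while the work queue finishes at time 3.0 because the slow core takes only one.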
V. PROXY ARCHITECTURES FOR EXASCALE COMPUTING
A proxy architecture model (PAM) is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. PAMs were introduced as a co-design counterpart to proxy applications in the DOE ASCAC report on the Top Ten Exascale Research Challenges [15].
As an example of PAM usage, we identify approximate estimates for key parameters of interest to application developers. Many of these parameters can be used in conjunction with the AMMs described previously to obtain rough estimates of full node performance. These parameters are intended to support design-space exploration and should not be used for parameter- or hardware-specific optimization. This is also not an exhaustive list of parameters, and the estimates may have considerable error at this early stage in the development of exascale architectures. In particular, hardware vendors might not implement every entry in the table provided in future systems.
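As one possible way to turn such parameters into a rough node-level estimate, the sketch below applies a roofline-style bound, which caps attainable performance by either peak compute or memory bandwidth multiplied by a kernel's arithmetic intensity. The roofline bound is a technique chosen here for illustration rather than a method prescribed by this paper, and the two proxy options and all of their numbers are hypothetical placeholders, not values from Table II.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, arithmetic_intensity):
    """Roofline-style bound in GFLOP/s.

    arithmetic_intensity is the kernel's FLOPs per byte moved from memory.
    """
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)


# Hypothetical proxy options; these numbers are placeholders, not Table II data.
PROXY_OPTIONS = {
    "option_a": {"peak_gflops": 1000.0, "mem_bw_gbs": 100.0},
    "option_b": {"peak_gflops": 4000.0, "mem_bw_gbs": 500.0},
}


def compare_options(arithmetic_intensity, options=PROXY_OPTIONS):
    """Return {option name: attainable GFLOP/s} for one kernel."""
    return {
        name: attainable_gflops(p["peak_gflops"], p["mem_bw_gbs"], arithmetic_intensity)
        for name, p in options.items()
    }
```

A kernel with low arithmetic intensity is bandwidth-bound under both placeholder options, so the estimate scales with memory bandwidth rather than with peak FLOP/s. This kind of coarse comparison is the sort of design-space exploration the parameters are intended to support.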
The lists of parameters in Table II allow application developers and hardware architects to tune any AMMs to their
TABLE II: Opt1 and Opt2 represent possible proxy options for the abstract machine model. M.C: multi-core, Acc: accelerator, BW: bandwidth, Proc: processor. For models with accelerators and cores, C denotes FLOP/s from the CPU cores and A denotes FLOP/s from accelerators.
