Will refactoring applications for current-generation systems apply to future-generation systems? Is there help available for this enormous task? Centers of Excellence are mutually beneficial collaborations between application developers and vendors to address these issues: vendors get a better understanding of user needs, and users gain a better understanding of how to address their challenges.
If you search the Internet for the "history of supercomputing," you'll likely see timelines with dates and photos marking when first-of-its-kind architectures hit the scene. Considering the petaflop (10^15 floating-point operations per second [FLOPS]) capabilities of modern systems, you might chuckle when you read about the specifications of early architectures (for example, the Cray 1 in 1976 delivered 160 Mflops at a whopping 80 MHz!) and be amazed by the jump in FLOPS enabled with each successive new system.
You'll have to do greater detective work to uncover evidence of application developers and computer scientists working behind the scenes to figure out how to run scientific simulations on these systems when compilers were still being developed and operating systems had critical bugs. Folklore passed down to us from our older colleagues is filled with examples of a close working relationship between supercomputer users and the vendors that provided these systems in an early incarnation of codesign. These early collaborations led to the development of compilers when none existed or new programming styles for vector processors.
The practice of co-design has always existed in the use of cutting-edge supercomputers and was first formalized in a US Department of Energy (DOE) high-performance computing (HPC) system procurement in 2005 through the establishment of the Cray Center of Excellence at Oak Ridge National Laboratory (ORNL) to support scientific applications transitioning to the Cray X1E and Cray XT3 (http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=676362). In 2009, ORNL created the Center for Accelerated Application Readiness (CAAR), where six targeted application teams worked closely with experts from Cray and Nvidia for two years prior to system delivery to prepare for the disruptive change of the Titan system with 18,668 nodes using general-purpose graphics processing units (GPGPUs) for the first time (https://www.olcf.ornl.gov/titan). Titan's success, demonstrated by the large number of scientific applications ready to use the system, established the Center of Excellence (COE) approach of integrating vendor expertise with application developers as a DOE best practice in large-scale systems procurement.
These efforts laid the groundwork for codesign but focused more on enabling users with improved programming know-how to efficiently use the provided system. Ideally, the process of co-design for procuring HPC systems would be a two-way street, "with application software influencing hardware design tradeoffs, while also recognizing that applications and supporting software must be developed in anticipation of hardware changes." 1 The information to impact next-generation hardware design seems to be falling through the cracks, as each successive system becomes more complex and disruptive, and less productive for application developers. This is in part because the high-volume commercial market largely drives processor development, rather than the smaller HPC market.
Recently, a full-procurement lifecycle co-design approach has been defined and formalized into the DOE Office of Science and National Nuclear Security Administration (NNSA) HPC system procurement process. In addition to the contract for system purchase, the DOE has implemented Non-Recurring Engineering (NRE) awards during the last two major joint system procurements, Trinity/NERSC-8 and CORAL (www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp; https://asc.llnl.gov/CORAL). The NRE provides early requirements gathering and joint development of system design, such as implementing a burst buffer, advanced power management, and software stack components, and accelerating other vendor technology to DOE users with user feedback.
As the COE matures, vendors and subject matter experts begin to interact and communicate frequently. This frequent two-way communication affects the current-generation system software and programming environment as well as the next-generation system and processor design. The interactions focus on exposing real (often legacy) application needs to the vendor experts, who then feed the information back to the hardware (processor and system) design team as well as the software (tools, compiler, and applications) team. It is through this collaborative process that the needs of HPC applications have a voice in the next-generation processor and system development.
This article describes the Trinity COE and its co-design activities over the past two years among application developers at Los Alamos, Sandia, and Lawrence Livermore national labs and vendor partners Cray and Intel. Trinity stands at nearly 20,000 nodes, roughly half with Intel Haswell Xeon processors and half with Intel Xeon Phi Knights Landing (KNL) processors. COE activities began over a year prior to receiving the first KNL nodes and a year and a half prior to receiving KNL nodes with the Cray programming environment and operating system. A co-design process is more essential now than ever before, as multiple vendors deliver hardware and software, and how the components will work together is less predictable. The benefits of having more human effort with diverse views and experience brought together through the COE to explore the unknown cannot be overstated.
Trinity
The New Mexico Alliance for Computing at Extreme Scale (ACES), a joint Los Alamos National Laboratory (LANL) and Sandia National Laboratories (SNL) partnership, awarded Cray the contract to procure Trinity on 9 July 2014 (https://nnsa.energy.gov/mediaroom/pressreleases/trinity). Trinity was designed to meet the unique mission needs of the NNSA Stockpile Stewardship Program (SSP) and help prepare the DOE Advanced Simulation and Computing (ASC) Program for future systems on the path toward exascale computing. It is used by scientists at Los Alamos, Sandia, and Lawrence Livermore national laboratories to run the largest and most demanding simulations to ensure the safety, security, and effectiveness of the US nuclear stockpile without the use of underground testing.
Trinity is a single-system Cray XC40 with both Intel Xeon processors (codename Haswell) and Intel Xeon Phi 7250 processors (codename KNL). The Haswell compute nodes have two 16-core Haswell processors operating at 2.3 GHz with 128 Gbytes of DDR4-2133 memory. 2 The KNL compute nodes contain a single 68-core KNL processor operating at 1.4 GHz with 96 Gbytes of DDR4-2400 memory and 16 Gbytes of on-package high-bandwidth memory (HBM). Table 1 compares Trinity system specifications to those of its predecessor, Cielo, 3 a Cray XE6 that was retired in 2016. Trinity provides more than twice as many nodes, roughly four times more on-node memory, and eight times more total system memory than Cielo, which was a key driver for Trinity's design to meet the needs of the ASC's simulation workload.
Trinity Center of Excellence
The Trinity COE provides a long-term collaboration between subject matter experts (SMEs) from Cray and Intel and the ASC code teams to support the transition of key scientific applications to Trinity. A wide range of computational motifs and approaches are important to mission-class applications, including hydrodynamics, linear solvers, deterministic transport, Monte Carlo, molecular dynamics, and material contact, among others.
The ASC codes are inherently complex due to the nature of the problems they address, which creates challenges for any code development activity. These multiphysics applications often have millions of lines of code and depend on many third-party software packages. They support computational campaigns that are long-running with performance behavior that can change drastically throughout the lifetime of the run and with different inputs. Coupling this with the requirement to port and achieve performance on existing and future architectures is an ongoing struggle for application developers.
Application Challenges
Prior to HPC systems with GPGPUs and manycore processors like the Intel Xeon Phi processor KNL, application developers could achieve increased performance with minimal code development effort from the increased frequency and the number of cores on the node with each new system. They continued to use distributed memory programming (MPI) and scalar processing, where one pair of operands is operated on sequentially. Today, systems require not only MPI but also mechanisms to exploit the high levels of concurrency inside the node. To achieve increased performance on the Trinity system with the KNL processors, applications must exploit the hardware threads as well as the vector processing capability.
Effective use of Trinity involves on-node challenges, such as increasing parallelism to use the increased number of cores and threads; enabling (or not hindering) compiler vectorization with AVX-512 instructions; and identifying data structures that will benefit from residing in high-bandwidth memory and explicitly managing the memory hierarchy.
On-node parallelism. Cielo provided applications with 16 cores per node. With Trinity, the Haswell nodes provide 32 physical cores (each supporting
Vectorization. Vectorization performs computation on multiple data objects with a single instruction, which can greatly improve performance and operation efficiency. Cielo's Magny-Cours had 128-bit registers to handle Streaming SIMD Extensions (SSE) instructions that operate simultaneously on two 64-bit double-precision numbers. The Haswell processor, with Advanced Vector Extensions (AVX), has a vector width of 256 bits to allow simultaneous operations on four double-precision numbers. The KNL processor with AVX-512 instructions further doubles the vector length and can operate on eight double-precision numbers simultaneously. Compilers can auto-vectorize code that they consider safe for vectorization, but applications must be structured to take advantage of it. Unfortunately, common coding practices, such as branching or data dependencies within a loop, often inhibit vectorization. Language choice can also significantly affect how well compilers auto-vectorize. Codes written in traditional HPC languages, such as Fortran and C, typically vectorize better than those written in C++. Given the tremendous movement to C++ for scientific applications, close communication and co-design with vendor compiler developers are needed.
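As a rough illustration of the vector widths above, the following Python sketch computes how many double-precision values fit in one register and the minimum number of vector instructions a fully vectorized loop would need. The function names and the 1,000-element loop are illustrative assumptions, and real compilers add remainder handling and other overheads that are not modeled here.

```python
import math

DOUBLE_BITS = 64  # width of one double-precision value

# Register widths in bits, as described in the text.
REGISTER_BITS = {
    "SSE (Magny-Cours)": 128,
    "AVX (Haswell)": 256,
    "AVX-512 (KNL)": 512,
}

def lanes(register_bits, element_bits=DOUBLE_BITS):
    """Number of elements processed by one vector instruction."""
    return register_bits // element_bits

def min_vector_ops(n_elements, register_bits):
    """Lower bound on vector instructions for one operation over n elements."""
    return math.ceil(n_elements / lanes(register_bits))

if __name__ == "__main__":
    n = 1000  # hypothetical loop trip count
    for name, bits in REGISTER_BITS.items():
        print(f"{name}: {lanes(bits)} doubles per instruction, "
              f"at least {min_vector_ops(n, bits)} instructions for {n} elements")
```

The lower bound is only reached when the loop has no branches or loop-carried dependencies that prevent the compiler from vectorizing it.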
Memory hierarchy. The Trinity KNL nodes provide a unique feature with the use of both DDR4 memory and high-bandwidth memory (HBM). The HBM can be configured at boot time to be used as a pseudo L3 cache (cache mode) or a memory space explicitly managed by the user (flat mode). Explicitly managing memory can be complicated. If the code teams can identify and partition the data into blocks that fit in the HBM, using the HBM as a flat memory will provide the best performance. However, the ideal data structures can easily change based on the size of the simulation and the simulation physics over time. Facing the ever-changing nature of our application data, the cache mode is more accommodating as it automatically captures frequently used data. Similar to the flat mode, the cache mode can be ineffective as data size grows beyond the HBM capacity.
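One way to think about flat mode is as a packing problem: decide which arrays go into the 16 Gbytes of HBM and which stay in DDR4. The sketch below uses a simple greedy heuristic that ranks arrays by estimated traffic per Gbyte stored. The array names, sizes, and traffic figures are hypothetical, and the heuristic is an assumption made for illustration, not the method used by the Trinity code teams.

```python
from dataclasses import dataclass

HBM_CAPACITY_GB = 16.0  # on-package HBM per KNL node, from the text

@dataclass
class Array:
    name: str
    size_gb: float     # memory footprint (must be positive)
    traffic_gb: float  # estimated data moved per time step (hypothetical)

def place_in_hbm(arrays, capacity_gb=HBM_CAPACITY_GB):
    """Greedy placement: highest traffic per Gbyte first, skip what does not fit."""
    ranked = sorted(arrays, key=lambda a: a.traffic_gb / a.size_gb, reverse=True)
    hbm, ddr, free = [], [], capacity_gb
    for a in ranked:
        if a.size_gb <= free:
            hbm.append(a.name)
            free -= a.size_gb
        else:
            ddr.append(a.name)
    return hbm, ddr

# Hypothetical data structures for one node of a simulation.
arrays = [
    Array("field_state", 10.0, 400.0),
    Array("mesh_connectivity", 6.0, 60.0),
    Array("material_tables", 4.0, 120.0),
    Array("restart_buffers", 30.0, 3.0),
]
hbm, ddr = place_in_hbm(arrays)
print("HBM:", hbm)
print("DDR4:", ddr)
```

If the problem grows so that the hypothetical `field_state` array no longer fits in 16 Gbytes, the same function moves it to DDR4 and promotes smaller arrays instead, which mirrors the point that the ideal placement changes with problem size.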
In addition to the new compute node capabilities, Trinity also provides 576 burst buffer nodes with 3.69 Pbytes (PB) of raw storage capacity and 3.28 Tbytes/s of bandwidth, and 78 PB of Lustre filesystem storage with 1.6 Tbytes/s aggregate peak bandwidth. This increase in storage capacity and the added storage hierarchy will challenge applications to use the burst buffer to reduce I/O overhead for checkpointing, enable new workflow capabilities within a simulation run, and manage massive amounts of simulation data more efficiently.
The composed Trinity system with nearly 20,000 nodes and both Haswell and KNL processors also provides new opportunities and challenges. Scaling simulations to use the massive size of the system for once-unattainable calculations takes additional effort, as does making long-running simulations resilient to hardware failure rates that are non-negligible at large scale. The heterogeneous composition of Trinity, with two types of nodes, will provide new workflow and simulation coupling capabilities that should be explored.
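To give a feel for the bandwidth figures above, this sketch estimates the best-case time to write one checkpoint to each storage tier. The 500-Tbyte checkpoint size is a hypothetical value, and the estimate assumes the full aggregate bandwidth is achieved, which real applications rarely see.

```python
BURST_BUFFER_TB_PER_S = 3.28  # aggregate burst buffer bandwidth, from the text
LUSTRE_TB_PER_S = 1.6         # aggregate peak Lustre bandwidth, from the text

def write_time_seconds(checkpoint_tb, bandwidth_tb_per_s):
    """Best-case write time: checkpoint size divided by aggregate bandwidth."""
    return checkpoint_tb / bandwidth_tb_per_s

checkpoint_tb = 500.0  # hypothetical checkpoint size in Tbytes
t_bb = write_time_seconds(checkpoint_tb, BURST_BUFFER_TB_PER_S)
t_lustre = write_time_seconds(checkpoint_tb, LUSTRE_TB_PER_S)
print(f"burst buffer: {t_bb:.0f} s, Lustre: {t_lustre:.0f} s, "
      f"ratio: {t_lustre / t_bb:.2f}x")
```

Under these assumptions the ratio depends only on the two bandwidths, so the burst buffer roughly halves the time an application spends blocked on a checkpoint, regardless of checkpoint size.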
Trinity COE Goals
The contractual goals of the Trinity COE focus on porting key ASC applications, achieving performance on them, and attaining simulation scales that weren't previously possible. However, for application developers, productivity is as important as performance. COE participants strive to amortize their efforts across several generations of HPC platforms and have a broader set of goals to shape activities and best practices. Some of these goals include the following:
■ Create infrastructure for future code development activities, such as refactoring codes with consideration for the KNL as well as other foreseeable future computing hardware, including GPGPUs; creating proxies for quick prototyping; and exploring abstractions.
■ Educate current and next-generation application developers to program with architectural evolution in mind.
■ Develop long-term relationships across SMEs from vendors, other labs, other departments, and application development teams to continuously share best practices and ease the transition to future architectures.
■ Impact future system design and procurement to consider the needs of ASC applications and code development cycles.
Trinity COE Co-Design Best Practices
The Trinity COE works with a variety of application developers, each with their own vision for transitioning codes to new architectures. To optimally benefit from COE interactions, application developers must commit time and effort to educate vendors on their problem space, iterate on possible solutions, and implement solutions into the main code branch. It's only with this level of commitment that the full impact of the COE and its codesign goal can be realized and applications can see long-ranging impact. There's no one-size-fits-all approach, although some best practices have been discovered by trying multiple activities.
Prototyping with Proxy Applications
The development and use of proxy applications for Trinity and future systems provide key advantages. Proxy applications are designed to be freely distributable to vendors, collaborators, other laboratories, and universities. They reflect the design of the larger target applications, but because they are proxies, they may not be written as the most efficient source code for Trinity.
Proxies provide an agile testing ground to prototype new approaches prior to implementing them into larger production codes. This allows developers to make a low-risk investment on a high-risk approach. Although it isn't always obvious how changes in a proxy application can be implemented into a full production application with much more complexity, it still remains a good learning opportunity for both the application developers and vendor partners.
The COE hosted Peter Mendygral (Cray) to discuss his work on the development of WOMBAT, an astrophysics application 4 that employed high-level OpenMP threading in an SPMD style, where each thread allocated its own memory, computed a disjoint subset of the computational grid, and even performed its own message passing. The application was developed in close cooperation with Cray's MPI group, and a major enhancement was made to minimize locking within the MPI library so that one-sided messaging could proceed asynchronously. This resulted in excellent scaling on the node using high-level OpenMP with threads doing their own messaging. Given these results, Cray's COE team introduced MPI-3 RMA one-sided messaging into LANL's SNAP proxy. The work was completed at the end of 2016, and this year, the one-sided approach will be incorporated into PartiSN, an important Trinity application.
This case study highlights the tremendous advantage of the COE and a co-design project that improved both application and vendor software. The cooperation between the application developers and the Cray COE members allowed for a significant rewrite that benefited from enhancements to the vendor's software, in this case MPI. Interaction within the COE is a two-way communication between the application developer and the vendor's development team, facilitated by the vendor's COE member.
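The SPMD threading pattern described for WOMBAT can be sketched in a few lines. The Python example below is only a conceptual model and is not WOMBAT, SNAP, or MPI code: Python threads stand in for OpenMP threads, a shared list of halo slots stands in for an RMA window, and a `threading.Barrier` stands in for a synchronization fence. All function and variable names are hypothetical.

```python
import threading

def smooth_serial(grid):
    """Reference: 3-point average with edge values replicated at the boundaries."""
    padded = [grid[0]] + list(grid) + [grid[-1]]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

def smooth_spmd(grid, num_threads):
    """Same stencil in SPMD style: each thread owns one slab of the grid."""
    n = len(grid)
    if not 1 <= num_threads <= n:
        raise ValueError("each thread needs at least one cell")
    bounds = [(t * n // num_threads, (t + 1) * n // num_threads)
              for t in range(num_threads)]
    # One halo "window" per thread: [left_halo, right_halo]. Neighbours write
    # into it directly, loosely mimicking a one-sided put.
    windows = [[None, None] for _ in range(num_threads)]
    results = [None] * num_threads
    fence = threading.Barrier(num_threads)  # stands in for an RMA fence

    def worker(tid):
        lo, hi = bounds[tid]
        local = list(grid[lo:hi])             # thread-private storage
        if tid > 0:
            windows[tid - 1][1] = local[0]    # put my left edge into left neighbour
        if tid < num_threads - 1:
            windows[tid + 1][0] = local[-1]   # put my right edge into right neighbour
        fence.wait()                          # all puts are complete past this point
        left, right = windows[tid]
        padded = ([local[0] if left is None else left] + local +
                  [local[-1] if right is None else right])
        results[tid] = [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
                        for i in range(1, len(padded) - 1)]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [x for slab in results for x in slab]

grid = [float(i * i % 7) for i in range(23)]
print(smooth_spmd(grid, 4) == smooth_serial(grid))
```

The point of the pattern is that no thread waits for a matching receive: each one deposits its edge values and continues, and a single synchronization point makes the data visible. How much this helps in practice depends on the locking behavior of the underlying messaging library, which is what the collaboration with Cray's MPI group addressed.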
Integrated and Layered Collaborative Application Exploration
Collaborative application exploration sessions, often called deep dives, hackathons, bootcamps, or the Intel-centric terms discovery and dungeon sessions, are invaluable opportunities to help prepare code teams to adjust to more complex hardware environments. These activities bring vendor SMEs together with application developers to identify where to focus application development and see the impact of real-time code changes with feedback and collaboration with tools and performance experts. The vendor SMEs bring domain knowledge about the hardware, compilers, and performance analysis tools. The DOE developers bring domain knowledge about the physics in the source code, as well as the application's architecture, design, constraints, and portability goals.
September/October 2017
SNL developers have been enthusiastic participants in the COE and benefited from the iterative interactions with both Cray and Intel. Sandia's COE efforts focus on the Sierra Mechanics engineering analysis code suite. 5 Some of their joint activities and outcomes include the following:
■ profiling the various Sierra modules for performance bottlenecks, which included linear solvers for the implicit modules (such as Sierra's domain decomposition solvers and the Trilinos multigrid solvers; the domain decomposition solvers depend heavily on a sparse direct solver that in turn depends on an efficient matrix-matrix multiply [DGEMM] kernel), kernel loops for matrix assembly and explicit dynamics, and specialized algorithms such as thermal enclosure radiation and contact mechanics;
■ vendor SMEs and laboratory computer science (CS) staff integrated into the development teams (weekly sessions covered a variety of collaborations such as additional training on vendor tools and deep dives into compiler and tools issues);
■ application discovery sessions using Intel's profiling and analysis tools, such as Intel VTune, Advisor, and Inspector; and
■ Intel-hosted dungeon sessions, multiday offsite deep dives hosted at the Intel facility in Hillsboro, Oregon, where teams focused on various aspects of the Sierra software stack such as Trilinos multigrid solver performance, domain decomposition solver performance, and finite-element matrix assembly performance.
For the dungeon sessions, Sandia teams brought a combination of small and large proxy applications. The larger proxy applications helped expose the Intel SMEs to the complexities of real applications. One advantage to having the dungeon hosted at Intel's facility was the availability of a wide range of vendor SMEs who could be temporarily pulled into the dungeon to collaborate on issues, in particular Intel compiler and math library experts. Some compiler issues and DGEMM performance issues with small matrices were identified.
How to Create a Very Successful Bootcamp
In July 2016, prior to delivery of the Trinity KNL nodes, Cray held a bootcamp for 35 attendees from LANL, LLNL, and SNL at Cray in St. Paul, Minnesota. This three-day activity provided continuous access to 10 to 15 Cray software developers and benchmarkers to facilitate the attendees' porting and optimization work on 156 KNL nodes of Cray's internal software development system. It was the first time the attendees had access to a multi-node KNL-based system. The code teams were able to bring up their full multiphysics production codes, rather than single-node kernels, due to the preplanning undertaken by Cray to prepare for the event.
During the bootcamp, Cray SMEs gave presentations, but the bulk of the time was spent on hands-on access, where Cray experts were able to sit down with attendees and immediately assist when questions arose. Notable results from the event include the following:
■ All codes were able to build with at least one compiler and run on a single KNL node (out-of-the-box performance ranged from good to bad, with more work to be done).
■ Approximately 80 percent of codes were able to run on more than one KNL node for scaling studies, and some applications were able to use more than 100 nodes.
■ Some codes showed out-of-the-box performance improvements compared to other existing production systems.
Early Access to Compilers and Rapid Feedback
Sierra is a very large, complex C++ code base from SNL that often poses problems for compilers and tools. Access to prerelease compilers and tools, interactions with vendors on compiler and tool issues, and testing of compilers and tools with full applications have proven invaluable for improving both compilers and tools. Historically, it has usually taken multiple compiler patch releases before Sierra could be successfully compiled, let alone pass all its tests. However, because of the COE effort, the initial release of Intel's latest compiler is able to successfully compile Sierra, and the binary produced is able to pass over 90 percent of its test suite. We continue to work with compiler vendors to improve the compilers' auto-vectorization capabilities, which are crucial to performance on the KNL processors. Early interactions with the Intel compiler team (at the Intel dungeon and through early access and evaluation) also revealed performance issues with the multithreaded Intel Math Kernel Library (MKL). Some of these have already been addressed by Intel, and others are being worked on in collaboration between Intel and SNL.
Early Access to Hardware
One of the tremendous advantages of the COE is that it allows the vendor's COE members to take key applications and test them on prerelease systems. In the Trinity case, as soon as hardware and software are available for testing, they are made available for COE members to test the most important Trinity applications. With this early access, important information such as performance and necessary software tuning can be provided to the application developers for a head start on understanding how best to utilize the new system.
COE Impacts and Benefits to Design Life Cycle
The requirements of the Trinity application developers had a significant impact on Cray's software development. As mentioned earlier, MPI was modified due to a direct requirement for optimizing an important application. Additionally, Cray's development of performance monitoring and memory management tools was a result of requirements that arose from discussions with Trinity application developers. Cray has significantly benefited from input from the COE teams over the last 10 years and as a result has been able to satisfy the needs of Trinity's application developers.
The COE provides the vendor a chance to understand application demands and apply transistors where they'll provide the maximum benefit. Through interactions with COE application teams, the vendor gains knowledge about what hardware features can benefit target applications. For example, if mixed precision is an important application feature, allocating transistors to transition quickly between mathematical precisions could greatly improve application performance. However, gathering the requirements is only the first step in designing the next-generation processor. Often, different applications have different requirements that can conflict with each other. For example, one application might require higher double-precision computation capability, and another might want more memory bandwidth. Processor designers then need to weigh the benefit and cost of all features and try to keep the design within a transistor and power budget. To complicate things further, there could be multiple solutions to the same problem; for example, increasing processor clock speed or increasing out-of-order buffer depth could also improve an application that requires mixed-precision arithmetic.
In the Trinity COE collaboration, close partnership with various national labs has helped Intel understand the advantages and disadvantages of, as well as the preferences for, the various on-package memory modes in different applications, helping the company design memory subsystems for next-generation processors. The impact on application architecture is equally profound. Given inputs from hardware vendors, application teams can
