In 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. Its use marks a significant shift in direction, introducing the concept of heterogeneous computing via GPUs. An overview of LLNL's overall application strategy focuses on how it's ensuring a successful transition of mission-oriented applications into the exascale era.
High-performance computing (HPC) application developers face a broad-based paradigm shift in system architectures as we march toward exascale. The need for increased performance continues (something to which we've grown accustomed), but technology concerns related to power, parallelism, and resilience now complicate HPC software and application development.
The process of co-design was born out of the realization that exascale performance will require broader coordination within the HPC ecosystem. Technologies such as massively multicore CPUs and heterogeneous architectures answer the need for hardware speed but require application developers to handle the commensurate complexity of massive fine-grained parallelism. Developers must also consider that the memory wall (how CPU speed increases have far outpaced the performance of memory) has finally manifested itself as data motion replaces floating-point performance as the typical application bottleneck. Therefore, application developers must now deeply understand diverse architectures, while architects of hardware, the supporting software stack, and programming language standards must deliver innovations that ease this transition. The relatively stable architectures and programming models of the past 25 years allowed application developers and users to focus primarily on scientific discovery and less on the disruptive churn caused by continuous change in the underlying technology. The end of this trend could require major changes in these often large, complex, and validated applications that divert these users from their primary purpose.
September/October 2017
At Lawrence Livermore National Laboratory (LLNL), we field a variety of cutting-edge systems for the Advanced Simulation and Computing (ASC) program of the National Nuclear Security Administration (NNSA) and similar systems acquired through institutional funding. This article focuses on the ASC Program's Advanced Technology (AT) and related LLNL institutional systems, which, by definition, stretch the achievable level of capability and thus present the most challenging environments.
The Importance of Application Modernization
Application modernization entails not only preparing for the next AT system but also reducing the effort required to prepare for subsequent ones. This process reduces the cost required to complete mission-critical work into the longer future while ensuring its success.
Mission-Oriented Application Development
The ASC program has long been a leader in deploying top-tier HPC systems. These systems and the applications that run on them form a cornerstone of the US science-based stockpile stewardship program aimed at maintaining a safe, secure, and effective nuclear deterrent in the absence of underground testing. ASC is an integrated program that purchases and deploys the platforms, supports the facilities, and develops the integrated codes that both simulate a variety of other defense missions and basic material science capabilities down to first principles. It leverages experimental facilities for robust verification and validation, and builds the foundational computer science and mathematics that connect these elements.
Between LLNL and the Alliance for Computing at Extreme Scale (ACES), which consists of NNSA's other national laboratories, Sandia National Laboratories (SNL) and Los Alamos National Laboratory (LANL), ASC deploys two new flagship systems every five years, alternating between LANL and LLNL. The effective advance of the NNSA's science-based stockpile stewardship mission defines the success of these platforms. Thus, our strategies for application development are critical.
LLNL and ASC's Two-Pronged Application Strategy
The NNSA and US Department of Energy (DOE) are actively defining the upcoming exascale era. Application developers are assessing whether its technologies will represent a monumental shift that won't sufficiently support today's applications, at best achieving a small fraction of the potential performance. This concern combined with recent selections for AT systems led to the formation of a new ASC program element, Advanced Technology Development and Mitigation (ATDM), in 2014 to pursue longer-term mission goals. ATDM provides a once-in-a-generation opportunity for the NNSA labs to develop new mission-oriented applications that exploit advanced technologies as the details of the architectures emerge.
NNSA application users are accustomed to highly validated applications, so they don't consider an application better just because it may run (even much) more efficiently on future systems. Replacing a production application with a new from-scratch one is a decadal process that must provide a similar level of trust. Thus, ASC is pursuing a two-pronged strategy to modernize existing, trusted applications and to create modern ones that exploit exascale architectures.
This approach not only supports our enduring mission but can also provide insight into the tradeoffs other organizations must consider in the future. In particular, what benefits does new from-scratch application development for the exascale era provide over evolving existing trusted applications? Fully answering this question will take 5 to 10 years, but early indicators will emerge as production use of GPU-based and manycore systems begins.
A clean-slate approach to application development.
At LLNL, we expect MPI and OpenMP to remain sufficiently scalable and agile to address exascale systems, and we're exploring high-order methods that increase the floating-point operations per memory access of our workloads. We've previously tried these methods, but when floating-point units were the primary bottleneck in HPC systems, they often didn't result in faster solutions compared to traditional lower-order methods. However, with data motion increasingly becoming a performance bottleneck and the need for extreme fine-grained concurrency, this approach has shown promising early results.
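One way to reason about why more floating-point operations per memory access can pay off is a roofline-style estimate. The roofline model is brought in here as an assumption and isn't named in the article: attainable throughput is taken as the lesser of the peak floating-point rate and the product of arithmetic intensity and memory bandwidth. The function names and all machine numbers below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Roofline-style estimate (an illustrative assumption, not from the article).
// peak_gflops:   peak floating-point rate of the node, in GFLOP/s
// bandwidth_gbs: sustained memory bandwidth, in GB/s
// intensity:     floating-point operations performed per byte moved
// Returns the attainable rate in GFLOP/s.
double attainable_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity >= 0.0);
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// True when the kernel is limited by data motion rather than by arithmetic.
bool is_bandwidth_bound(double peak_gflops, double bandwidth_gbs, double intensity) {
    return bandwidth_gbs * intensity < peak_gflops;
}
```

On a hypothetical node with 1,000 GFLOP/s of peak and 100 GB/s of bandwidth, a kernel at 0.25 flops per byte is capped at 25 GFLOP/s, while one at 20 flops per byte can reach the floating-point peak. Under this simplified model, raising intensity is what moves a kernel off the bandwidth limit.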
Modernizing our existing production code base. ASC and LLNL are also working to ensure that our current production applications can run effectively on future advanced architectures. These production applications must remain stable for use in satisfying national security-related deliverables during a disruptive transition. The rest of this article focuses primarily on these current production applications (and their existing underlying mathematical algorithms), as they must run immediately on Sierra and are the main focus of the Sierra Center of Excellence (CoE). While CoEs work closely with ATDM next-generation applications, their time frame to begin production use is several years after Sierra's initial deployment.
ASC's AT Systems Push the Limits
ASC's Platform Strategy 1 outlines the NNSA platform acquisition strategy, which deploys two classes of systems. Commodity Technology (CT) systems provide significant computing power for the day-to-day workload while minimizing software changes. These systems, typically x86-based Linux clusters, run a familiar software stack and support small- to medium-scale jobs. Alternatively, AT systems represent the leading edge of the HPC market. These platforms run the most challenging large-scale ASC simulations and provide national leadership in the development of hardware and software technology. Successful AT technologies often "trickle down" to future CT systems once the technology is well-understood, affordable, and no longer a significant challenge to the majority of our applications.
In 2014, LLNL joined Oak Ridge (ORNL) and Argonne (ANL) National Laboratories in a joint request for system proposals under CORAL, which is the Collaboration of Oak Ridge, Argonne, and Livermore (https://energy.gov/downloads/fact-sheet-collaboration-oak-ridge-argonne-and-livermore-coral). Each of these laboratories was planning to deploy systems in the 2017-2018 time period, and the CORAL process selected the pair of systems that provide the best overall value to the DOE. The DOE Office of Science Leadership Computing Facilities at Argonne and Oak Ridge selected competing systems (from Intel and IBM, respectively), and LLNL selected the IBM system for its DOE NNSA mission.
Procurement of an AT system begins years before its delivery. This choice brings significant uncertainty at the far reaches of technology roadmaps. However, it provides essential time to prepare for these cutting-edge systems that often entail significant disruptions, such as with Sierra.
Lessons Learned from Previous AT Systems
From the start, LLNL worked closely with IBM to develop and to deploy the BlueGene series of systems. The first generation, BlueGene/L, represented a major HPC architecture shift by using low-powered CPUs to save energy and orders of magnitude more processors. The BlueGene line culminated in the development of BlueGene/Q, the largest instance of which was installed at LLNL in late 2011 and early 2012 with 1.6 million cores and 1.6 Pbytes of main memory. While BlueGene represents a significant leap forward in scaling of software and applications, its dominant programming model, MPI, remains largely familiar. Each core has 1 Gbyte of memory, which is tight by historical standards but doesn't require a wholesale shift in application development.
The BlueGene line, culminating in Sequoia's massive core counts, forced LLNL developers to learn how to scale out applications. O(N) algorithms and data structures (where N is the number of MPI processes) became performance or memory bottlenecks and were redesigned. Applications must also exploit BlueGene/Q's hardware threads and vector units to maximize performance. However, as difficult and time-consuming as these changes are, they're generally well-understood optimizations. Application developers apply similar techniques with modest two-to-four-way threading or even the familiar MPI-everywhere paradigm. BlueGene/Q's design, with uniform memory access within a 16-core node, further simplifies the porting process. LLNL applications were largely able to scale to all 1.6 million cores of Sequoia and are building on those capabilities for Sierra.
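To illustrate the kind of O(N) data structure that stops scaling, the sketch below contrasts a per-rank table with one slot for every MPI process against a table that stores only the ranks a process actually communicates with. It is a minimal sketch under assumptions: the type names are hypothetical, no MPI calls are made, and it doesn't represent any specific LLNL code.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Dense layout: one slot per MPI rank, so memory on every rank grows as O(N).
struct DenseMessageCounts {
    explicit DenseMessageCounts(int num_ranks)
        : counts(static_cast<std::size_t>(num_ranks), 0) {}
    void record(int rank, int bytes) {
        assert(rank >= 0 && static_cast<std::size_t>(rank) < counts.size());
        counts[static_cast<std::size_t>(rank)] += bytes;
    }
    std::size_t slots() const { return counts.size(); }
    std::vector<long> counts;
};

// Sparse layout: stores only the ranks this process actually talks to,
// so memory grows with the number of neighbors rather than with N.
struct SparseMessageCounts {
    void record(int rank, int bytes) { counts[rank] += bytes; }
    long bytes_for(int rank) const {
        auto it = counts.find(rank);
        return it == counts.end() ? 0 : it->second;
    }
    std::size_t slots() const { return counts.size(); }
    std::unordered_map<int, long> counts;
};
```

With a million ranks, the dense table allocates a million slots on every process even if only a handful are ever used, whereas the sparse table holds one entry per neighbor.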
Following Sequoia, ACES fielded the Trinity system at LANL during 2015 and 2016. It includes standard Intel Xeon nodes in one partition and Xeon Phi Knights Landing (KNL) nodes in another. The partitions have similar node counts, but the KNL partition offers much higher aggregate performance potential. The KNL CPU requires modest shared-memory programming (threading) for memory-intensive applications that can't squeeze into an MPI-everywhere mode. Its wide 512-bit AVX vector units put the onus on applications and compilers to maximize vectorization, an art perfected in the NNSA labs during the era of Cray machines but largely lost in the intervening years. Trinity includes a novel burst buffer 2 for optimizing the I/O bottleneck and features that support increased power-aware computing. However, the most novel aspect of porting applications to the KNL nodes is the addition of on-package high-bandwidth memory alongside standard higher capacity DDR4 DRAM. Applications with significant memory capacity requirements (such as those common to the ASC mission) must manage memory placement and movement within the deepening memory hierarchy.
ACES deployed the Trinity CoE to address these new challenges about one year before the Sierra CoE was created. The Trinity and Sierra CoEs have evolved side by side as each identifies effective mechanisms to bring detailed and early knowledge of the architectures to our application developers. In addition, many of the applications involved in these two NNSA-based CoEs participate in both activities to help ensure the goal of performance portability by developing application techniques that don't assume a specific architectural or vendor solution but still achieve high performance.
GPU-Based Systems at LLNL: Why Now?
With the Sierra acquisition, the complexity for our application programmers takes another sharp turn by introducing heterogeneous computing via GPUs. While not entirely new to the NNSA laboratories, which had deployed the heterogeneous RoadRunner 3 system with a combination of x64, PowerPC, and Cell processors as the first petaflop system in the late 2000s, Sierra is the first heterogeneous NNSA AT system that faces the production computing requirements that we discussed earlier.
Unlike prior generations of Nvidia-based heterogeneous systems, such as Titan at ORNL, the Sierra architecture (and ORNL's upcoming Summit system, which is part of the same procurement) addresses some of the complexities of heterogeneous computing through NVLINK technology developed in partnership between IBM and Nvidia. Previously, GPU-based systems used a PCIe bus to connect the CPU and GPU, and required the application developer to manage data transfers between CPU and GPU memory explicitly using programming models that aren't widely available on other platforms, such as CUDA or OpenACC. Sierra's nodes will provide higher bandwidth data transfers and the appearance of a unified memory space. While CUDA has long supported the unified memory concept in software, Sierra's hardware solution will automatically map memory pages into and out of the high-bandwidth GPU memory. Even if performance requires explicit data management within the node's memory hierarchy, the problem becomes the same as that faced with manycore systems such as Trinity. Further, the emergence of OpenMP 4.x offers a performance portable solution that supports application restructuring once for both GPUs and fine-grained threading.
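As a rough illustration of restructuring a loop once for both GPUs and fine-grained threading, the sketch below uses the OpenMP 4.x `target teams distribute parallel for` combined construct with explicit `map` clauses. The function name `axpy` and the kernel are illustrative assumptions rather than code from any LLNL application.

```cpp
#include <cassert>
#include <vector>

// y[i] = a * x[i] + y[i], written once for both GPU offload and host execution.
// With an OpenMP 4.x offload compiler the loop can run on the device; without
// OpenMP support the pragma is ignored and the loop runs serially on the host.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const double* xp = x.data();
    double* yp = y.data();

    // map(to:) copies x to the device; map(tofrom:) copies y there and back.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The `map` clauses are the explicit data management that the text describes. Whether they remain necessary for performance on a node that presents a unified memory space is a tuning question this sketch doesn't answer.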
In addition to heterogeneity in both compute and memory discussed earlier, the Sierra system will include a large solid-state drive (SSD) on each node. Initially, this SSD will be used as a burst buffer to help alleviate the I/O bottleneck: checkpoint data is offloaded to intermediate local storage much faster than it could be written to a traditional parallel file system, and all or selected subsets of the data are then moved to the parallel file system asynchronously from the main application. The effective use of burst buffers, both on Trinity and Sierra, is another challenge for applications. LLNL's Scalable Checkpoint-Restart (SCR) library 4 will provide a portable layer that intercepts standard I/O calls and transparently diverts them to the burst buffer, while providing additional support for resilience by replicating checkpoint data to guard against the failure of a single MPI task.
The combination of these features with the apparent inevitability of heterogeneous computing in the upcoming exascale era convinced LLNL and the ASC program to make the leap into GPU computing. Thus, we're undertaking a multiyear effort to modify our applications to support this architecture. While we've already tackled the challenges of extreme MPI process counts, and the Trinity CoE is actively addressing threading, vectorization, and multilevel memory spaces, Sierra's heterogeneity presents a significant new challenge: exploiting the massive parallelism now available on a node, which isn't conducive to an MPI-everywhere approach. Thus, we formed the Sierra CoE in close conjunction with our partners at IBM and Nvidia as a focal point of our porting and optimization efforts.
The Sierra CoE at LLNL
We began to prepare ASC applications within the Sierra CoE immediately following the selection of the IBM/Nvidia platform. Thus, we had approximately four years before the system would be available for the classified ASC mission. However, with no firm playbook for how best to tackle the challenges of standing up a CoE, we took a measured and systematic approach, while also remaining flexible and adaptive in its execution.
What Is a CoE?
The term Center of Excellence is used throughout several industry sectors to define a focused effort around a challenging problem. In the context of advanced computing within DOE, it captures an organized effort that includes training, sharing of best practices, pursuit of impactful research, and, most importantly, refactoring of our large applications in preparation for new, complex platforms that must be used effectively immediately once they're accepted, or even before.
As a key component, the Sierra CoE includes experts from IBM and Nvidia in the effective use of their systems. When successful, this shared fate model allows laboratory staff with deep expertise in algorithms and applications to leverage deep vendor partnerships and their extensive knowledge of the hardware and programming environments. Conversely, the vendors obtain early insight into how their systems will be used, years before general sales. With their access to the hardware and system software engineers in their organizations, vendors acquire knowledge that can impact the quality of the system and its software, and inform future roadmaps.
The ASC CoEs have become a foundational tenet for the broader ASC co-design strategy. 5 The primary goal of co-design is to couple application, software, and hardware development to optimize utilization of HPC resources, and CoEs are natural outlets in which these activities can flourish. Co-design centers, such as those funded through the DOE Exascale Computing Project, are generally aimed at influencing hardware and software that hasn't yet been finalized in vendor roadmaps, taking a broader abstract view when possible, whereas CoEs are defined in the context of a single vendor offering that's relatively well defined. That said, the long lead times of major DOE procurements do allow for co-design between the labs and the vendors, particularly in the development of software.
The Need for Performance Portability
While CoEs focus application modernization on a single upcoming platform, NNSA applications must be able to run on, at a minimum, all NNSA platforms, including AT and CT systems. While AT systems are sited at LLNL or LANL, time on the systems is shared equally across NNSA laboratories via secure long-distance networking. Similarly, other DOE applications must run on a wide range of platforms, such as those at ORNL, ANL, and NERSC at Lawrence Berkeley National Laboratory (LBNL). Thus, performance portability is an important application requirement.
Performance portability requires that optimizations for a particular platform do not preclude the application's use on other platforms. Ideally, optimizations will provide similar performance benefits across the range of systems. In April 2016, six DOE laboratories and four participating vendors in upcoming procurements (IBM/Nvidia and Intel/Cray) met on this topic (https://asc.llnl.gov/DOE-COE-Mtg-2016). Current and planned approaches to achieve performance portability were presented, along with emerging standards, tools, and programming models. More than 100 DOE experts participated, and similar meetings are planned in 2017. This article presents several of LLNL's directions that emerged from that meeting.
Dealing with a Secure Environment
Sierra will ultimately be deployed in a classified environment to run applications that are accessible only within it. Thus, we required the Sierra CoE to be more than a "virtual center" and, instead, physically collocate vendor participants with our staff working in that environment. IBM and Nvidia staff are now physically located at LLNL, where they have access to the same environment as our application developers. While some work can be performed in less restrictive environments and thus include a broader set of expertise, full impact of those efforts requires that a range of CoE participants can transfer their lessons to the secure environment.
Aiming toward the Unknown
Several challenges that immediately confronted the Sierra CoE defined its early efforts. First, our application teams had little experience with GPUs. Second, existing systems, even with GPUs, significantly differed from the Sierra architecture in key aspects. In particular, support for OpenMP 4.0, which was released only shortly before our system selection, wasn't yet available. While the CORAL procurement includes work to develop these compilers, and interaction on them has been a success of the Sierra CoE, we couldn't immediately use the programming models that are our ultimate target.
The CoE addressed the first issue during its first six to nine months through training focused on programming GPU-based heterogeneous systems. Several deep-dive workshops were held at LLNL on topics such as hardware details, geared to the perspective of domain scientists. Other deep dives focused on algorithmic areas such as deterministic transport or hydrodynamics so CoE participants could explore common approaches across applications.
Jointly with IBM, Nvidia, and the ORNL Summit CoE, we developed an internal white paper, entitled "Programming Strategies for Sierra, Summit, and Beyond." This white paper contains significant details of future systems that remain under a nondisclosure agreement (NDA), thus we haven't yet made it public. However, we touch on many of its key themes and expect that once Sierra becomes generally available and restrictions are lifted, the paper will help guide the application modernization strategies of other application teams and centers.
After the early training phase, we focused on the second challenge: how to proceed without some of the key tools. This next phase established our long-term programming strategy, largely built around the RAJA (http://github.com/LLNL/RAJA) abstraction layer for C++ and OpenMP 4.x. We often began with smaller proxy applications that attempt to capture the essence of core algorithms and programming styles. Thus, we could use existing tools on available hardware such as x86-based clusters with GPUs attached via PCIe.
Finally, we've recently begun to transition our large multiphysics applications, the ultimate goal of the CoE. The timing coordinates this key activity with the emergence of beta compilers and hardware, which we discuss next, minimizing the repeated need to rewrite applications that must simultaneously retain their production-use status on Sequoia and other current platforms.
The phased strategy outlined in Figure 1 represents an overarching focus for the CoE, which is that different application teams engaged in CoE activities at different time frames based on their readiness level and complexity. Over time, a strategy has emerged that allows applications to use the tools and hardware on hand to advance incrementally toward the end goal of a performance-portable solution. At a high level, Figure 2 illustrates the following approach that the application teams have generally taken:
■ Understand how the code could be written using established GPU programming models (typically CUDA). This step establishes a baseline for target performance and informs data structure and algorithmic changes that improve GPU utilization.
■ Implement the code using a portable approach such as RAJA, OpenMP 4.x, or a domain-specific abstraction for that application. This step provides an understanding of the performance implications of using the portable approach relative to the baseline.
■ Finally, test and generically optimize performance on the alternative platforms such as standard CPUs and manycore architectures such as Xeon Phi. This final phase obtains a truly performance-portable solution.
For the dozens of applications involved in the Sierra CoE, the approach has been highly customized, but the above framework has served as the basic skeleton that many have followed.
Planning Activities: Strategic Roadmap and Tactical Periods
More than a dozen ASC application teams participate in the Sierra CoE. However, we couldn't provide simultaneous in-depth short-term expertise to all of them. At the start of the CoE, we created a roadmap for when the CoE would engage each team. The timings reflected our estimates of when key technologies, including beta compilers and early access systems, would be available. We also positioned applications based on their complexity, size, and ability to use existing technologies prior to availability of OpenMP 4.x compilers or early access systems. The roadmap has largely remained unchanged.
The Sierra CoE is a funded collaboration between LLNL, IBM, and Nvidia. To facilitate the contractual aspects, we define a work plan every six months for the next half-year. The plan defines milestones with written reports that IBM and Nvidia deliver to document the efforts for that period and to provide lessons to others who might not be directly involved in them. We find that six months is an appropriate time frame for periodically prioritizing activities since it's short enough to support realistic goals but long enough that projects with momentum can make significant progress without disruptions for documentation and planning.
Advanced Architecture and Portability Specialists
LLNL has embraced the CoE model of close collaboration between architecture and optimization experts and our domain scientists on application teams beyond vendor participation. Thus, we formed a new team of internal experts, our Advanced Architecture and Portability Specialists (AAPS) team, designed to bridge the mission-oriented world of production codes with the plethora of technology emerging from research pipelines. The AAPS team comprises LLNL staff who have expertise in all areas of application development, hardware trends, and emerging programming models, skills often derived from their participation with vendors in initiatives such as FastForward and DesignForward (www.exascaleinitiative.org) and AT procurements, as well as their participation in the broader research community helping to identify promising trends for exascale computing.
Similarly to short-term vendor engagements with application teams, we assign AAPS staff for short- to medium-term periods when an application team begins to transition the code base. Unlike application team members, the AAPS team is unencumbered by day-to-day production development demands. Thus, they can remain focused on the long view of experimenting with new programming models, compilers, and performance analysis tools.
The AAPS team works closely with existing application developers to transfer knowledge and to identify common needs and lessons learned across our application teams. The AAPS team holds weekly meetings to share progress. The meetings often include interested observers and informal work-in-progress talks. This highly effective model has built a broad knowledge base of the challenges of Sierra and exascale computing in general within LLNL.
Hack-a-Thons
The CORAL procurement is funding development of IBM's proprietary XL compilers and an open source LLVM implementation that will support OpenMP 4.x on Sierra. To provide application developers with early OpenMP 4.x training and to guide priorities of that compiler development, the IBM compiler teams and the Sierra and Summit CoEs held a series of OpenMP hack-a-thons. Multiple application teams brought small applications or proxies and worked directly with compiler experts for multiple days. These sessions have guided OpenMP coding strategies and have identified key refinements and extensions for that language.
When problems arise, as invariably occurs with beta software such as the OpenMP compilers, fixes or workarounds can be implemented immediately. This highly focused environment, separate from day-to-day distractions of all participants, fosters rapid discovery, learning, and implementation unlike any other CoE mechanism. A well-run hack-a-thon takes weeks of preparation and creates significant follow-up work for all participants. Thus, hack-a-thons must coincide with the availability of sufficient advances. Other CoEs, as well as the DOE FastForward and DesignForward programs, have also successfully used the hack-a-thon concept, which is rapidly becoming a co-design best practice.
Early Access Systems
AT systems are typically early instances of new architectures, frequently serial number one systems. Preparing applications for them involves a tension between the deployment of cutting-edge technology and the mission need to use the systems effectively as soon as they're accepted. Early access to similar technology is a key mechanism to ease this tension. For Sierra, we immediately procured additional systems with GPUs, including a Cray system that provided access to the earliest GPU-enabled OpenMP 4.x compiler. More importantly, in late 2016, we accepted multiple IBM/Nvidia early access (EA) systems.
The IBM/Nvidia systems consist of Power8+ CPUs and Nvidia Pascal GPUs, with a Mellanox EDR InfiniBand network connecting these compute nodes. This hardware provides a direct precursor of Sierra's hardware, which will include Power9 CPUs and Nvidia Volta GPUs (the next generation of each corresponding component) connected by the next generation of Mellanox EDR InfiniBand switches and network adapters. These EA systems provide the Sierra CoE with critical resources to experiment with key technologies, including the first access to NVLINK technology. Overall, these environments match the eventual Sierra environment to an essential degree that can't be matched with other available systems.
Expanding beyond the ASC Mission: The Institutional CoE
We originally limited the Sierra CoE focus to the specific needs of the ASC program. However, that focus failed to include many other LLNL applications. Although Sierra will transition to the classified computing environment soon after acceptance for exclusive use by the ASC program, we've historically used institutional funds to purchase smaller versions of our AT systems for use in our unclassified environment. This procurement leverages the same contract as the ASC systems to provide a capable open system for research and development of both unclassified libraries and open science applications that form the basis for many important LLNL programs. For example, the Vulcan BlueGene/Q system is a 25 percent clone of Sequoia that provides 5 Pflops of capability for non-ASC programs and ASC collaborators such as our Academic Alliance Program.
Based on this precedent, those applications face the same paradigm shift as ASC applications, and thus also need a multiyear application modernization effort. We responded to this need with an Institutional CoE (iCoE). A cross section of managers from all major LLNL programs selected applications for iCoE participation. Application areas include seismic modeling, material science and basic physics, engineering, the National Ignition Facility (NIF), and open source LLNL libraries used by many applications at LLNL and around the world.
Of particular interest to IBM and LLNL is the exploration of data science and machine learning in the context of the iCoE. IBM's HPC roadmap has a data-centric approach that dovetails with its strategy for traditional HPC needs epitomized by the ASC program. The iCoE provides an ideal mechanism to explore this emerging field in depth. LLNL and the ASC program both expect simulation increasingly to involve techniques beyond traditional HPC, such as machine learning, to guide simulations to be more robust, predictive, and efficient. Thus, planned LLNL/ASC initiatives around cognitive computing will heavily leverage iCoE activities to implement extreme-scale neural networks. For these reasons, our initiatives are exploring neuromorphic computing as possible accelerators in the exascale era.
Performance Portability Strategy: RAJA and OpenMP4 As we discussed earlier, LLNL application efforts must target performance portability. Many CoE efforts focus on porting existing C++ applications to use RAJA and on the continued development of OpenMP standards to support accelerators in versions 4.5 and beyond.
The RAJA Portability Layer
RAJA uses C++ abstractions (primarily template programming and lambda functions) to hide execution details of loops or kernels behind a simplified interface. RAJA's goal is to provide applications with a mechanism to use these abstractions with minimal disruption to the original coding style. RAJA includes a set of header files for compile-time choice of a backend programming model, and a small library of routines to optimize key operations on different hardware. However, RAJA entails a programming idiom more than a specific library.
Application teams from the outset have customized RAJA to match their application's look and feel. A community of RAJA users at LLNL and other institutions contribute to the base of knowledge and source code. They often take RAJA in new directions that they incorporate into the code base. The philosophy of keeping RAJA simple has allowed it to flourish, as application developers can understand its internal workings sufficiently while also relying on others to optimize and extend its functionality.
RAJA constructs are usually introduced in applications at the level with the most unexploited latent parallelism: mainly loops over mesh entities such as elements/zones, nodes, particles, or anywhere else that we might introduce fine-grained threading or a GPU kernel. While a large application can include many thousands of loops, they typically have only tens of loop patterns. Once abstractions of those patterns are identified throughout the application, their optimization greatly simplifies optimization of the overall application.
RAJA fundamentally separates the concepts of the loop body and its traversal method. A traditional C++ for loop fixes the iteration pattern: indices are visited in order between the loop control bounds. RAJA instead introduces a forall statement, a traversal template that takes an execution policy as a template parameter. The loop body is represented as a C++ lambda function. The execution policy defines a platform-specific approach to parallelization of the loop kernel, and can target CUDA, OpenMP, OpenACC, and other threaded programming models. The template description captures these details in a single header file. The concept of index sets supports customization of the order in which data is traversed. This powerful abstraction simplifies exploration of optimizations with minimal changes to the original code.
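To make the separation of loop body and traversal concrete, the following is a minimal, self-contained sketch of the idiom in plain C++. It is not the RAJA API: the namespace `forall_sketch` and the names `seq_exec`, `omp_exec`, `RangeSegment`, `IndexSet`, `forall`, and `scale` are all hypothetical and chosen only for illustration. The policy is a compile-time tag, the body is a lambda, and the index set decides the order in which indices are visited.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: every name in this namespace is hypothetical, not RAJA's API.
namespace forall_sketch {

// Execution policies are empty tag types selected at compile time.
struct seq_exec {};  // sequential traversal
struct omp_exec {};  // OpenMP threads; runs serially if OpenMP is not enabled

// A contiguous range of indices [begin, end).
struct RangeSegment { std::size_t begin, end; };

// An index set is an ordered list of segments. Reordering or splitting the
// segments changes the traversal without touching any loop body.
struct IndexSet { std::vector<RangeSegment> segments; };

template <typename Body>
void forall_impl(seq_exec, const IndexSet& is, Body&& body) {
  for (const RangeSegment& s : is.segments)
    for (std::size_t i = s.begin; i < s.end; ++i) body(i);
}

template <typename Body>
void forall_impl(omp_exec, const IndexSet& is, Body&& body) {
  for (const RangeSegment& s : is.segments) {
    const long b = static_cast<long>(s.begin);
    const long e = static_cast<long>(s.end);
    #pragma omp parallel for
    for (long i = b; i < e; ++i) body(static_cast<std::size_t>(i));
  }
}

// Traversal template: the policy is a template parameter, the body a lambda.
template <typename Policy, typename Body>
void forall(const IndexSet& is, Body&& body) {
  forall_impl(Policy{}, is, std::forward<Body>(body));
}

// Example kernel: scale the indexed entries of x by a.
// Only the Policy argument changes when retargeting the loop.
template <typename Policy>
void scale(const IndexSet& is, std::vector<double>& x, double a) {
  for (const RangeSegment& s : is.segments) assert(s.end <= x.size());
  double* xp = x.data();
  forall<Policy>(is, [=](std::size_t i) { xp[i] *= a; });
}

}  // namespace forall_sketch
```

Calling `forall_sketch::scale<forall_sketch::seq_exec>(is, x, 2.0)` and then switching the template argument to `omp_exec` changes how the loop runs without editing the lambda. In this sketch the backends are selected by tag dispatch on the policy type, which is one simple way to express the compile-time choice described above.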
Listing 1 shows a simple C-style for loop that performs a linear algebra operation and one reduction; Listing 2 shows the corresponding RAJA loop. The loops exhibit several key differences:
■ a call to a traversal template method (RAJA::forall) with the loop execution policy as a parameter replaces the for-loop construct;
■ the loop body is passed to the traversal template as a C++ lambda function;
■ the reduction variables become RAJA template objects with a reduction policy and type; and
■ the loop body remains largely unchanged other than using a RAJA reduction operator in certain cases (such as min).
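The differences listed above can be illustrated with a small stand-alone sketch that mimics the idiom without using RAJA itself. Everything in the namespace `reduce_sketch` is hypothetical (`seq_exec`, `seq_reduce`, `ReduceMin`, `forall`, and the two `daxpy_min_*` functions), and only a sequential backend is shown; a threaded or GPU backend would need per-thread partial results, which this sketch omits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Illustrative only: every name in this namespace is hypothetical, not RAJA's API.
namespace reduce_sketch {

struct seq_exec {};    // execution policy tag (sequential)
struct seq_reduce {};  // reduction policy tag (sequential)

// Reduction object parameterized by a policy and a type. Copies created when a
// lambda captures it by value keep pointing at the original's storage, so
// updates made inside the loop body are visible after the loop.
template <typename ReducePolicy, typename T>
class ReduceMin {
 public:
  explicit ReduceMin(T init) : value_(init), target_(&value_) {}
  ReduceMin(const ReduceMin& other)
      : value_(other.value_), target_(other.target_) {}
  ReduceMin& operator=(const ReduceMin&) = delete;
  void min(T candidate) const { *target_ = std::min(*target_, candidate); }
  T get() const { return *target_; }

 private:
  T value_;
  T* target_;
};

// Traversal template; only the sequential policy is implemented in this sketch.
template <typename ExecPolicy, typename Body>
void forall(std::size_t begin, std::size_t end, Body&& body) {
  for (std::size_t i = begin; i < end; ++i) body(i);
}

// C-style version: y = a*x + y, plus the minimum of the updated y.
inline double daxpy_min_c(std::size_t n, double a, const double* x, double* y) {
  assert(n == 0 || (x != nullptr && y != nullptr));
  double ymin = std::numeric_limits<double>::max();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] += a * x[i];
    if (y[i] < ymin) ymin = y[i];
  }
  return ymin;
}

// forall-style version: same body, but traversal and reduction are abstracted.
inline double daxpy_min_forall(std::size_t n, double a, const double* x,
                               double* y) {
  assert(n == 0 || (x != nullptr && y != nullptr));
  ReduceMin<seq_reduce, double> ymin(std::numeric_limits<double>::max());
  forall<seq_exec>(std::size_t{0}, n, [=](std::size_t i) {
    y[i] += a * x[i];
    ymin.min(y[i]);  // reduction operator replaces the explicit comparison
  });
  return ymin.get();
}

}  // namespace reduce_sketch
```

Both functions produce the same updated array and the same minimum. The only textual changes in the second version are the traversal call, the lambda wrapper, and the reduction object.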
Additional details of RAJA are beyond this article's scope. Visit the RAJA GitHub site at http://github.com/LLNL/RAJA for links to the latest documentation and technical reports.
OpenMP 4.x and Beyond
While RAJA provides a philosophy to abstract node-level parallelization in C++, the current abstraction can limit its use to relatively small code sections, typically just to iterative loops. Further, it doesn't directly address parallelization of applications written in Fortran (or, to a lesser extent, C). Thus, we still require a programming language that can parallelize larger regions and, in particular, offload them as GPU kernels that also meet our performance portability goals. When we selected the Sierra architecture, we considered OpenMP 4.0 and its expected subsequent refinements to be the best candidate, as recent work has confirmed. 6 We did have a significant concern with the approach of using OpenMP. Specifically, as we've already mentioned, its device constructs were only recently added, and compilers with support for GPUs weren't yet broadly available. While we addressed the specific concern of compilers for Sierra through CORAL funding, this lack also meant that OpenMP hadn't yet been broadly used with GPUs and that the model would likely require refinements to meet our needs. Fortunately, OpenMP is an open organization that's committed to providing a usable and ubiquitous specification. Key personnel from LLNL and IBM have influential roles in that organization, so we have high confidence that we can address this concern.
Our confidence is already proving to be well founded. As we discussed earlier, hack-a-thon sessions identified key OpenMP refinements and extensions. 7 Our close working relationship with the IBM compiler team provides an ideal environment that not only helps identify issues but also helps prototype solutions, which simplifies their adoption. In fact, the OpenMP Language Committee has already adopted several of our refinements and extensions for OpenMP 5.0. Specifically, several changes simplify the use of device constructs, for example, by adding implicit declare target constructs. Further, critical issues with using those constructs with C++ have also been addressed. We expect OpenMP 5.0 to include additional solutions identified and tested, in part, through the Sierra CoE. Specific topics that we're exploring include deep copy for C++ objects, improved support for device memory, and even mechanisms to make the RAJA philosophy directly available in Fortran and C.
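As a rough illustration of the device constructs discussed here, the sketch below offloads a simple loop with OpenMP 4.x directives. The function names `axpy_term` and `offload_axpy` are hypothetical. The explicit `declare target` marks a function as callable from the offloaded region, which is the kind of annotation that the implicit declare target refinement mentioned above aims to reduce. If the compiler has no OpenMP or offload support, the pragmas are ignored and the loop runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper that must be callable from device code, so OpenMP 4.x
// requires it to be declared explicitly for the target.
#pragma omp declare target
inline double axpy_term(double a, double x, double y) { return a * x + y; }
#pragma omp end declare target

// Hypothetical kernel: y = a*x + y, offloaded as one target region.
// map(to:) copies x to the device; map(tofrom:) copies y in and back out.
inline void offload_axpy(double a, const std::vector<double>& x,
                         std::vector<double>& y) {
  assert(x.size() == y.size());
  const long n = static_cast<long>(y.size());
  const double* xp = x.data();
  double* yp = y.data();
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (long i = 0; i < n; ++i) {
    yp[i] = axpy_term(a, xp[i], yp[i]);
  }
}
```

Raw pointers are mapped here rather than the `std::vector` objects themselves, which reflects the difficulty of moving C++ objects to a device that the deep copy work mentioned above is meant to address.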
As we plan and site first-of-a-kind systems, we face the difficult goal of effectively using them as soon as they're available. Our long experience with fielding unique systems has led us to employ tight collaboration between our application teams, our system specialists, and our vendor partners. While this strategy isn't completely new (we had weekly meetings between similar groups for BlueGene 8), we've refined the strategy in the Sierra CoE to be even more focused on the necessary step of application modernization through the involvement of laboratory and vendor personnel working side by side on the applications of interest toward common goals.
Our refinements are necessary. The goal has become yet more difficult as we now not only require that our applications run well on systems immediately but also that the effort required to achieve that goal should provide similar benefits for future systems. HPC application teams have long desired performance portability, but it has proven elusive as systems evolve rapidly. Nonetheless, we expect that our strategy of close, even embedded, collaboration with architecture experts combined with the RAJA philosophy and industry standards such as OpenMP will place us much closer to achieving it.
As the lessons learned from our work in the various DOE CoEs manifest themselves in the successful deployment of applications on Sierra and other pre-exascale systems, we will share those technical lessons with the broader community in the form of optimized software and libraries, IBM and Nvidia documentation, and publications and public outlets such as the DOE Performance Portability Workshops.
