This report describes trans-organizational efforts to investigate the impact of chip multiprocessors (CMPs) on the performance of important Sandia application codes. The impact of CMPs on the performance and applicability of Sandia's system software was also investigated. The goal of the investigation was to make algorithmic and architectural recommendations for next generation platform acquisitions.
Evaluation of the Impact Chip Multiprocessors have on SNL Application Performance

Executive Summary
Sandia has met the requirements of the ASC Level 2 Chip Multi-Processors Milestone (Milestone #3158). Investigations have had a major impact on 1) Red Storm upgrade to dual-core processors, 2) Red Storm upgrade to quad-core processors, 3) improved performance and scalability of the TLCC platform, 4) the acquisition and deployment of Red Sky, and 5) the development of the technical requirements for the NNSA Zia Capability Computing Platform.
This milestone was the result of a multi-disciplinary effort with contributions from CSSE System Software and Tools, CSSE Advanced Systems, and the CSRF and LDRD programs. Staff included members of three Centers: 1400, 9200 and 1500.
Customers of this Milestone are NNSA/ASC HQ, CSSE program managers and platform design team members, IC program managers, and application developers.
Milestone Objective Success Criteria
The focal points in the Milestone were to investigate the impact of chip multi-processors (multi-cores) on the performance of important SNL application codes and the impact on the performance and applicability of SNL's system software. In addition, algorithmic and architectural recommendations were to be made for next generation platform acquisitions. Results were to be formally presented in a program review and documented in a report. All of these goals have been met.
In addition to an impact on platforms, this milestone made contributions to and had collaborations with the ASC algorithm and application groups. These collaborations led to development of the Mantevo mini-applications, multi-core capabilities in the Trilinos solver libraries, and performance improvements to the ALEGRA code.
CMP research and investigations associated with this milestone led to numerous publications, journal articles, invited talks/presentations and seminars. The Catamount N-way with SMARTMAP technology, part of this milestone effort, won a prestigious R&D 100 award in 2009.
To meet the success criteria, SNL worked closely with the vendor community to understand their processor roadmaps and the technologies that are likely to be deployed in the ACES Zia capability initiative. SNL also worked closely with vendors to acquire early examples of technologies for evaluation. In addition, computer architects worked closely with application and algorithm developers, as well as with the Intel and Portland Group compiler vendors, to research and develop techniques and methodologies to extract optimal performance from multi-core processors.
Criteria were also developed and compared with benchmarking results, where applicable, to provide metrics for successful completion of this milestone. For Sandia's TLCC systems, these efforts had significant impact, a few examples of which are listed below:
• Performance analysis led to the recommendation to use MVAPICH instead of OpenMPI
• Processor and memory affinity analysis at scale led to deployment of a job-launch wrapper that uses numactl to do process and memory placement
• Identification of the impact of cache coherency overhead on multi-socket node performance
Within CSSE, the impact of multi-core spanned both hardware and software tasks. The system software team ensured that the Red Storm system software kept pace with the Red Storm dual-core and quad-core upgrades. In addition, interaction between the system software team and the performance analysis team was essential to understanding and quantifying the effect of processor and memory affinity on application runtime.
CSRF and LDRD projects led by Mike Heroux provided many of the algorithm contributions and were instrumental to the architecture and performance analysis teams. These projects provided mini-applications and consultations in understanding the impact of architectural features of multi-core at the application and algorithm level.
Future impact of this milestone will be evident in the next-generation Zia platform. For example, performance analyses of candidate processors have provided quantitative data for technology decisions. Also, the 6x application performance benchmarks for Zia are driving code teams to address higher levels of scalability and parallelism. Our research also identified mesh generation for large problems as a major application readiness issue for using Zia at scale. Resolutions to all of these issues will prove crucial for Zia and beyond.
Multi-core Algorithmic and Architectural Recommendations for Next Generation Platforms
• Quantitative analysis has validated the importance of "effective" memory bandwidth for the Tri-Labs applications, and continuing architecture evaluations are driving processor choices for Zia. Multi-core is increasing the gap between computation and memory performance (i.e. the memory wall).
• Processor and memory affinity control is essential for extracting maximum performance from multi-core architectures. As we make the transition from multicore to many-core (10's to 100's of cores interconnected at the silicon level by high-speed interconnects similar to today's HPC systems), this will become even more critical.
• The deployment of multi-core, and eventually many-core, in capability class platforms will drive much higher levels of parallelism for applications. The 6x application suite is forcing code teams to work these issues early and identify deficiencies, e.g., mesh generation for problem sizes necessary to fully utilize Zia at scale. Future machines will require at least two orders of magnitude more parallelism than is being deployed on current platforms. 
Slide 2: Milestone statement
Milestone Statement
• "SNL will investigate the impact of Chip MultiProcessors (CMP) on the performance of important SNL application codes and the impact of CMPs on the performance and applicability of SNL's system software. This investigation will make algorithmic and architectural recommendations for next generation platform acquisitions, which will be documented in a report"
• Completion Criteria: Program review and final document published as a SAND report.
This milestone is an FY09 ASC Level 2 for the CSSE subprogram. The multi-core work was encompassed by multiple projects and efforts within and external to CSSE.
Within CSSE, the impact of multi-core spanned both hardware and software tasks. The system software team ensured that the Red Storm system software kept pace with the Red Storm dual-core and quad-core upgrades. In addition, interaction between the system software team and the performance analysis team was essential in order to understand and quantify the effect of processor and memory affinity on application runtime.
Mike Heroux's CSRF and LDRD projects provided many of the algorithm contributions and were instrumental to the architecture and performance analysis teams. They supplied mini-applications and consultation on the impact of multi-core architectural features at the application and algorithm level.
Slide 4: Staff contributions
Staff Contributions
• 1422
Contributors spanned multiple 1400 organizations in addition to centers 9300 & 1500. The format of the review is to summarize the impact on platforms, applications, and external visibility as a result of the multi-core initiatives within CSSE. The amount of material that has resulted from this effort is much greater than can be communicated in the timeframe of this review, but highlights of some of the major contributions will be presented.
To demonstrate external visibility and community impact, a bibliography of publications, presentations and seminars is supplied as a separate appendix.
Slide 6: Impact on platforms
Impact on Platforms
• Red Storm
  - Dual-core and quad-core AMD Opteron performance analysis contributed to the decision for two Red Storm processor upgrades and a memory upgrade
  - Catamount N-way deployed on production Red Storm, with support for up to 8 cores per node (only 1, 2, or 4 cores/node were fully tested)
  - Quantified that the Red Storm quad-core upgrade provides a 38% throughput gain (over a suite of applications) at a cost of 8% over prior platform investment
  - SMARTMAP deployed on production Red Storm
  - Multi-core system software enhancements that allow full use of all cores and allow for user-selectable memory/core requirements or chip architecture
  - Automatically set the optimal page size for dual- and quad-core nodes, even when a single job is running on both types of nodes
Red Storm has gone through two major upgrades over its lifetime. These include two separate multi-core upgrades and numerous updates and revisions to the Catamount lightweight kernel and associated runtime.
Many of the system software upgrades allowed applications to take advantage of the multiple cores without any application changes. For example, in going from a dual-core to a quad-core processor, AMD changed the number of small and large page TLB entries. Because the optimal page size is set automatically based on processor type, the application developer or analyst does not have to know what page size to specify for best application performance for each run.
The suite of applications for the throughput improvement measurement included CTH, SAGE, LAMMPS, POP, and ALEGRA.
Catamount N-way with SMARTMAP technology won a prestigious R&D 100 award in 2009.
Early multi-core studies on CSSE test beds identified the need to ensure proper memory affinity for best performance with NUMA architectures. The need for proper affinity was made even more evident when analyzing application performance at scale on TLCC. The deployment of job launch wrapper scripts to force proper affinity on the TLCC platforms led to substantial performance gains and more predictable runtimes. The following slides demonstrate the effects of not using memory and processor affinity controls.
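As a rough illustration of what such a job-launch wrapper might compute, the Python sketch below builds a numactl command line that binds one MPI rank, and its memory allocations, to a single NUMA node. It is not the TLCC wrapper itself: the function name, the block mapping of local ranks to sockets, and the node shape in the usage note are assumptions made for illustration. Only the numactl options `--cpunodebind` and `--membind` are taken from the numactl tool.

```python
def numactl_command(local_rank, ranks_per_node, sockets_per_node, app_argv):
    """Build an argv list that pins one MPI rank to one NUMA node.

    Hypothetical helper: assumes ranks on a node are numbered
    0..ranks_per_node-1 and are assigned to sockets in contiguous blocks.
    """
    if ranks_per_node % sockets_per_node != 0:
        raise ValueError("ranks_per_node must be a multiple of sockets_per_node")
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank out of range")

    ranks_per_socket = ranks_per_node // sockets_per_node
    numa_node = local_rank // ranks_per_socket  # block mapping (assumption)

    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # run only on the CPUs of this NUMA node
        f"--membind={numa_node}",      # allocate memory only from this NUMA node
        *app_argv,
    ]


if __name__ == "__main__":
    # Example: local rank 5 on an assumed 4-socket, 16-rank node.
    print(" ".join(numactl_command(5, 16, 4, ["./app", "input.txt"])))
```

A real wrapper would obtain the local rank from the MPI launcher's environment and then replace itself with the resulting command, for example via `os.execvp`. Both details depend on the launcher and are left out here.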
The AMD Barcelona processor demonstrates reduced memory subsystem performance in quad-socket configurations. This is due to a poor cache coherency protocol design that has since been fixed in the AMD Istanbul processor. The effects of the reduced performance were quantified by the CSSE application performance analysis team.
TLCC Allreduce Analysis Impacts Production MPI Choices (Rajan presentation for FOUS Qtrly Review)
This chart shows the reduced Allreduce performance of OpenMPI. Note that the vertical axis is in log scale; at 1024 MPI ranks, OpenMPI is approaching an order-of-magnitude degradation relative to the MVAPICH results. These early tests by the CSSE performance analysis team led to the recommendation to use the MVAPICH MPI for production applications.
Slide 11: Scaling issues on TLCC
Scaling Issues on TLCC were identified to be related to processor and memory affinity (Rajan presentation for FOUS Qtrly Review)
The Mantevo HPCCG mini-application demonstrates the effects of improper memory and processor affinity control. Although this chart does not contain error bars, it has also been shown that the variation is dramatically reduced when using proper affinity. In this instance, at 1024 cores, using memory affinity controls reduces the runtime by ~40%, and the result follows a trend similar to the expected behavior demonstrated on the other platforms.
Using performance analysis tools, such as VAMPIR Trace, it is possible to get a visual representation of the effect of affinity control on runtime behavior. Note that this is only a 64 rank job! The effect is similar to load imbalance due to domain decomposition in parallel algorithms. Another analogy is an increase in system noise, which impacts some processors more than others, and hence reduces the per-iteration performance of all the participating processors to that of the slowest node.
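The "slowest node" effect described above can be illustrated with a small model. The Python sketch below assumes a bulk-synchronous application in which every iteration ends with a global synchronization, so each iteration takes as long as its slowest rank. The function names and all timing values are invented for illustration and are not measurements from TLCC.

```python
def iteration_time(per_rank_seconds):
    """Time for one synchronized iteration: every rank waits for the slowest."""
    return max(per_rank_seconds)


def total_runtime(per_iteration_rank_times):
    """Sum of per-iteration times for a bulk-synchronous run."""
    return sum(iteration_time(times) for times in per_iteration_rank_times)


if __name__ == "__main__":
    # Illustrative numbers only: 4 ranks, 3 iterations.
    with_affinity = [[1.0, 1.0, 1.0, 1.0]] * 3
    # Without affinity control, a different rank is slowed in each iteration
    # (for example by remote-memory accesses).
    without_affinity = [
        [1.5, 1.0, 1.0, 1.0],
        [1.0, 1.5, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.5],
    ]
    print(total_runtime(with_affinity))     # 3.0
    print(total_runtime(without_affinity))  # 4.5
```

In this toy case the average work per rank grows by only 12.5%, yet the runtime grows by 50%, because a single delayed rank stalls every other rank in each iteration.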
Impact on Platforms, Cont'd
• Red Sky
This is a summary of the processor performance study used for the Red Sky platform acquisition. CTH, phdMesh, HPCCG and LAMMPS were the applications used in the RFP, and hence the analysis focused on those applications as well. STREAM and GUPS are indicators of memory subsystem performance. The QPI/HT results for STREAM and GUPS quantify the performance of the memory subsystem when using the respective inter-socket interconnects. This analysis showed the significant performance potential of the Intel Nehalem processor. This study used an early evaluation workstation from Intel, and the higher CPU and memory subsystem clock rates of Red Sky were expected to provide even greater performance improvements.
This slide summarizes some early application results on Red Sky, relative to TLCC. These early results help to validate the architectural choices made during the Red Sky acquisition. Performance is 2.5x to 3.5x that of TLCC for job sizes up to 1024 MPI ranks.
It is expected that performance gains will be even larger at higher scales.
• Zia
  - Performance analysis of candidate processors will provide quantitative data for technology decisions
  - 6x application performance benchmarks are driving code teams to address higher levels of scalability and parallelism
Zia is the NNSA's next generation capability computing platform. A design goal of the Zia platform is to achieve a 6x to 8x performance improvement over the NNSA's Purple platform. Sandia and Los Alamos have formed an alliance, ACES, which has the responsibility of delivering and deploying the Zia platform. Achieving this level of performance improvement over Purple will require a multi-core architecture capable of scaling not only within a node, but also able to take full advantage of the high-speed interconnect and ensure that all processes (cores) within a node have sufficient memory bandwidth.
Current performance analysis efforts include quantifying the performance of Zia's 6x application suite on candidate processors for the Zia platform. This effort will evolve to a full system analysis that takes into account MPI performance and other inefficiencies that can affect platform scaling.
In this study, the performance of the Zia 6x applications is being measured on an Istanbul workstation using different memory subsystem clock rates. Regression analysis allows for a prediction of the performance of each application at a higher memory subsystem clock rate. The same analysis can be used to estimate the fraction of application time that is memory bound, and hence the fraction of time that is computationally bound. Knowing these fractions, it is possible to estimate Magny-Cours performance using a higher memory clock rate and a lower CPU clock rate.
• Zia 6x Application Suite
  - Driving Charon & Sierra/Presto code teams to address much higher levels of parallelism than they can currently support
  - Identified mesh generation for large problems as a major application readiness issue in order to use Zia at scale.
As was stated in an earlier slide, much of the system software effort is transparent to the applications and analysts, but has an impact in improved performance and quicker turnaround times.
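The regression approach described above can be sketched in a few lines of Python. The report does not state the model form, so this sketch assumes runtime splits into a compute part that is independent of memory clock and a memory part inversely proportional to it, T(f_mem) = t_cpu + k / f_mem, fitted by ordinary least squares. The function names, the assumption that compute time scales inversely with CPU clock, and the data in the usage example are all hypothetical.

```python
def fit_memory_model(mem_clocks, runtimes):
    """Least-squares fit of T = t_cpu + k / f_mem; returns (t_cpu, k)."""
    xs = [1.0 / f for f in mem_clocks]  # regress runtime on 1 / memory clock
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    k = sxy / sxx
    t_cpu = mean_y - k * mean_x
    return t_cpu, k


def memory_bound_fraction(t_cpu, k, mem_clock):
    """Fraction of runtime attributed to the memory subsystem at mem_clock."""
    t_mem = k / mem_clock
    return t_mem / (t_cpu + t_mem)


def predict_runtime(t_cpu, k, mem_clock, cpu_clock_ratio=1.0):
    """Predict runtime at a new memory clock and relative CPU clock.

    cpu_clock_ratio = new CPU clock / measured CPU clock; the compute part is
    assumed to scale inversely with CPU clock (an assumption of this sketch).
    """
    return t_cpu / cpu_clock_ratio + k / mem_clock


if __name__ == "__main__":
    # Synthetic data generated from t_cpu = 60 s, k = 32000 s*MHz.
    clocks_mhz = [533.0, 667.0, 800.0]
    times_s = [60.0 + 32000.0 / f for f in clocks_mhz]
    t_cpu, k = fit_memory_model(clocks_mhz, times_s)
    print(round(memory_bound_fraction(t_cpu, k, 800.0), 3))
    print(round(predict_runtime(t_cpu, k, 1000.0, cpu_clock_ratio=0.8), 1))
```

With measured runtimes in place of the synthetic data, the fitted intercept estimates the compute-bound time and the slope term the memory-bound time. The prediction step then trades one against the other for a processor with a faster memory subsystem and a slower core clock.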
There has been an excellent collaboration between the architecture, system software, algorithms and application teams in the effort to ensure a seamless transition to multicore for the applications and analysts. The primary strategy for the architecture and performance analysis teams has been to work closely with the Trilinos group, as Trilinos is leveraged by numerous applications internal and external to Sandia.
In addition, the ALEGRA team has been collaborating directly with the Portland Group (PGI) compiler developers. PGI has been profiling key ALEGRA kernels and identifying areas of improvement. This has been a two-way collaboration, in that performance issues and bugs with PGI's compiler are being fed back to their engineering team and the fixes incorporated in future releases.
In defining the Zia 6x application suite problem sets, and collecting baseline data for the Purple platform, it has become evident that a major limiting factor in transitioning codes to Zia will be the ability to generate appropriate meshes of a size that will fill the Zia platform.
External Visibility & Community Impact 
Current & Future Efforts
• General Purpose GPUs
  - impact on physics and engineering codes still unclear
  - higher levels of integration at the silicon level will change this
  - GPU architecture evolution will become more general purpose
• Massively multi-core general purpose CPUs
  - alternative cache and memory hierarchies
  - software controlled cache/fast memory
  - topology on a chip
  - integration of special purpose functions, e.g. GPU
  - programming models?
• Continue expansion of test beds and development platforms
This effort does not end with the completion of this milestone. In fact, it's most likely only a beginning. The most challenging transition will be to massively multi-core processors which incorporate finer levels of granularity on chip and begin to introduce many of the architectural issues associated with full scale machines of only a few years ago.
Slide 23: Summary
In Summary, multi-core findings are having an impact on current platforms
• Sandia has successfully made the transition to using multi-core technology for its primary platforms: Red Storm, the TLCC clusters and Red Sky
  - Red Storm quad-core is providing performance equivalent to the initial single-core configuration for a given job size, with a corresponding ¼ reduction in resources
  - TLCC performance at scale has been improved with the deployment and availability of affinity control mechanisms at job launch
  - Red Sky is being successfully deployed and is demonstrating 2.5x to 3.5x the performance of TLCC, which will provide a significant increase in productivity for capacity workloads
• This transition was made possible by significant contributions from the CSSE system software, architecture and performance analysis teams, in addition to leveraging the expertise of the algorithm and applications (IC) groups and the CSRF and LDRD projects
  - Catamount N-way with SMARTMAP has allowed current codes to maintain an MPI-everywhere programming model.
  - The architecture group worked closely with the algorithm and application groups to characterize multi-core architectural impacts on performance. These interactions were essential in allowing multi-core capabilities to be introduced into the Trilinos framework.
Slide 24: Multi-core algorithmic and architectural recommendations for NGS
Multi-core Algorithmic and Architectural Recommendations for Next Generation Platforms
• Even more important is memory locality. Processor and memory affinity control is essential for extracting maximum performance from multi-core architectures. As we make the transition from multi-core to many-core (10's to 100's of cores interconnected at the silicon level by high-speed interconnects similar to today's HPC systems) this will become even more critical. Affinity has an impact at scale, in addition to the node level. It is necessary to continue to research the effects of affinity on communications.
• The deployment of multi-core, and eventually many-core, in capability class platforms will drive much higher levels of parallelism for applications. The 6x application suite is forcing code teams to work these issues early and identify deficiencies, e.g., mesh generation for problem sizes necessary to fully utilize Zia at scale. The algorithm and application groups need to define methods to cope with finer levels of granularity. Future machines will require at least two orders of magnitude more parallelism than is being deployed on current platforms.
• The MPI-everywhere programming model is still effective, but it will most likely not be sufficient for many-core architectures. Two-level programming models will be necessary in order to extract performance from future many-core architectures.
Slide 29: Supporting slides
Supporting Slides
