CAM: Kernels from an atmospheric benchmark suite
MF24: Livermore Loops; 24 do-loops, each carrying out a different mathematical kernel
EPCC: OpenMP benchmark to evaluate the overhead of OpenMP
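As a rough illustration of what "overhead of OpenMP" means in a microbenchmark, one common approach is to time a loop that executes a construct many times, time the same loop without the construct, and divide the difference by the repetition count. The sketch below shows only that arithmetic. The function name, the repetition count, and all timings are hypothetical and are not taken from the EPCC sources or from the P5/P7 measurements.

```python
def overhead_per_construct(test_time_s, reference_time_s, inner_reps):
    """Return the time attributed to one execution of a construct, in seconds.

    test_time_s:      wall time of the loop that executes the construct
    reference_time_s: wall time of the same loop without the construct
    inner_reps:       number of times the construct ran inside the timed loop
    """
    if inner_reps <= 0:
        raise ValueError("inner_reps must be positive")
    return (test_time_s - reference_time_s) / inner_reps


# Hypothetical timings, for illustration only (not measured on P5 or P7).
# Keys are thread counts; values are (test time, reference time) in seconds.
hypothetical_runs = {
    2: (1.20, 1.00),
    4: (1.50, 1.00),
    8: (2.00, 1.00),
}

# Convert each overhead to microseconds per construct.
overheads_us = {
    threads: overhead_per_construct(test, ref, 100_000) * 1e6
    for threads, (test, ref) in hypothetical_runs.items()
}

for threads, us in overheads_us.items():
    print(f"{threads} threads: {us:.2f} microseconds per construct")
```

Comparing such per-construct figures across thread counts and across chips is what allows statements like "good speedup for two threads, poor performance for 4 and 8 threads".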
Eight tests were run with these benchmark codes comparing the POWER7 (P7) chip to both the POWER5 (P5) chip and the BlueGene (BG) architectures.
Test name / What was tested / Summary of results

Test name: P5_P7
What was tested: OpenMP overhead
Summary of results: For two threads, P7 generally showed good speedup; for 4 and 8 threads, however, P7 performed poorly.

Test name: P5_P7.CRYSTAL
What was tested: Effectiveness of MASSV on P7
Summary of results: Overall, P7 showed good speedup.

Test name: P5_P7.IRS
What was tested: Effectiveness of the 12 prefetch streams that P7 has
Summary of results: Overall, P7 showed good speedup.

Test name: P5_P7.UMT
What was tested:
• How well the compiler can unroll the loop
• Generating SIMD instructions *without* actually vectorizing the loops
• Compare C and FORTRAN
Summary of results:
• Overall, P7 showed good speedup.
• Note that this benchmark has been tuned for BG machines and the vector units on those platforms. Even so, with different chips and their different overheads, the P7 did well.
• Even without SIMD, the P7 speedup was greater than the clock speedup, which is extremely good. That implies the compiler is doing a great job, especially in C.
• Unrolling did not enhance SIMD on P7 as it does on BG/Q.

Test name: P7_BG.CAM
What was tested: Test loop optimization
Summary of results: Some of the tests showed better performance on P7 than on the BG machine, while some tests had slightly worse performance. No large discrepancies were noticed.

October 6, 2011

Test name: P7_BG.MF24
What was tested:
• SIMDization on P7
• Compare C and FORTRAN
Summary of results: Some of the tests showed better performance on P7 than on the BG machine, while some tests had slightly worse performance. No large discrepancies were noticed.

Test name: P5_P7.epcc
What was tested: Evaluate the overhead of the P7 with OpenMP
Summary of results:
• Most of the timing improvements for P7 are for 2 threads.
• Compared with P5, 2/4/8 threads show a significant slowdown for SINGLE, CRITICAL, LOCK/UNLOCK and ATOMIC (except the 2-thread timing for SINGLE).
• Among the various P7 versions of the compiler, the latest version is slightly better.

Test name: P5_p7.matrix
What was tested: Effectiveness of various hardware matrix performances
Summary of results:
• Hardware matrix performance tests using UMT.
• The P7 compiler is the first to deliver a wall time speedup better than the clock rate speedup of the chip. That translates to excellent compiler work.
• P7 executed significantly fewer instructions (half) and used 2/3 of the cycles compared to P5.
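The last row combines two figures: half the instructions and 2/3 of the cycles. A short calculation shows how those figures relate to instructions per cycle (IPC) and to the claim that wall time speedup exceeds clock rate speedup. This is a sketch under stated assumptions: wall time is modeled simply as cycles divided by clock frequency, the two ratios are the rounded values quoted in the table, and the clock ratio of 1.77 is an assumed placeholder for illustration, not a measured input. All function and variable names are hypothetical.

```python
from fractions import Fraction


def relative_ipc(instr_ratio, cycle_ratio):
    """IPC on P7 relative to P5, given P7/P5 ratios of instructions and cycles."""
    return instr_ratio / cycle_ratio


def wall_time_speedup(cycle_ratio, clock_ratio):
    """Wall time speedup of P7 over P5, modeling time as cycles / frequency.

    cycle_ratio: P7 cycles divided by P5 cycles
    clock_ratio: P7 clock frequency divided by P5 clock frequency
    """
    return clock_ratio / cycle_ratio


INSTR_RATIO = Fraction(1, 2)              # "half" the instructions (from the table)
CYCLE_RATIO = Fraction(2, 3)              # "2/3" of the cycles (from the table)
ASSUMED_CLOCK_RATIO = Fraction(177, 100)  # assumption, for illustration only

ipc = relative_ipc(INSTR_RATIO, CYCLE_RATIO)
speedup = wall_time_speedup(CYCLE_RATIO, ASSUMED_CLOCK_RATIO)

print(f"relative IPC: {ipc} ({float(ipc):.2f})")
print(f"wall time speedup: {float(speedup):.3f} "
      f"vs clock speedup: {float(ASSUMED_CLOCK_RATIO):.2f}")
```

Under this model, any cycle ratio below 1 makes the wall time speedup larger than the clock speedup, whatever the clock ratio is. The relative IPC of 3/4 shows that the gain comes from executing fewer instructions rather than from retiring more instructions per cycle.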
All parties to this endeavor agree that the evaluation met the original goal stated above. The collaboration proved fruitful in uncovering some shortcomings of the PERCS compiler and in showing its strengths, as stated in the results above. The monthly telecons between the invested parties were instrumental in keeping the project on task and in relaying information to IBM in a timely manner.
1. SIMDization on POWER7.
2. The effectiveness of MASSV on POWER7.
3. Various compiler optimization flags for code transformations.
4. The effectiveness of the 12 prefetch streams that P7 has.
5. Compare performance between the C and Fortran compilers for the same code.
Tasks, Milestones, Deliverables, Schedules:
1. We will use Power5 timing as a baseline as long as we have the machine on site.
2. We will use MF24 to test item 1.
3. We will use Crystal to test item 2.
4. All benchmarks will use various compilation flags for item 3.
5. We will use IRS to test item 4.
6. We will use UMT/MF24 to test item 5.
7. We will use UMT to test how well the compiler can unroll the loop and generate SIMD instructions *without* actually vectorizing the loop.
8. We will use an optimized UMT kernel from the BG/Q test to compare its performance against Power7.
9. As each part is completed, we will report our findings in a technical report. All sections will be complete by April 30, 2011.
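To make the unrolling in task 7 concrete, the sketch below writes out by hand the kind of transformation a compiler applies automatically. It uses a loop shaped like the Livermore hydro fragment kernel and is written in Python purely for illustration; the benchmarks themselves are C and Fortran, and the unroll factor of 4, the function names, and the input values are assumptions. It shows the transformation only and says nothing about the instructions the XL compilers actually emit.

```python
def hydro_rolled(n, q, r, t, y, z):
    """Reference loop: one element per iteration."""
    x = [0.0] * n
    for k in range(n):
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11])
    return x


def hydro_unrolled4(n, q, r, t, y, z):
    """Same computation, unrolled by 4 with a remainder loop."""
    x = [0.0] * n
    main_end = n - n % 4  # largest multiple of 4 not exceeding n
    for k in range(0, main_end, 4):
        # Four independent statements per iteration: less loop overhead,
        # and adjacent elements are exposed side by side.
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11])
        x[k + 1] = q + y[k + 1] * (r * z[k + 11] + t * z[k + 12])
        x[k + 2] = q + y[k + 2] * (r * z[k + 12] + t * z[k + 13])
        x[k + 3] = q + y[k + 3] * (r * z[k + 13] + t * z[k + 14])
    for k in range(main_end, n):
        # Remainder loop for the last n % 4 elements.
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11])
    return x


# Hypothetical inputs; z needs n + 11 elements because of the k + 11 access.
n = 10
y = [0.5 * k for k in range(n)]
z = [0.25 * k for k in range(n + 11)]
same = hydro_rolled(n, 1.5, 2.0, 3.0, y, z) == hydro_unrolled4(n, 1.5, 2.0, 3.0, y, z)
print("unrolled result matches rolled result:", same)
```

Because each unrolled statement performs the same operations in the same order as the original loop body, the two versions produce identical results; only the loop structure changes.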
Discussion:
The initial scope for this compiler collaboration is for C/C++/Fortran. We can discuss extending the scope later if needed.
Jeffrey mentioned that Dolores had been fostering the collaboration. She had ideas about getting compilers into the hands of users. Jeffrey also mentioned that Yaoqing Gao had given a nice presentation on some new compiler features at one of the MS meetings, and that he is interested in getting it out to the people using the compilers. Jeffrey suggested gathering some information from the DoD folks.
Bor Chan would like to know:
• How the P7 vector unit works, to help Bor understand BG/Q vs P7 behavior of microkernels.
Action items:
1. PERCS mission partners: Get access to a POWER7 system. Dolores should be working on getting Bor on M166. Jeffrey will send Joe Cross a message and cc Bor.
2. Bor also suggested that Jeff/Evi…etc should start working on the SOW (statement of work) for LLNL.
3. IBM will provide documents on IBM compilers and MS compiler presentations to help bring Bor and the DoD folks up to speed on compiler optimization so far.
4. There was no DoD representation in today's meeting due to a meeting conflict. Jeffrey suggested that we gather some feedback from them.
Telecon Meeting October 28, 2010
Agenda:
1. PERCS compiler collaboration SOW discussion (all)
2. Performance tuning tips on POWER7 using IBM XL compilers (Yaoqing Gao/IBM)
3. Initial kernel analysis and tuning (Bor Chan/LLNL)
Action items:
1. Provide POWER7 hardware information and performance counter information (to be done)
2. Provide OpenMP environment variable settings for Bor to tune OpenMP code (done)
3. Provide PERCS mission partners with the compiler presentation schedule at SC10 (to be finalized)
4. Bor will provide the source of UMT (done); the XL compiler team will investigate O3 vs. O5 performance and discuss it in the next meeting
5. Provide PERCS mission partners with the presentation slides on performance tuning on POWER7 using XL C/C++/Fortran compilers (done, the draft slides sent to Dolores)
6. Schedule the next meeting in the week before or after SC10 (to be finalized)
Telecon Meeting November 24, 2010
1. Bor presented a performance analysis from the hardware perspective, with the following summary:
• The project finished on time and within budget. We delivered more than we promised (OMP overhead analysis and a detailed analysis of the code generated by the compiler for UMT).
• For the Power series chips, the P7 compiler is the first to deliver a wall time speedup better than the clock rate speedup of the chip. That means excellent compiler work.
• Many thanks to IBM's compiler group, which provided in-depth information on the new compiler technologies and how to take advantage of them.
Results for UMT:
• Wall time speedup is better than the clock rate speedup.
• The compiler generates only half of the instructions and uses 2/3 of the cycles compared with P5.
• Cache misses are about the same as on P5, but DTLB misses are much fewer.
• Branch misprediction (taken/not taken) is better on P7 than on P5.
• Various units stall less on P7, except for stalls caused by D-cache misses.
2. The compiler collaboration has been very productive.

// O3
FFLAGS = -c -O3 -qhot -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qnosave -qfree=f90 -qsuffix=cpp=F90
CFLAGS = -c -O3 -qhot -qalias=allp -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qlanglvl=stdc99
LDFLAGS = -blpdata -qsmp=omp
// O5 replace -O3 with -O5

//***** P7 3.36GHz 1.77XP5
FFLAGS = -c -O3 -qhot -qsimd=auto -qhot=novector -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qnosave -qfree=f90 -qsuffix=cpp=F90
CFLAGS = -c -O3 -qhot -qsimd=auto -qhot=novector -qalias=allp -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qlanglvl=stdc99
// O5 replace -O3 with -O5

//***** P7 3.36GHz 1.77XP5
// O3
FFLAGS = -c -O3 -qhot -qsimd=auto -qhot=novector -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qnosave -qfree=f90 -qsuffix=cpp=F90
