











































 Processing-in-memory (PIM) architectures
 Evaluation methodology for selected scientific
computing applications
 Selected results and architecture assessment






















 Jaime Moreno, Rajiv Nair, José Brunheroto
+ others from AMC team
 Hans Boettiger, Thilo Maurer
JSC
 Paul Baumeister, Thorsten Hater, Andrea Nobile
Application developers
 Giannis Koutsou, Stefan Krieg, Hubert Simma
 Fabio Schifano, Lele Trippiccione
 Stefan Blügel





















Processing in memory (PIM)
Architectural arguments in favour of PIM
 Reduce (off-chip) data transport → less energy
 Larger Bfp / Bmem by addressing Rent's rule
→ higher performance
– More wires available to connect compute and storage
Various projects since the 1990s
 Computational RAM (1992)





 Major challenges: high costs, programming model












































IBM's Active Memory Cube (AMC)
PIM architecture based on Hybrid Memory Cube
 HMC = stack of logic die +
   multiple memory dies
 Communication via
through-silicon vias (TSVs)
AMC adds 32 compute lanes
 Very Long Instruction Word (VLIW) architecture
 Temporal SIMD concept
Dual-ported memory concept
 Access from CPU and AMC lanes
 Coherent memory access within same address space
Chainable for capacity and compute performance












































 4 double-precision FMA per lane and cycle
 8 single-precision FMA per lane and cycle
Temporal SIMD concept
 Instructions up to 32× repeatable
 Vector registers of length 32
Memory performance
 8 Byte per lane and cycle → Bfp / Bmem = 1 (DP)
 Minimal access latency: 24 cycles
Target clock: 1.25 GHz
 Nominal floating-point throughput: 320 GFlop/s (DP), 640 GFlop/s (SP)
 Nominal memory bandwidth: 320 GByte/s
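
The nominal figures above follow from the per-lane numbers. A minimal sketch of that arithmetic, assuming one FMA counts as 2 Flop (the function and constant names are illustrative, not part of any AMC tooling):

```python
# Nominal AMC peak figures derived from the per-lane numbers on this slide.
# Assumption: one fused multiply-add (FMA) counts as 2 floating-point operations.
LANES = 32                            # compute lanes per AMC
CLOCK_HZ = 1.25e9                     # target clock
FLOP_PER_FMA = 2                      # assumption stated above
FMA_PER_CYCLE = {"DP": 4, "SP": 8}    # FMA per lane and cycle
BYTES_PER_CYCLE = 8                   # memory bytes per lane and cycle


def peak_flops(precision: str) -> float:
    """Nominal floating-point throughput of one AMC in Flop/s."""
    return LANES * FMA_PER_CYCLE[precision] * FLOP_PER_FMA * CLOCK_HZ


def peak_bandwidth() -> float:
    """Nominal memory bandwidth of one AMC in Byte/s."""
    return LANES * BYTES_PER_CYCLE * CLOCK_HZ


def balance(precision: str) -> float:
    """Ratio Bfp / Bmem in Flop per Byte."""
    return peak_flops(precision) / peak_bandwidth()


if __name__ == "__main__":
    for p in ("DP", "SP"):
        print(p, peak_flops(p) / 1e9, "GFlop/s, balance", balance(p), "Flop/Byte")
    print("bandwidth", peak_bandwidth() / 1e9, "GByte/s")
```

With these inputs the sketch reproduces 320 GFlop/s (DP), 640 GFlop/s (SP), 320 GByte/s and a DP balance of 1 Flop/Byte.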




















Approach for application evaluation
Application selection
 Applications that need increasingly scalable compute
resources
Application roadmap assessment
 Questionnaire-based interviews with application experts
Application kernel performance evaluation
 Port to AMC and cycle-accurate simulation
System level performance assessment






















 Simulation of Quantum Chromodynamics: Lattice QCD
 Expected developments towards 2017:
– Increase of lattice volume up to 96³×256 or 
128³×256
– Simulations at physical quark masses
– Trend towards more complex algorithms
 PRACE: >400 PFlop/s × year in 2020
Computational fluid dynamics
 Current 2D Lattice Boltzmann formulation: D2Q37
 Future 3D formulation expected to need 20 PFlop/s
(DP, sustained) for about 64 days
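
To put the 3D Lattice Boltzmann estimate into absolute terms, the sustained rate and run time can be converted into a total operation count. A small sketch, assuming the rate is held constant over the whole run (names are illustrative):

```python
# Total work implied by a sustained rate held for a given number of days.
# Assumption: the rate is constant over the full run.
SECONDS_PER_DAY = 24 * 60 * 60


def total_flop(sustained_flops: float, days: float) -> float:
    """Total floating-point operations for a run at a constant sustained rate."""
    return sustained_flops * days * SECONDS_PER_DAY


# 20 PFlop/s (DP, sustained) for about 64 days, as quoted on the slide.
lbm_3d_total = total_flop(20e15, 64)

if __name__ == "__main__":
    print(f"{lbm_3d_total:.3e} Flop")
```

This gives a total on the order of 10²³ floating-point operations.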





















Condensed matter physics / material research
 Density Functional Method based approaches
– Focus on real-space and KKR formulation
 Use cases
– Complete simulations of entire nanostructures
→ need for scalability
– Multiple simulations for several systems and
many different sets of parameters
→ need for high throughput
 No relevant limits in intrinsic parallelism




















LQCD kernel on AMC
Focus on solver for Wilson-Dirac operator → sparse matrix-vector multiplication (SpMV)
Performance signatures
 Balance parameters (SP)
– Ifp = 1320 Flop / site
– Imem = 1.4 KiByte / site
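
The two balance parameters can be combined into an arithmetic intensity and compared with the AMC's nominal figures. The sketch below uses a simple roofline-style bound, which is an illustration added here and not the performance model used in this talk; it assumes 1 KiByte = 1024 Byte and the nominal AMC numbers of 640 GFlop/s (SP) and 320 GByte/s:

```python
# Roofline-style estimate for the Wilson-Dirac kernel in single precision.
# This is an illustrative bound, not the model used for the reported results.
I_FP = 1320.0              # Flop per lattice site (from the slide)
I_MEM = 1.4 * 1024         # Byte per lattice site, assuming 1 KiByte = 1024 Byte
PEAK_FLOPS_SP = 640e9      # nominal AMC throughput, SP, in Flop/s
PEAK_BW = 320e9            # nominal AMC memory bandwidth in Byte/s


def arithmetic_intensity(i_fp: float, i_mem: float) -> float:
    """Flop executed per Byte moved."""
    return i_fp / i_mem


def roofline_bound(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Attainable Flop/s: the smaller of the compute and bandwidth ceilings."""
    return min(peak_flops, intensity * peak_bw)


intensity = arithmetic_intensity(I_FP, I_MEM)
bound = roofline_bound(intensity, PEAK_FLOPS_SP, PEAK_BW)
memory_bound = bound < PEAK_FLOPS_SP

if __name__ == "__main__":
    print(f"intensity = {intensity:.3f} Flop/Byte")
    print(f"bound     = {bound / 1e9:.1f} GFlop/s, memory bound: {memory_bound}")
```

Under these assumptions the intensity is a little below 1 Flop/Byte, so the bandwidth ceiling lies below the SP compute ceiling and the kernel is memory-bound.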


























LQCD kernel on AMC
Implementation limited to parts of matrix-vector 
multiplication
 Projection of spinors and multiplication with U 
 4 space-time directions mapped to 4 slices
Complex arithmetic
 For double precision no special support needed













































– Streaming memory access
– No arithmetic operations
 Collision kernel in case of D2Q37
– Balance parameters (DP)
●   Ifp = ~6,000 Flop / site




















D2Q37 kernels on AMC: Propagate
Computational task: 





































D2Q37 kernels on AMC: Collide
Computational task:
 Compute equilibrium and update local distribution
 D2Q37 is compute-intensive because the equilibrium
distribution is expressed in terms of Hermite polynomials
Mapping to AMC
 Dense packing of instructions can be achieved:
– 10% ALU NOPs









































System level performance model
Parametric node architecture
 AMC-CPU and network bandwidth




















System level performance model (cont.)
Model assumptions
 Latency-bandwidth model ansatz: Tx = λx + I / βx 
 Perfect overlap of computation and communication
Ansatz
 Tcomp: Time needed for computations 
→ determined from cycle-accurate simulations
 Tmem: Time needed for intra-node communication
 Tnet: Time needed for inter-node communication
 T = max(Tcomp, max(Tmem, Tnet))
Parameter choices
 8 memory channels, 2 AMC per channel
 CPU-network: λnet = 1 μs, βnet = 100 GByte/s
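
The model can be written down in a few lines. The sketch below implements Tx = λx + I / βx and T = max(Tcomp, max(Tmem, Tnet)) as stated above. The network parameters are the ones quoted on the slide; the intra-node link parameters and the class and function names are hypothetical placeholders, and in the actual study Tcomp comes from cycle-accurate simulation.

```python
# Latency-bandwidth model with perfect overlap of computation and communication.
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    latency_s: float        # lambda_x in seconds
    bandwidth_Bps: float    # beta_x in Byte/s

    def transfer_time(self, nbytes: float) -> float:
        """T_x = lambda_x + I / beta_x for I bytes."""
        return self.latency_s + nbytes / self.bandwidth_Bps


def step_time(t_comp: float, mem_bytes: float, net_bytes: float,
              mem: Link, net: Link) -> float:
    """T = max(Tcomp, max(Tmem, Tnet)) under perfect overlap."""
    t_mem = mem.transfer_time(mem_bytes)
    t_net = net.transfer_time(net_bytes)
    return max(t_comp, max(t_mem, t_net))


# Network link as quoted on the slide: 1 microsecond, 100 GByte/s.
NET = Link(latency_s=1e-6, bandwidth_Bps=100e9)
# Hypothetical intra-node (AMC-CPU) link, for illustration only.
MEM_EXAMPLE = Link(latency_s=0.5e-6, bandwidth_Bps=50e9)

if __name__ == "__main__":
    # Hypothetical workload: 5 microseconds of compute, 100 kB per link.
    t = step_time(5e-6, 100e3, 100e3, MEM_EXAMPLE, NET)
    print(f"T = {t * 1e6:.2f} us")
```

Because of the perfect-overlap assumption, the slowest of the three components alone determines the time per step.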




















System level performance: LQCD
Observations
 Tmem > Tnet 
 Tcomp > Tmem for L ≥ 8
Performance
 ~130 GFlop/s (SP)
per AMC
 ~13 GFlop/s/W (SP)
(AMC only)
Strong scaling limits
 Assume V = 128³×256
and local lattice l⁴ = 8⁴ per lane
→ 4096 AMC
 Upper limit of 0.5 PFlop/s
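
The strong-scaling limit can be checked with integer arithmetic. The sketch assumes that each lane holds a local lattice of 8⁴ sites (l = 8), which is the reading that reproduces the 4096 AMC quoted above, together with 32 lanes per AMC and roughly 130 GFlop/s (SP) per AMC. Names are illustrative.

```python
# Strong-scaling estimate for LQCD on AMC.
# Assumption: each lane holds a local lattice of 8^4 sites (l = 8).
GLOBAL_LATTICE = (128, 128, 128, 256)   # V = 128^3 x 256
LOCAL_EXTENT = 8                        # l, local lattice is l^4 per lane
LANES_PER_AMC = 32
GFLOPS_PER_AMC = 130.0                  # approximate sustained SP rate per AMC


def num_sites(dims) -> int:
    """Number of lattice sites for the given extents."""
    n = 1
    for d in dims:
        n *= d
    return n


def amcs_needed(dims, local_extent: int, lanes: int) -> int:
    """AMC count when every lane is assigned exactly local_extent^4 sites."""
    sites_per_amc = lanes * local_extent ** 4
    total = num_sites(dims)
    if total % sites_per_amc:
        raise ValueError("global lattice is not divisible by the per-AMC volume")
    return total // sites_per_amc


n_amc = amcs_needed(GLOBAL_LATTICE, LOCAL_EXTENT, LANES_PER_AMC)
aggregate_pflops = n_amc * GFLOPS_PER_AMC / 1e6   # GFlop/s -> PFlop/s

if __name__ == "__main__":
    print(n_amc, "AMC,", round(aggregate_pflops, 2), "PFlop/s")
```

The result is 4096 AMC and an aggregate rate slightly above 0.5 PFlop/s, consistent with the quoted upper limit.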




















System level performance: D2Q37
Observations




 ~175 GFlop/s (DP)
per AMC
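
For orientation, the sustained rate can be related to the nominal DP peak of 320 GFlop/s and to the collide cost of about 6,000 Flop per site. This is a rough sketch: it assumes that the sustained rate applies to the whole site update and that all floating-point work comes from the collide kernel. Names are illustrative.

```python
# Sustained performance relative to nominal peak, and implied site update rate.
# Assumption: all Flop of a site update come from the collide kernel.
NOMINAL_PEAK_GFLOPS = {"DP": 320.0, "SP": 640.0}   # nominal AMC throughput


def fraction_of_peak(sustained_gflops: float, precision: str) -> float:
    """Sustained rate divided by the nominal peak for the given precision."""
    return sustained_gflops / NOMINAL_PEAK_GFLOPS[precision]


def site_updates_per_second(sustained_gflops: float, flop_per_site: float) -> float:
    """Lattice site updates per second implied by a sustained Flop rate."""
    return sustained_gflops * 1e9 / flop_per_site


d2q37_fraction = fraction_of_peak(175.0, "DP")
d2q37_site_rate = site_updates_per_second(175.0, 6000.0)

if __name__ == "__main__":
    print(f"{d2q37_fraction:.1%} of nominal DP peak")
    print(f"{d2q37_site_rate / 1e6:.1f} million site updates per second per AMC")
```

Under these assumptions D2Q37 sustains a little over half of the nominal DP peak per AMC.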























 Number of slices
 Number of scalar/vector registers, vector register length
 Local Instruction Buffer size
 Load-store queue size
 Memory capacity and bandwidth
Conclusions LBM
 Collide performance limited by ISA
 Application could cope with longer vectors
Conclusions LQCD
 Performance limited by memory bandwidth






















 Continues to be an interesting architectural proposition
 Availability of products still unclear
Scientific computing applications on AMC
 Considered applications could exploit AMC efficiently
 Evaluation limited to relatively simple application kernels
because a programming environment is not yet available
 PIM attractive option for scientific computing
AMC architecture evaluation
 Architectural parameters matched application 
requirements well
