Altimesh Hybridizer by Duguet, Florent
© Altimesh 2016 – TES 2016 – all rights reserved
Altimesh Hybridizer™
Embrace Micro-Architecture Changes
Abstract-Out Instruction Set Variety
Achieve State-Of-The-Art Performance
Why HPE?
• Center of Excellence EMEA located in Grenoble
– Talented support team
– Ease of access for pre-GA hardware
• Hardware variety
– Comprehensive Intel solutions
– Moonshot platform (ARM)
– AMD and NVIDIA accelerators
Finance and Regulation
• Financial institutions are very creative
– Derivative products ecosystem grows constantly
– Some players introduce new product types that exploit unregulated corners of finance [e.g. subprimes]
• Every big financial event yields new regulations
– More stress scenarios [Too big to fail]
– More complex financial quantitative models [Liquidity]
– Higher number of simulations [unlikely systemic events]
• Quant analysts need to (re-)design quant libraries constantly
– New models need to be developed, tested and integrated in existing 
system
– Performance is becoming critical: from thousands to millions of simulations, within the same power envelope?
– Code optimization gets low priority: keeping up with the changes imposed by regulators is already a heavy burden
Processor Ecosystem
• Processors have changed
• Frequency drops; core count and vector-unit width explode
• Most problems become memory bound (flop / memop > 25)
• Multithreading is not the only issue (SIMD/SIMT ratio)
• Keeping up with technology changes requires significant software development effort and training
processor                 Pentium 4   Xeon E5-v3   Xeon PHI   KNL     Kepler
year                      2000        2014         2013       2016    2012
core frequency (GHz)      3.8         2.3          1.24       ?       0.745
vector unit size (DP)     1           4            8          8       32
pipelines / core          1           2            1          2       2
contexts                  1           2            4          4       4
core count                1           18           61         72      15
FMA                       1           2            2          2       2
Peak scalar GFLOPS        3.8         165.6        151.28     375+    22.35
Peak GFLOPS (DP)          3.8         662.4        1210.24    3000+   1430.4
SIMD/SIMT ratio           1           4            8          8       64
Bandwidth (R/W, GB/s)     4.26        68           352        400+    288
flop / memop              7.14        77.93        27.51      ~60     39.73
Bandwidth / core (GB/s)   4.26        3.78         5.77       ~5.6    19.20
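The derived rows of the table above can be recomputed from the raw ones. The sketch below is a reading of the table, not something stated on the slide: it assumes peak DP GFLOPS = frequency x vector size x pipelines x FMA x cores, and that one memop moves one 8-byte double. The names `PROCESSORS` and `derived_rows` are illustrative. KNL is left out because its frequency is listed as "?".

```python
# Raw columns copied from the processor table (KNL omitted: frequency is "?").
# name: (GHz, DP vector size, pipelines/core, FMA factor, cores, bandwidth GB/s)
PROCESSORS = {
    "Pentium 4":  (3.8,   1,  1, 1, 1,  4.26),
    "Xeon E5-v3": (2.3,   4,  2, 2, 18, 68.0),
    "Xeon PHI":   (1.24,  8,  1, 2, 61, 352.0),
    "Kepler":     (0.745, 32, 2, 2, 15, 288.0),
}


def derived_rows(ghz, vector, pipelines, fma, cores, bandwidth):
    """Recompute the derived table rows under the assumptions stated above."""
    peak = ghz * vector * pipelines * fma * cores  # peak DP GFLOPS
    memops = bandwidth / 8.0                       # giga-doubles moved per second
    return {
        "peak_gflops": peak,
        "flop_per_memop": peak / memops,
        "bandwidth_per_core": bandwidth / cores,
    }


for name, raw in PROCESSORS.items():
    rows = derived_rows(*raw)
    print(name, {key: round(value, 2) for key, value in rows.items()})
```

Under these assumptions the computed values agree with the table to two decimals, which suggests the flop / memop row is simply peak GFLOPS divided by the number of doubles the memory system can deliver per second.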
Key Changes to Embrace
• Multithread: core counts explode while frequency stalls or decreases => code that is not multithreaded will get slower on future processors
• Vectorize: vector unit size grows. The SIMD/SIMT ratio indicates the relative loss when code is not vectorized. AVX-512 will double that loss on Intel x86 architectures.
• Cache-aware: flop/memop increases (> 25). Operations need to occur in cache. Large vector operations are memory bound and should be replaced by small vector operations
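To make the cache-aware point concrete, here is a minimal sketch of a roofline-style bound. The roofline model is a standard technique brought in for illustration, not something taken from the slides; the function name `attainable_gflops` and the 8-byte-per-value assumption are illustrative. The figures plugged in are the Xeon E5-v3 peak and bandwidth from the processor table.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops, memops, bytes_per_value=8):
    """Upper bound on throughput for a kernel doing `flops` per `memops` memory operations.

    The kernel cannot run faster than the compute peak, nor faster than
    main memory can feed it with operands.
    """
    values_per_second = bandwidth_gb_s / bytes_per_value  # giga-values per second
    memory_limit = values_per_second * (flops / memops)
    return min(peak_gflops, memory_limit)


# Xeon E5-v3 figures from the processor table: 662.4 peak DP GFLOPS, 68 GB/s.
# y[i] = a * x[i] + y[i] on vectors too large for cache:
# 2 flops for 3 memory operations (read x, read y, write y).
large_vector = attainable_gflops(662.4, 68.0, flops=2, memops=3)

# A kernel that reuses each loaded value for 100 flops reaches the compute peak.
high_reuse = attainable_gflops(662.4, 68.0, flops=100, memops=1)

print(f"large vector update: {large_vector:.2f} GFLOPS "
      f"({100.0 * large_vector / 662.4:.1f}% of peak)")
print(f"high-reuse kernel:   {high_reuse:.1f} GFLOPS")
```

With these inputs the large vector update is capped at under 1% of peak, which is why working on small vectors that stay in cache matters more than raw vector width.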
Hybridizer aims at addressing these
challenges with a unified approach
Hybridizer Solution
• Input
– .Net
– Java
– C/C++ (ongoing developments)
• Environments:
– Windows / Linux
• Generate source code
– CUDA/C for NVIDIA GPU
– C++ for native platforms
– OpenCL
• Unified work distribution pattern
[Diagram: inputs ([C/C++], Java, .Net) -> Hybridizer -> CUDA/C (NVIDIA); C++ (AVX-2 / x86 CPU; Xeon PHI / Intel; AVX-512 / Intel KNL/Skylake); OpenCL]
Hybridizer Benefits
• Single version of source code
– Express parallelism with a paradigm of choice (ParallelFor / 
iterators / custom indexing type)
– Generates several flavors of source code 
• Execution on a variety of platforms
– Plain C, CUDA
– Vector-units: AVX, AVX2, AVX-512
– External libraries integration (e.g. MKL) and extensibility
(hand-tuned micro-architecture specific codes)
• Debugging / Profiling of output
– Code location is preserved on target platform
– Integration in existing debugging / profiling tools
– Generated source-code is readable for auditing
Integration with Intel VTune Amplifier
[Screenshot: the scalar C# source file shown next to the generated AVX2 instructions; standard System.Math methods are mapped on Intel SVML]
Matrix Multiply
[Code listings: naive matrix multiply; block-accumulation (better cache behavior?)]
Prefer Vendor-Tuned Libraries
Matrix multiply sounds simple; however, it involves advanced features:
• Vector-unit operations
• Non-temporal write
• Several layers of memory prefetching
• Many corner cases for unaligned sizes, transposes, etc.
Matrix Multiply
[Code listings: naive matrix multiply; splitting loops (better cache behavior?)]
MATRIX MULTIPLY (C++ / INTEL 15.0), GFLOPS
Naive            24.9
Block-ordering   0.84
MKL DGEMM        114
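The code shown on the original slides did not survive extraction. As a stand-in, here is a minimal sketch of the two hand-written variants being compared: a naive triple loop and a block-ordered version that walks the iteration space tile by tile. The function names, the row-major flat-list layout and the block size are assumptions made for illustration; this is not the benchmarked C++ code, and pure Python says nothing about the GFLOPS figures above.

```python
import random


def matmul_naive(a, b, n):
    """C = A * B for n x n row-major flat lists, textbook i-j-k ordering."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def matmul_blocked(a, b, n, block=4):
    """Same product, with each loop split into a tile loop and an in-tile loop."""
    c = [0.0] * (n * n)
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # min(...) handles sizes that are not a multiple of the block size.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i * n + k]
                        for j in range(jj, min(jj + block, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c


rng = random.Random(0)
n = 10  # deliberately not a multiple of the block size
a = [rng.uniform(-1.0, 1.0) for _ in range(n * n)]
b = [rng.uniform(-1.0, 1.0) for _ in range(n * n)]
reference = matmul_naive(a, b, n)
blocked = matmul_blocked(a, b, n, block=4)
max_diff = max(abs(x - y) for x, y in zip(reference, blocked))
print("largest difference between the two variants:", max_diff)
```

Both variants compute the same result; only the traversal order differs. Even this small sketch needs explicit edge handling for unaligned sizes, one of the corner cases a vendor-tuned DGEMM already takes care of.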
A Good Compiler Is Not Enough
Use Vendor-Tuned Libraries
• "What Every Programmer Should Know About Memory", by Ulrich Drepper
– It takes a lot to write (close to) optimal code
– Understanding the core components of the system is necessary to get good performance (getting a compute-bound implementation of matrix multiply is hard)
• Micro-architectures evolve
– AVX means 256-bit operands => new instruction set with respect to SSE
– AVX-2 has more instructions => need to redefine some code (different latencies, fused multiply-add, integer operations, gather instruction)
– AVX-512 is totally different; moreover, the flops/memops ratio evolves => need to rewrite
• Vendors provide optimized libraries (Intel MKL)
– Prefer optimized libraries over hand-written versions
– Writing code to convert from a custom data layout to the optimized library's data layout most often still gives better performance
• Hybridizer integrates these libraries with Extensibility attributes
– Available through wrapper methods, with no overhead when using these libraries
– Same approach to integrate existing in-house developments
ON PERFORMANCE
Benchmark-Level Performances
BLACK-SCHOLES - CLOSED FORM, GOptions/s
C++ annotated / Intel Compiler   0.433
DotNet                           0.102
Hybridizer                       0.402 (7% overhead)
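For readers unfamiliar with the kernel being timed, the following is a minimal sketch of the Black-Scholes closed form for a European option without dividends. It is the textbook formula, not the benchmarked source; the function names and the sample parameters are illustrative assumptions.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, maturity):
    """Return (call, put) prices of a European option, no dividends."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * maturity)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


def price_all(options):
    """Price a batch of independent options; each iteration is independent of the others."""
    return [black_scholes(*option) for option in options]


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
print(f"call = {call:.4f}, put = {put:.4f}")
```

Each option is priced independently with a handful of `log`, `exp` and `erf` calls, which is why this benchmark parallelizes and vectorizes well and mostly measures how efficiently the math functions are mapped onto the vector units.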
Extended features
Virtual Functions
• Interfaces, abstract classes and inheritance are supported
• Underlying implementation is a function table
Generics
• Generics get mapped onto 
templates
• C++ template concepts are 
expressed by DotNet/Java 
generics constraints
• Restored performance
Object oriented programming
productivity maintained …
… And overhead can be removed
Financial Model Spot Diffusion
Dot net source code 
Generic parameters for flexibility
C++ source code with annotations 
(two outer loop configurations)
Black-Scholes-Merton Diffusion
FINANCIAL MODEL SPOT DIFFUSION – GSTEPS/S
Outer loop   Implementation         16384 simulations           512 simulations
                                    (off-cache, memory-bound)   (L2-cache, compute-bound)
Sim          Dot Net                0.1326                      0.2116
Sim          Hybridizer             0.7189                      3.300
Sim          C++ / Intel Compiler   0.4105                      2.659
Time         Dot Net                0.1805                      0.1826
Time         Hybridizer             0.7506                      3.284
Time         C++ / Intel Compiler   0.7837                      4.723
• Comparing object-oriented code with generics, processed by Hybridizer, with hand-written optimized C++ code compiled with Intel Composer 2015
• Hybridizer greatly improves dotnet performance: 5x to 18x
• Object-oriented programming is preserved: a single version of the source code reduces operational risk and testing costs.
[Chart callout: significant dotnet performance improvement]
• Hybridizer provides benchmark-level performance (96% of the best performing version off-cache)
[Chart callout: small overhead for off-cache (4%)]
• Loop ordering has little impact on the Hybridizer version (~4%), yet a large impact on the hand-written implementation (>45%)
NOTE: cache locality and outer-loop selection have a 10x impact on performance. Writing optimized C++ code requires significant effort and knowledge.
[Chart callout: loop ordering has little impact on Hybridizer version]
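The two outer-loop configurations compared above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the benchmarked code: it assumes a log-Euler step for a Black-Scholes-Merton spot, takes the normal draws as a precomputed `normals[t][s]` array, and uses hypothetical function names.

```python
import math
import random


def diffuse_sim_outer(spot0, rate, vol, dt, normals):
    """Outer loop over simulations: each path is walked through all time steps in turn."""
    n_steps, n_sims = len(normals), len(normals[0])
    drift = (rate - 0.5 * vol * vol) * dt
    scale = vol * math.sqrt(dt)
    spots = [spot0] * n_sims
    for s in range(n_sims):
        x = spot0
        for t in range(n_steps):
            x *= math.exp(drift + scale * normals[t][s])
        spots[s] = x
    return spots


def diffuse_time_outer(spot0, rate, vol, dt, normals):
    """Outer loop over time steps: all paths advance together, one step at a time."""
    n_steps, n_sims = len(normals), len(normals[0])
    drift = (rate - 0.5 * vol * vol) * dt
    scale = vol * math.sqrt(dt)
    spots = [spot0] * n_sims
    for t in range(n_steps):
        row = normals[t]
        for s in range(n_sims):
            spots[s] *= math.exp(drift + scale * row[s])
    return spots


rng = random.Random(1)
normals = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
by_sim = diffuse_sim_outer(100.0, 0.02, 0.3, 0.1, normals)
by_time = diffuse_time_outer(100.0, 0.02, 0.3, 0.1, normals)
print(max(abs(x - y) for x, y in zip(by_sim, by_time)))
```

Both functions produce the same terminal spots; they differ only in memory access pattern. The simulation-outer version strides through `normals` column by column, while the time-outer version reads each row contiguously but rewrites the whole `spots` array at every step, so which one wins depends on whether the working set fits in cache.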
HOW ABOUT AVX-512?
How about AVX-512?
• Hybridizer generates C++ using a small vector library (a.k.a. phivect)
• Phivect is implemented and optimized for several micro-architectures
• The AVX-512 version of phivect is fully functional
[Diagram: Hybridizer -> C++ (phivect) -> AVX-2 / x86 CPU; Xeon PHI / Intel; AVX-512 / Intel KNL/Skylake]
Conclusions
• Shortened development cycles
– Single version of source code, with "managed" languages
– Integrates with debuggers and profilers
• State-of-the-art performance
– Software development flexibility without performance costs
– Close to benchmark (>90%) for compute-bound and memory-bound problems
• Embrace micro-architecture changes
– Hybridizer is AVX-512 ready: simply recompile?
http://www.altimesh.com
