Architecture level Optimizations for Kummer based HECC on FPGAs by Gallin, Gabriel et al.
Architecture level Optimizations for
Kummer based HECC on FPGAs
Gabriel GALLIN – Turku Ozlum CELIK – Arnaud TISSERAND




        size of GF(P) elems.    ADD        DBL        source
ECC     ℓ_ECC                   12M + 2S   7M + 3S    [2]
HECC    ℓ_HECC ≈ 1/2 ℓ_ECC      40M + 4S   38M + 6S   [7]
KHECC   ℓ_HECC                  19M + 12S (ADD and DBL combined)   [10]
Metric for algorithm efficiency: number of multiplications (M) and squarings (S) in GF(P)
Kummer-HECC (KHECC) is more efficient than ECC:
- Software implementations by Renes et al. at CHES 2016 [10]
  - ARM Cortex M0: up to 75% fewer clock cycles for signatures
  - AVR ATmega: up to 32% fewer clock cycles for Diffie-Hellman
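The operation counts in the table can be turned into a rough per-ladder-step cost estimate. The sketch below is only a back-of-the-envelope model and rests on assumptions that are not in the slides: a squaring is counted like a multiplication, the cost of a GF(P) multiplication is assumed to grow quadratically with the element size, and ECC elements are taken to be 256 bits (twice the 128-bit KHECC elements). The names `step_cost`, `ECC_BITS` and `KHECC_BITS` are hypothetical.

```python
# Rough cost of one ladder step, in units of one 128-bit multiplication.
# Assumptions (not from the slides): S costs as much as M, and the cost of a
# multiplication grows quadratically with the size of GF(P) elements.

ECC_BITS = 256     # assumed ECC element size
KHECC_BITS = 128   # KHECC element size


def step_cost(mults, squares, bits):
    """Cost of (mults M + squares S) on `bits`-bit elements."""
    return (mults + squares) * (bits / KHECC_BITS) ** 2


ecc = step_cost(12 + 7, 2 + 3, ECC_BITS)   # one ADD plus one DBL
khecc = step_cost(19, 12, KHECC_BITS)      # one combined xDBLADD
ratio = khecc / ecc

print(f"ECC: {ecc:.0f}  KHECC: {khecc:.0f}  ratio: {ratio:.2f}")
```

Under these assumptions a KHECC step costs about a third of an ECC step. Real figures depend on the multiplier design and on how squarings are implemented.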
G.Gallin – T.O.Celik – A.Tisserand Indocrypt 2017 Dec., 11th 2017 2 / 21









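[Figure: computation hierarchy, from protocols down to the curve-level operation xDBLADD(P,Q,Pb) and the GF(P) operations x ± y and x × y]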
- Protocols based on scalar multiplication
- Sequence of curve-level operations xDBLADD:
  (±P, ±Q, ±(P − Q)) → (±[2]P, ±(P + Q))
- Size of elements in GF(P): 128 bits
- Dedicated hyper-threaded multiplier [3]: 3 independent modular multiplications computed in parallel
Scalar Multiplication: Montgomery Ladder
Montgomery ladder based crypto_scalarmult from [10]:
Require: m-bit scalar k = ∑_{i=0}^{m−1} 2^i k_i, point Pb, cst ∈ GF(P)^4
Ensure: V1 = [k]Pb, V2 = [k + 1]Pb
  V1 ← cst
  V2 ← Pb
  for i = m − 1 downto 0 do
    (V1,V2) ← CSWAP(k_i, (V1,V2))
    (V1,V2) ← xDBLADD(V1,V2,Pb)
    (V1,V2) ← CSWAP(k_i, (V1,V2))
  end for
  return (V1,V2)
CSWAP(k_i, (X,Y)) returns (X,Y) if k_i = 0, else (Y,X)
- Constant time, uniform operations (independent of key bits)
- CSWAP: very simple but handles secret bits (to be protected)
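The control flow of the ladder can be checked with a small sketch. It does not implement Kummer surface arithmetic: as a labeled assumption, the group is replaced by the integers under addition, so that [k]Pb is simply k * pb, cst is the neutral element 0, and the hypothetical `xdbladd` stand-in returns (2·V1, V1 + V2). Only the sequence of CSWAP and xDBLADD calls follows the algorithm above.

```python
def cswap(bit, pair):
    """Return the pair unchanged if bit == 0, swapped if bit == 1."""
    # Arithmetic selection instead of a branch on the key bit; a hardware
    # unit would use multiplexers with the same uniform behavior.
    x, y = pair
    return (x + bit * (y - x), y + bit * (x - y))


def xdbladd(v1, v2, pb):
    """Toy stand-in for xDBLADD on integers: (2*v1, v1 + v2).

    The real operation also uses pb, the known difference of its two inputs;
    the toy group does not need it.
    """
    return (2 * v1, v1 + v2)


def ladder(k, pb, m):
    """Montgomery ladder over an m-bit scalar k, most significant bit first."""
    v1, v2 = 0, pb  # cst is modeled as the neutral element
    for i in range(m - 1, -1, -1):
        ki = (k >> i) & 1
        v1, v2 = cswap(ki, (v1, v2))
        v1, v2 = xdbladd(v1, v2, pb)
        v1, v2 = cswap(ki, (v1, v2))
    return v1, v2


print(ladder(13, 5, 8))  # (65, 70), i.e. (13 * 5, 14 * 5)
```

Every iteration performs the same three operations whatever the value of k_i, which is the uniformity property stated above.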
[Figure: data-flow graph of the GF(P) operations in xDBLADD, with constant operands (cst)]
- Some parallelism available (up to 8 GF(P) operations)
- Several possible hardware architectures can be implemented
Architectural Exploration
- Fast exploration and validation of numerous hardware architecture configurations with dedicated tools (cf. paper)
- Full implementation of 4 selected architectures:
  A1: Smallest architecture
  A2: Modification of CSWAP
  A3: Doubled number of arithmetic units
  A4: Doubled number of units (arithmetic and MEM) in 2 clusters
- Width of MEM and interconnect to be selected: w = 34, 68 or 136 bits
Architecture A1: Base Solution
FPGA  w [bit]  LUT   FF    logic slices  DSP slices  RAM blocks  freq. [MHz]  clock cycles  time [ms]
V4    34       1010  1833  1361          11          4           322          194,614       0.60
V4    68       1750  3050  2251          11          5           305          186,911       0.61
V4    136      2281  3028  1985          11          7           266          184,337       0.69
V5    34       757   1816  603           11          4           360          194,614       0.54
V5    68       1264  3033  908           11          5           360          186,911       0.52
V5    136      1582  3008  940           11          7           360          184,337       0.51
S6    34       1064  1770  408           11          4           278          194,614       0.70
S6    68       1555  2970  705           11          5           252          186,911       0.74
S6    136      1910  2994  747           11          7           221          184,337       0.83
- Area increases when w increases
- Increased number of BRAMs for large memories
- Small reduction in clock cycles for larger w, cancelled by frequency drops on V4 and S6
- Small w = 34 is the most interesting configuration for the A1 architecture
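The time column follows from the two columns before it: time = clock cycles / frequency. A minimal sketch, using the A1 figures for V4 copied from the table above (the helper name `time_ms` is hypothetical), shows how the small cycle savings at larger w are outweighed by the lower clock frequency:

```python
# (w [bit], frequency [MHz], clock cycles) for A1 on V4, from the table above.
A1_V4 = [(34, 322, 194_614), (68, 305, 186_911), (136, 266, 184_337)]


def time_ms(cycles, freq_mhz):
    """Execution time in milliseconds for a cycle count at freq_mhz MHz."""
    return cycles / (freq_mhz * 1e3)


for w, freq, cycles in A1_V4:
    print(f"w = {w:3d}: {time_ms(cycles, freq):.2f} ms")
```

This prints 0.60, 0.61 and 0.69 ms, matching the table: going from w = 34 to w = 136 saves about 5% of the cycles but loses about 17% of the frequency on V4.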
Architecture A2: CSWAP Optimization
- Same architecture topology as A1: 1 AddSub, 1 Mult, 1 MEM and 1 modified CSWAP
- Modified CSWAP unit implements the new CSWAPV2 operation:
  - Merges consecutive CSWAP operations of successive iterations
    (V1,V2) ← CSWAPV2((0, k_{m−1}), (V1,V2))
    for i = m − 1 downto 1 do
      (V1,V2) ← xDBLADD(V1,V2,Pb)
      (V1,V2) ← CSWAPV2((k_i, k_{i−1}), (V1,V2))
    end for
  - Swaps V1 and V2 if k_i ≠ k_{i−1} (only one XOR gate needed)
- CSWAP unit has constant-time behavior
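The merged swap can be checked on the same kind of toy model as before: integers under addition stand in for the group, so the expected result for scalar k is (k·pb, (k+1)·pb). This is a sketch under assumptions, not the authors' implementation. In particular, the listing above stops at i = 1, so the sketch assumes one final xDBLADD followed by a swap on (k_0, 0) to complete the last iteration. The names `cswap_v2`, `xdbladd` and `ladder_merged` are hypothetical.

```python
def xdbladd(v1, v2):
    """Toy stand-in for xDBLADD on integers: (2*v1, v1 + v2)."""
    return (2 * v1, v1 + v2)


def cswap_v2(bits, pair):
    """Swap the pair iff the two key bits differ (one XOR decides)."""
    a, b = bits
    x, y = pair
    # The Python conditional is for readability only; it is not constant time.
    return (y, x) if a ^ b else (x, y)


def ladder_merged(k, pb, m):
    """Ladder with the two CSWAPs of adjacent iterations merged into one."""
    def bit(i):
        return (k >> i) & 1

    v1, v2 = cswap_v2((0, bit(m - 1)), (0, pb))
    for i in range(m - 1, 0, -1):
        v1, v2 = xdbladd(v1, v2)
        v1, v2 = cswap_v2((bit(i), bit(i - 1)), (v1, v2))
    # Assumed final step, not shown in the listing: last xDBLADD, then undo
    # the swap that is still pending for k_0.
    v1, v2 = xdbladd(v1, v2)
    v1, v2 = cswap_v2((bit(0), 0), (v1, v2))
    return v1, v2


print(ladder_merged(13, 5, 8))  # (65, 70)
```

With this scheme an m-bit scalar needs m + 1 swaps instead of 2m, which is consistent with the lower cycle counts reported for A2.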
FPGA  w [bit]  LUT   FF    logic slices  DSP slices  RAM blocks  freq. [MHz]  clock cycles  time [ms]
V4    34       872   1624  1121          11          4           330          184,374       0.56
V4    68       1556  2637  1978          11          5           290          183,071       0.63
V4    136      2161  3027  2100          11          7           327          183,057       0.56
V5    34       722   1605  541           11          4           360          184,374       0.51
V5    68       1196  2620  840           11          5           360          183,071       0.51
V5    136      1419  3009  944           11          7           360          183,057       0.51
S6    34       940   1559  381           11          4           293          184,374       0.63
S6    68       1503  2565  553           11          5           262          183,071       0.70
S6    136      1890  2981  667           11          7           283          183,057       0.65
- Fewer CSWAPV2 operations ⇒ slightly fewer clock cycles than in A1
- Simplified management of CSWAPV2 operations
- Slightly higher frequencies, with smaller variations
- Slightly reduced area (LUTs and FFs)
- A2 slightly more interesting than A1 for both speed and area (∼10%)
- Small w = 34 is still the best configuration
Architecture A3: Large Architecture
- Doubled number of GF(P) units: 2 AddSub, 2 Mult
FPGA  w [bit]  LUT   FF    logic slices  DSP slices  RAM blocks  freq. [MHz]  clock cycles  time [ms]
V4    34       1462  2611  1783          22          6           294          188,218       0.64
V4    68       2802  4367  3468          22          7           282          124,191       0.44
V4    136      3768  5017  3660          22          9           285          119,057       0.42
V5    34       1262  2607  921           22          6           358          188,218       0.53
V5    68       2290  4403  1409          22          7           345          124,191       0.36
V5    136      2737  4978  1594          22          9           348          119,057       0.34
S6    34       1527  2503  668           22          6           265          188,218       0.71
S6    68       2421  4267  1020          22          7           225          124,191       0.55
S6    136      3007  4877  1131          22          9           225          119,057       0.53
- +60–90% LUTs, +11 DSP slices, +2 BRAMs compared to A2
- Frequency drops on V4 (< 13%) and S6 (< 20%)
- 34–36% fewer clock cycles for w = 68 and w = 136, compared to w = 34
- 25 to 35% shorter computation time for w = 136, depending on the FPGA
- A3 is faster than A2, but larger → area–speed trade-offs
Architecture A4: Two Clusters
- Decomposition of xDBLADD into two symmetric clusters of GF(P) operations
- Modifications of xDBLADD:
  - Squares → multiplications
  - No impact on mathematical behavior or on operation count
- New modification of CSWAP: CSWAPV3
  - Replaced by two new swapping operations:
    - CS0(A,B,C,D) → (A,B,C,B) if k_i = 0 else (C,D,A,D)
    - CS1(A,B,C,D) → (A,B,C,D) if k_i = 0 else (C,D,A,B)
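The two swapping operations can be written down directly from their definitions. The sketch below is a plain transcription with hypothetical function names `cs0` and `cs1`. The slides do not say which ladder values the operands A, B, C and D hold, so they are kept as opaque labels here.

```python
def cs0(ki, a, b, c, d):
    """CS0: (A, B, C, B) if ki == 0, else (C, D, A, D)."""
    # A hardware unit would select with multiplexers; the Python conditional
    # only documents the input/output behavior.
    return (a, b, c, b) if ki == 0 else (c, d, a, d)


def cs1(ki, a, b, c, d):
    """CS1: (A, B, C, D) if ki == 0, else (C, D, A, B)."""
    return (a, b, c, d) if ki == 0 else (c, d, a, b)


for ki in (0, 1):
    print(ki, cs0(ki, "A", "B", "C", "D"), cs1(ki, "A", "B", "C", "D"))
```

CS1 is an ordinary conditional swap of the pairs (A,B) and (C,D). CS0 additionally copies its second output into the fourth position, so the last output is B or D depending on k_i.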
- Same number of GF(P) units as in A3: 2 AddSub, 2 Mult
- Doubled number of MEM: one for each hardware cluster
- CSWAP unit: "bridge" to exchange data between clusters
FPGA  w [bit]  LUT   FF    logic slices  DSP slices  RAM blocks  freq. [MHz]  clock cycles  time [ms]
V4    34       1695  2950  2158          22          7           324          142,119       0.44
V4    68       2804  4282  3184          22          9           290          128,021       0.44
V4    136      3171  4994  3337          22          13          299          125,456       0.42
V5    34       1370  2953  1013          22          7           358          142,119       0.40
V5    68       2095  4259  1358          22          9           337          128,021       0.38
V5    136      2514  4952  1589          22          13          313          125,456       0.40
S6    34       1564  2089  758           22          7           262          142,119       0.54
S6    68       2387  4030  1060          22          9           239          128,021       0.54
S6    136      3181  4786  1136          22          13          251          125,456       0.50
- Increased area for w = 34 compared to A3
- Increased number of BRAMs for the additional MEM
- Fewer clock cycles for w = 34 ⇒ MEM is the bottleneck in small configurations
- A4 is better than A3 for the small configuration w = 34
Trade-offs for our Architectures A1–4
archi. A1 A2 A3 A4
#Mult 1 1 2 2
#AddSub 1 1 2 2
#CSWAP 1 1 1 1
#MEM 1 1 1 2
Comparisons with ECC State-of-the-Art
year  ref.       target      P         LUT    FF     logic slices  DSP slices  RAM blocks  freq. [MHz]  time [ms]
2008  [4]        XC4VFX12    NIST-256  2589   2028   1715          32          11          490          0.50
2008  [4]        XC4VFX12    NIST-256  34896  32430  24574         512         176         375          0.04
2014  [1]        XC6VFX760   NIST-256  32900  n.a.   11200         289         128         100          0.40
2012  [6]        XC4VFX12    GEN-256   n.a.   n.a.   2901          14          n.a.        227          1.09
2012  [6]        XC5VLX110   GEN-256   n.a.   n.a.   3657          10          n.a.        263          0.86
2013  [8]        XC4VLX100   GEN-256   5740   4876   4655          37          11          250          0.44
2013  [8]        XC5VLX110T  GEN-256   4177   4792   1725          37          10          291          0.38
2017  A4 (w 34)  XC4VLX100   GEN-128   1695   2950   2158          22          7           324          0.44
2017  A4 (w 34)  XC5VLX110T  GEN-128   1370   2953   1013          22          7           358          0.40
Conclusion and Perspectives
- Kummer-HECC is an efficient alternative to ECC in hardware, compared to equivalent state-of-the-art solutions for ECC:
  - Halved area for the same computation time
  - Scalar multiplication 40% faster for the same area cost
- Exploration of new architectures: topology, control, protection against SCA
- Release of VHDL code and exploration tools under an open-source license (by the end of Spring)
References
[1] H. Alrimeih and D. Rakhmatov.
Fast and flexible hardware support for ECC over multiple standard prime fields.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(12):2661–2674, January 2014.
[2] D. J. Bernstein and T. Lange.
Explicit-formulas database.
http://hyperelliptic.org/EFD/.
[3] G. Gallin and A. Tisserand.
Hyper-threaded multiplier for HECC.
In Proc. 51st Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, October 2017. IEEE.
[4] T. Güneysu and C. Paar.
Ultra high performance ECC over NIST primes on commercial FPGAs.
In Proc. 10th Conf. Cryptographic Hardware and Embedded Systems (CHES), volume 5154 of LNCS, pages 62–78.
Springer, August 2008.
[5] C. K. Koc, T. Acar, and B. S. Kaliski.
Analyzing and comparing Montgomery multiplication algorithms.
IEEE Micro, 16(3):26–33, June 1996.
[6] J.-Y. Lai, Y.-S. Wang, and C.-T. Huang.
High-performance architecture for elliptic curve cryptography over prime fields on FPGAs.
Interdisciplinary Information Sciences, 18(2):167–173, 2012.
[7] T. Lange.
Formulae for arithmetic on genus 2 hyperelliptic curves.
Applicable Algebra in Eng., Communication and Computing, 15(5):295–328, February 2005.
[8] Y. Ma, Z. Liu, W. Pan, and J. Jing.
A high-speed elliptic curve cryptographic processor for generic curves over GF(p).
In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437,
Burnaby, BC, Canada, August 2013. Springer.
[9] P. L. Montgomery.
Modular multiplication without trial division.
Mathematics of Computation, 44(170):519–521, April 1985.
[10] J. Renes, P. Schwabe, B. Smith, and L. Batina.
µKummer: Efficient hyperelliptic signatures and key exchange on microcontrollers.
In B. Gierlichs and A. Y. Poschmann, editors, Proc. 18th International Conference on Cryptographic Hardware and
Embedded Systems (CHES), volume 9813 of LNCS, pages 301–320, Santa Barbara, CA, USA, August 2016. Springer.
This work is funded by the H-A-H project
Thank you for your attention
ECC State-of-the-Art Comparisons vs A4
Comparisons of A4 with ECC State-of-the-Art (%)
year  ref.       target      P         LUT    FF     logic slices  DSP slices  RAM blocks  freq. [MHz]  time [ms]
2008  [4]        XC4VFX12    NIST-256  -35%   +46%   +26%          -31%        -36%        -34%         -12%
2008  [4]        XC4VFX12    NIST-256  -95%   -91%   -91%          -96%        -96%        -14%         +1000%
2014  [1]        XC6VFX760   NIST-256  -96%   n.a.   -91%          -92%        -95%        +258%        +0%
2012  [6]        XC4VFX12    GEN-256   n.a.   n.a.   -26%          +57%        n.a.        +43%         -60%
2012  [6]        XC5VLX110   GEN-256   n.a.   n.a.   -72%          +120%       n.a.        +36%         -53%
2013  [8]        XC4VLX100   GEN-256   -70%   -39%   -54%          -41%        -36%        +30%         +0%
2013  [8]        XC5VLX110T  GEN-256   -67%   -38%   -41%          -41%        -30%        +23%         +5%
2017  A4 (w 34)  XC4VLX100   GEN-128   1695   2950   2158          22          7           324          0.44
2017  A4 (w 34)  XC5VLX110T  GEN-128   1370   2953   1013          22          7           358          0.40
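Each percentage is the relative difference of the A4 value with respect to the reference value, (A4 − ref) / ref. As a sketch, the row for [8] on XC4VLX100 can be recomputed from the absolute figures of the comparison table (the helper name `rel_diff_percent` and the dictionary keys are hypothetical):

```python
def rel_diff_percent(a4, ref):
    """Relative difference of the A4 value with respect to ref, in percent."""
    return 100.0 * (a4 - ref) / ref


# A4 (w = 34) and reference [8], both on XC4VLX100, from the comparison table.
a4 = {"LUT": 1695, "FF": 2950, "logic slices": 2158, "DSP slices": 22,
      "RAM blocks": 7, "freq. [MHz]": 324, "time [ms]": 0.44}
ref = {"LUT": 5740, "FF": 4876, "logic slices": 4655, "DSP slices": 37,
       "RAM blocks": 11, "freq. [MHz]": 250, "time [ms]": 0.44}

for key in a4:
    print(f"{key:13s} {rel_diff_percent(a4[key], ref[key]):+.0f}%")
```

The printed values agree with the table row: roughly half the logic slices at the same computation time.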
Exploration Tools
Architecture Level Modeling
- Problems when exploring the solution space:
  - Many parameters: type/number of units, communications, control, ...
  - Describing accelerators in VHDL and debugging them is time-consuming
- Proposed solution: hierarchical description of accelerators
  - Allows fast exploration and validation of numerous solutions
  - Based on a library of units, fully described and implemented in VHDL
  - CCABA model defined for high-level description of accelerators
Exploration Tools
Units
- Multiplier Mult using HTMM BRAM for multiplications and squares
- Adder-subtractor AddSub
- Datapath width w_arith = 34 bits selected for Mult and AddSub after experiments
- Swapping unit CSWAP with local key management and uniform behavior
- Memory MEM based on dual-port RAMs, with width w to be selected among 34, 68 or 136 bits
Exploration Tools
Accelerator Control and Interconnect
- Instantiate required units
- Interconnect all units
  - Based on multiplexers
  - Width to be selected: w = 34, 68 or 136 bits
- Control
  - Based on a tiny 36-bit instruction set architecture
  - Scalar bits managed only in the CSWAP unit: control signals neither handle nor depend on the secret key
Selected Results
Most interesting FPGA implementation results
archi.  w [bit]  target  logic slices  DSP blocks  RAM blocks  freq. [MHz]  time [ms]
A2      34       V4      1121          11          4           330          0.56
A3      136      V4      3660          22          9           285          0.42
A4      34       V4      2158          22          7           324          0.44
A2      34       V5      541           11          4           360          0.51
A3      136      V5      1594          22          9           348          0.34
A4      34       V5      1013          22          7           358          0.40
A2      34       S6      381           11          4           293          0.63
A3      136      S6      1131          22          9           225          0.53
A4      34       S6      758           22          7           262          0.54




read   transfer operands from memory to target unit and start computation
write  transfer result from target unit to memory
wait   wait for the number of clock cycles given by the immediate value
nop    no operation (1 clock cycle)
jump   change program counter (PC) to immediate code address
[Figure: instruction format, including two 9-bit memory addresses and a 9-bit immediate value]
Architectures Details
Memory and Internal Communication Width Configurations
config.  w [bit]  s [word]  cycle(s) / mem. op.  BRAM(s)
w 34     34       4         4                    1
w 68     68       2         2                    2
w 136    136      1         1                    4
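The three rows follow a simple pattern that can be reproduced in a few lines. The sketch assumes, consistently with the table but not stated explicitly in the slides, that one GF(P) element is stored as 4 words of 34 bits (136 bits in total) and that the number of BRAMs equals w / 34. The names `mem_config`, `WORD_BITS` and `ELEMENT_WORDS` are hypothetical.

```python
WORD_BITS = 34      # arithmetic datapath width w_arith
ELEMENT_WORDS = 4   # assumption: one GF(P) element = 4 words of 34 bits


def mem_config(w):
    """Derive the memory parameters for a MEM/interconnect width of w bits."""
    words_per_access = w // WORD_BITS          # 34-bit words moved per cycle
    s = ELEMENT_WORDS // words_per_access      # accesses needed per element
    return {"s": s, "cycles_per_mem_op": s, "brams": words_per_access}


for w in (34, 68, 136):
    print(w, mem_config(w))
```

A wider memory moves a full element in fewer cycles at the price of more BRAMs, which is the trade-off visible in the RAM column of the results tables.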
