Hyper-Threaded Multiplier for HECC by Gallin, Gabriel & Tisserand, Arnaud
Hyper-Threaded Multiplier for HECC
Gabriel GALLIN and Arnaud TISSERAND
CNRS – Lab-STICC – IRISA
HAH Project
Asilomar, Oct. 2017
Public-Key Cryptography (PKC)
• Provides cryptographic primitives such as digital signatures, key exchange and specific encryption schemes
• First PKC standard: RSA
  - ≥ 2000-bit keys recommended today
  - Too costly for embedded applications
• Elliptic Curve Cryptography (ECC):
  - Better performance and lower cost than RSA
  - Allows more advanced schemes
• Hyper-Elliptic Curve Cryptography (HECC):
  - Evolution of ECC focusing on larger sets of curves
  - Expected to have a lower cost than ECC
ECC, HECC, Kummer-HECC
         F_P element size     ADD          DBL          Source
ECC      ℓ_ECC                12M + 2S     7M + 3S      [1]
HECC     ℓ_HECC ≈ ½ ℓ_ECC     40M + 4S     38M + 6S     [5]
Kummer   ℓ_HECC               19M + 12S (single entry)  [8]
• ECC:
  - F_P elements are 2× larger
  - Simpler ADD and DBL operations
• HECC:
  - Smaller F_P
  - More F_P operations per ADD / DBL
• Kummer-HECC is more efficient than ECC [8]:
  - ARM Cortex M0: up to 75% clock cycle reduction for signatures
  - AVR ATmega: up to 32% cycle reduction for Diffie-Hellman
M: multiplication, S: squaring in the field F_P
Curve-Level Operations in Kummer
• No ADD operation, but DBL is still available
• Differential addition: xADD(±P, ±Q, ±(P − Q)) → ±(P + Q)
• xADD and DBL can be combined:
  xDBLADD(±P, ±Q, ±(P − Q)) → (±[2]P, ±(P + Q))
For details see [8], [3] and [2]
xDBLADD FP Operations
[Figure: data-flow graph of the F_P operations in xDBLADD. Node labels: a, s, M, S; inputs labelled var and cst; outputs labelled OUT.]
Scalar Multiplication
Montgomery ladder based crypto scalarmult [8]:
Require: m-bit scalar $k = \sum_{i=0}^{m-1} 2^i k_i$, point P_b, cst ∈ F_P^4
Ensure: V1 = [k]P_b, V2 = [k + 1]P_b
  V1 ← cst
  V2 ← P_b
  for i = m − 1 downto 0 do
    (V1, V2) ← CSWAP(k_i, (V1, V2))
    (V1, V2) ← xDBLADD(V1, V2, P_b)
    (V1, V2) ← CSWAP(k_i, (V1, V2))
  end for
  return (V1, V2)
CSWAP(k_i, (X, Y)) returns (X, Y) if k_i = 0, else (Y, X)
• Constant time, uniform operations (independent of the key bits)
• Some parallelism between the internal F_P operations of xDBLADD
• CSWAP: very simple, but it involves secret bits (to be protected)
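The ladder structure above can be sketched in Python. This is only an illustration: the Kummer xDBLADD formulas are not given on these slides, so the hypothetical `xdbladd` below is a toy stand-in that works in the additive group of integers modulo `n` (a "point" is an integer, doubling is `2*v1`, addition is `v1 + v2`). The names `cswap`, `xdbladd`, `ladder` and the modulus `n` are assumptions made for this sketch, and Python integers give no constant-time guarantee.

```python
def cswap(bit, pair):
    """Return pair unchanged if bit == 0, swapped if bit == 1.

    Uses a mask instead of a branch, mirroring the usual CSWAP idea.
    Only valid for non-negative integers in this toy model.
    """
    x, y = pair
    mask = -bit                 # 0 when bit == 0, -1 (all ones) when bit == 1
    t = mask & (x ^ y)
    return x ^ t, y ^ t


def xdbladd(v1, v2, pb, n):
    """Toy stand-in for xDBLADD: returns ([2]V1, V1 + V2) in (Z/nZ, +).

    The real Kummer operation also uses the difference point pb;
    it is unused here because ordinary addition is available in the toy group.
    """
    return (2 * v1) % n, (v1 + v2) % n


def ladder(k, m, pb, n):
    """Montgomery ladder over the m bits of k; returns ([k]pb, [k+1]pb)."""
    v1, v2 = 0, pb % n          # identity element plays the role of cst
    for i in range(m - 1, -1, -1):
        ki = (k >> i) & 1
        v1, v2 = cswap(ki, (v1, v2))
        v1, v2 = xdbladd(v1, v2, pb, n)
        v1, v2 = cswap(ki, (v1, v2))
    return v1, v2


if __name__ == "__main__":
    print(ladder(0b1011, 4, 5, 1009))
```

Every iteration performs the same sequence of operations whatever the key bit is; only the CSWAP selector depends on the secret.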
Montgomery Modular Multiplication (MMM)
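$R = A \times B$                          (n × n → 2n bits)
$q = (R \times (-P^{-1})) \bmod 2^n$      (n × n → n bits)
$qP = q \times P$                         (n × n → 2n bits)
[Figure: MMM data flow from the operands A and B through R and q to the result S]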
• Objective: A × B mod P
• Proposed in [7]
• Variants are the current state of the art for F_P multiplication (with generic P)
• The final reduction step discards the n LSBs
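The three products above can be checked with a short Python sketch. It assumes an odd modulus `p` smaller than 2^n and operands already in Montgomery form; the function name `mont_mul`, the final conditional subtraction and the example modulus 2^127 − 1 are choices made for this illustration, not details taken from the slides. `pow(p, -1, m)` requires Python 3.8 or later.

```python
def mont_mul(a, b, p, n):
    """Return a * b * 2^(-n) mod p for odd p < 2^n and 0 <= a, b < p."""
    mask = (1 << n) - 1
    p_neg_inv = (-pow(p, -1, 1 << n)) & mask   # -P^(-1) mod 2^n
    r = a * b                                  # n x n -> 2n bits
    q = (r * p_neg_inv) & mask                 # n x n -> n bits (low half only)
    s = (r + q * p) >> n                       # n LSBs are zero and are discarded
    if s >= p:                                 # assumed final correction step
        s -= p
    return s


if __name__ == "__main__":
    p, n = 2**127 - 1, 128
    a, b = 123456789, 987654321
    a_m = (a << n) % p                         # to Montgomery form
    b_m = (b << n) % p
    prod_m = mont_mul(a_m, b_m, p, n)          # a * b * 2^n mod p
    print(mont_mul(prod_m, 1, p, n) == (a * b) % p)
```

Multiplying the result by 1 in the last line leaves Montgomery form, which is how the sketch recovers A × B mod P.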
Modular Multiplication: Dependencies Problem
• In practice, MMM is interleaved
  - Operands are split into s words of w bits such that n = s × w
  - Iterations over partial products and reductions on words
  - Coarsely Integrated Operand Scanning (CIOS) from [4]
• Impact on hardware implementation
  - Dependencies → latencies between internal iterations
  - Hardware pipeline in DSP slices cannot be filled efficiently
• Proposed solution: Hyper-Threaded Modular Multiplier (HTMM)
  - Based on simple CIOS algorithm
  - Use idle stages to compute other independent MMMs in parallel
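The dependency problem can be seen in a word-serial sketch of interleaved Montgomery multiplication. This is a simplification in the spirit of CIOS [4], not the CIOS algorithm itself: the inner word loops over B and P are collapsed into Python big-integer operations, and only the outer loop over the s words of A is kept. The function name `interleaved_mont_mul`, the final conditional subtraction and the example modulus are assumptions of this sketch.

```python
def interleaved_mont_mul(a, b, p, w, s):
    """Return a * b * 2^(-s*w) mod p, scanning a in s words of w bits.

    Requires p odd, p < 2^(s*w) and 0 <= a, b < p.
    """
    mask = (1 << w) - 1
    p0_neg_inv = (-pow(p, -1, 1 << w)) & mask  # -P^(-1) mod 2^w, one word
    t = 0
    for i in range(s):
        a_i = (a >> (w * i)) & mask            # i-th word of A
        t = t + a_i * b                        # partial product, needs previous t
        q_i = ((t & mask) * p0_neg_inv) & mask # needs the low word of the new t
        t = (t + q_i * p) >> w                 # needs q_i; low word is zero
    if t >= p:                                 # assumed final correction step
        t -= p
    return t


if __name__ == "__main__":
    p, w, s = 2**127 - 1, 34, 4                # s * w = 136 bits
    a = 0x1234567890ABCDEF1234567890ABCDEF % p
    b = 0x0FEDCBA0987654321FEDCBA098765432 % p
    res = interleaved_mont_mul(a, b, p, w, s)
    print((res * pow(2, w * s, p)) % p == (a * b) % p)
```

Each of the three statements in the loop body consumes the result of the previous one, and the next iteration consumes the final `t`. In a pipelined datapath this chain is what leaves stages idle for a single multiplication.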
HTMM Internal Architecture
• HTMM architecture: 3 hardware stages
  - Stages are fully pipelined (several clock cycles per stage)
  - 3 to 4 DSP slices in each stage
[Figure: three-stage pipeline. Stage 1 computes t from A_i, B and S; stage 2 computes q_i from t_0; stage 3 computes S from q_i and t. A timing diagram shows independent multiplications (numbered 0, 1, 2, then 3, 4, ...) interleaved in stages 1 to 3 over time, with operands A(0), B(0), ..., A(5), B(5) entering and results P(0), P(1), P(2), ... leaving.]
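The benefit of filling idle stages with independent multiplications can be illustrated with a toy schedule. The model below is a deliberate simplification and not the actual HTMM timing: each stage is assumed to take exactly one abstract time slot, each iteration of a multiplication must pass through the stages in order, and an iteration can only start once the previous iteration of the same multiplication has left the last stage. The names `stage_occupancy` and `utilization` are hypothetical.

```python
from fractions import Fraction


def stage_occupancy(num_threads, num_iters, num_stages=3):
    """Map (time_slot, stage) -> (thread, iteration) for a round-robin schedule."""
    if num_threads > num_stages:
        raise ValueError("this toy model needs num_threads <= num_stages")
    busy = {}
    for thread in range(num_threads):
        for it in range(num_iters):
            for stage in range(num_stages):
                # Thread k starts k slots late; iterations of one thread
                # follow each other with no overlap.
                slot = thread + it * num_stages + stage
                assert (slot, stage) not in busy   # no two threads collide
                busy[(slot, stage)] = (thread, it)
    return busy


def utilization(num_threads, num_iters, num_stages=3):
    """Fraction of (slot, stage) pairs doing useful work."""
    busy = stage_occupancy(num_threads, num_iters, num_stages)
    total_slots = max(slot for slot, _ in busy) + 1
    return Fraction(len(busy), total_slots * num_stages)


if __name__ == "__main__":
    print(utilization(1, 4))   # one multiplication alone
    print(utilization(3, 4))   # three interleaved multiplications
```

In this model a single multiplication keeps one stage out of three busy, while three interleaved multiplications keep almost all stages busy at the cost of a few extra slots of latency.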
HTMM Internal Architecture (details)
[Figure: detailed HTMM data path built from cascaded DSP slices (ports A, B, C, ACIN, PCIN, PCOUT) with 17-bit right wire shifts between slices. The 34-bit words are split into 17-bit halves ([16:0] and [33:17]) for A_i, B_j, P_j, P'_0, t_0 and q_i; intermediate values R_j and M_j span bits [67:0]; outputs are t_j[33:0] and S_j[33:0].]
HTMM Implementations
• Xilinx FPGAs
  - Virtex 4 XC4VLX100 (V4)
  - Virtex 5 XC5VLX110T (V5)
  - Spartan 6 XC6SLX75 (S6)
• Comparison with the fastest MMM implementation in the literature
  - Design presented in [6]
  - Implemented on the same FPGAs for fair comparison
• 2 versions of HTMM:
  - HTMM DRAM: operands stored in FPGA slices (LUTs)
  - HTMM BRAM: operands stored in FPGA BRAMs
• Parameters for HTMM:
  - P: 128 bits
  - w = 34 bits, s = 4
  - Operand size n = s × w = 136 bits
HTMM Implementations Results
Results for 3 independent multiplications:
Unit        FPGA  DSP  BRAM     FF    LUT   Slices  Freq.  Nb.     Time
                       18K/9K                       (MHz)  cycles  (ns)
[6]         V4    21   6/0      1311  1201  879     252    65      258
[6]         V5    21   6/0      1310  1027  406     296    65      220
[6]         S6    21   0/6      1280  1600  540     210    65      309
HTMM DRAM   V4    11   0/0      1638  1128  1346    330    79      239
HTMM DRAM   V5    11   0/0      1616  652   517     400    79      198
HTMM DRAM   S6    11   0/0      1631  1344  483     302    79      261
HTMM BRAM   V4    11   2/0      615   364   449     328    79      241
HTMM BRAM   V5    11   2/0      593   371   249     357    79      221
HTMM BRAM   S6    11   0/2      587   359   180     304    79      260
HTMM BRAM vs. [6] on S6: -47% DSPs, -66% BRAMs, -66% slices, -15% duration
For a single multiplication, HTMM is less efficient (69 cycles versus 25)
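The time figures follow from the cycle count and the clock frequency, and the gains are relative differences. A small sketch of that arithmetic, using the S6 values quoted above (65 cycles at 210 MHz for [6], 79 cycles at 304 MHz for HTMM BRAM), is given below. The helper names `time_ns` and `relative_change` are made up for this example, and the computed values agree with the quoted ones only to within rounding.

```python
def time_ns(cycles, freq_mhz):
    """Execution time in nanoseconds for a cycle count at a clock in MHz."""
    return cycles / freq_mhz * 1000.0


def relative_change(new, ref):
    """Relative change of new with respect to ref, in percent."""
    return (new - ref) / ref * 100.0


if __name__ == "__main__":
    t_ref = time_ns(65, 210)    # design from [6] on S6
    t_htmm = time_ns(79, 304)   # HTMM BRAM on S6
    print(f"[6]:       {t_ref:.1f} ns")
    print(f"HTMM BRAM: {t_htmm:.1f} ns")
    print(f"DSPs:      {relative_change(11, 21):.1f} %")
    print(f"slices:    {relative_change(180, 540):.1f} %")
    print(f"duration:  {relative_change(t_htmm, t_ref):.1f} %")
```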
Typical Architecture Model
[Figure: architecture model with a data memory, a program memory, a global control block (Ctrl), data MUX / DMUX, and the units ADD/SUB, MULTIPLIER and CSWAP with output registers (OReg).]
Parameters specified at design time:
  - Width w and number of words s for internal communications (s × w = n)
  - Types and number of units
256b ECC vs 128b HECC (similar theoretical security)
FPGA  Version  DSP  BRAM  Slices  Freq.  Nb.      Time
                    18K           (MHz)  cycles   (ms)
V4    ECC      37   11    4655    250    109,297  0.44
V4    H1       11   7     1413    330    183,051  0.55
V4    H2       22   9     2356    330    115,211  0.35
V5    ECC      37   10    1725    291    109,297  0.38
V5    H1       11   7     873     360    183,051  0.51
V5    H2       22   9     1542    360    115,211  0.32
Gain of H1 on V5: -70% DSPs, -30% BRAMs, -49% slices, +30% duration
Gain of H2 on V5: -40% DSPs, -10% BRAMs, -10% slices, -15% duration
ECC results from [6]
Conclusions and Perspectives
• HTMM is more efficient than the state of the art for 3 independent MMs
  - Leads to better area / computation time trade-offs
  - More hardwired resources are active at each clock cycle
• µKummer-based HECC is an efficient alternative to ECC
  - More complex formulas but larger internal parallelism
  - Large exploration space for architectures and arithmetic
• Future work
  - Study other HTMM versions
  - Study the impact of hyper-threaded schemes on energy consumption
  - Study the impact of hyper-threaded schemes on side-channel leakage
References
[1] D. J. Bernstein and T. Lange.
Explicit-formulas database.
http://hyperelliptic.org/EFD/.
[2] Joppe W. Bos, Craig Costello, Huseyin Hisil, and Kristin Lauter.
Fast cryptography in genus 2.
Journal of Cryptology, 29(1):28–60, January 2016.
[3] Pierrick Gaudry.
Fast genus 2 arithmetic based on theta functions.
Journal of Mathematical Cryptology, 1(3):243–265, 2007.
[4] Çetin K. Koç, Tolga Acar, and Burton S. Kaliski, Jr.
Analyzing and comparing Montgomery multiplication algorithms.
Micro, IEEE, 16(3):26–33, June 1996.
[5] T. Lange.
Formulae for Arithmetic on Genus 2 Hyperelliptic Curves.
Applicable Algebra in Engineering, Communication and Computing, 15(5):295–328, February 2005.
[6] Yuan Ma, Zongbin Liu, Wuqiong Pan, and Jiwu Jing.
A high-speed elliptic curve cryptographic processor for generic curves over GF(p).
In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437. Springer, August 2013.
[7] Peter L. Montgomery.
Modular multiplication without trial division.
Mathematics of Computation, 44(170):519–521, April 1985.
[8] Joost Renes, Peter Schwabe, Benjamin Smith, and Lejla Batina.
µKummer: Efficient hyperelliptic signatures and key exchange on microcontrollers.
In Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 9813 of LNCS, pages 301–320. Springer, August 2016.
