Turkish Journal of Electrical Engineering and Computer Sciences
Volume 29

Number 1

Article 21

1-1-2021

Comparison of RISC-V and transport triggered architectures for a
postquantumcryptography application
LATİF AKÇAY
SIDDIKA BERNA ÖRS YALÇIN

Follow this and additional works at: https://journals.tubitak.gov.tr/elektrik
Part of the Computer Engineering Commons, Computer Sciences Commons, and the Electrical and
Computer Engineering Commons

Recommended Citation
AKÇAY, LATİF and YALÇIN, SIDDIKA BERNA ÖRS (2021) "Comparison of RISC-V and transport triggered
architectures for a postquantumcryptography application," Turkish Journal of Electrical Engineering and
Computer Sciences: Vol. 29: No. 1, Article 21. https://doi.org/10.3906/elk-2003-27
Available at: https://journals.tubitak.gov.tr/elektrik/vol29/iss1/21

This Article is brought to you for free and open access by TÜBİTAK Academic Journals. It has been accepted for
inclusion in Turkish Journal of Electrical Engineering and Computer Sciences by an authorized editor of TÜBİTAK
Academic Journals. For more information, please contact academic.publications@tubitak.gov.tr.

Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/

Research Article

Turk J Elec Eng & Comp Sci
(2021) 29: 321 – 333
© TÜBİTAK
doi:10.3906/elk-2003-27

Comparison of RISC-V and transport triggered architectures for a postquantum
cryptography application
Latif AKÇAY∗, Berna ÖRS
Department of Electronics and Communication Engineering, Faculty of Electrical and Electronics Engineering,
İstanbul Technical University, İstanbul, Turkey
Received: 04.03.2020

•

Accepted/Published Online: 21.08.2020

•

Final Version: 27.01.2021

Abstract: Cryptography is one of the basic phenomena of security systems. However, some of the widely used publickey cryptography algorithms can be broken by using quantum computers. Therefore, many postquantum cryptography
algorithms are proposed in recent years to handle this issue. NTRU (Nth degree truncated polynomial ring units) is
one of the most important of these quantum-safe algorithms. Besides the importance of cryptography algorithms, the
architecture where they are implemented is also essential. In this study, we developed an NTRU public key cryptosystem
application and designed several processors to compare them in many aspects. We address two different architectures
in this work. The RISC-V (reduced instruction set computer-V) is chosen as it is the most lately version of classical
RISC architecture. As competitor to this, we preferred transport triggered architecture (TTA) which offers high level
customization and scalability. Details of all different implementations and the test results obtained with them are shared
and discussed.
Key words: Lattice-based cryptography, secure communication, application specific processor design, open source

1. Introduction
The importance of cryptography is especially increasing in recent years due to the need for information security.
Today, cryptography is widely used in many areas such as secure communication, data privacy, or secure
authentication [1]. Public key cryptography algorithms like Rivest–Shami–Adleman (RSA) and elliptic curve
cryptography (ECC) are believed to be secure enough for brute-force attacks and mathematical cryptanalysis
techniques done by using classical computers [2, 3]. However, researches done on development of quantum
computers has opened a new field in cryptography which is called postquantum cryptography [4].
Nth degree truncated polynomial ring units (NTRU) is a public key cryptosystem which is not known
to be breakable by using quantum computers [5]. It was proposed in 1996 by three mathematicians: Jeffrey
Hoffstein, Joseph H.Silverman, and Jill Pipher. Although it is not a new or widely used method, it is becoming
more and more important today due to the need for postquantum cryptography. NTRU features reasonably
short and easily created keys, has high speed, and requires low memory compared to RSA and ECC [5]. It was
the first public key cryptography algorithm that does not depend on integer factorization or discrete logarithm
problems [6]. In order to be safe against the attacks done by using quantum computers, algorithms such as
NTRU are strongly needed.
∗ Correspondence:

akcaylatif@gmail.com

321
This work is licensed under a Creative Commons Attribution 4.0 International License.

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

Application-specific processors (ASIPs) are widely used in almost all areas of embedded electronics,
because the electronics industry needs low power consuming products that utilize small area or run at high
speed. More importantly, two or more of these demands are often desired together.
RISC-V and TTA (transport triggered architecture) are two very different processor architectures [7, 8].
While the RISC-V is the newest example of reduced instruction set computing (RISC) design concept, TTA is
more like very long instruction word (VLIW) architecture [9]. On the other hand, both of them can be used
for designing ASIPs and provide different kind of advantages. In this work, we developed a NTRU public key
cryptosystem application and run it on the processors which have RISC-V and TTA architectures. Then, we
analyzed speed, resource utilization, power and energy consumption of them.
The rest of the paper is organized as follows. We give related work in the second section. In the third
section, post-quantum cryptography and mathematical background of NTRU algorithm are explained in detail.
RISC-V and TTA processor architectures are summarized in the fourth section. Then, we share details of our
processor designs and our NTRU application in the fifth section. Finally, we give experimental results and
conclude the paper in the sixth and seventh sections, respectively.
2. Related works
To the best of our knowledge, there is not yet any work on comparison of RISC and TTA architectures for the
NTRU algorithm. Furthermore, there are not many studies about a general comparison of these architectures.
A related study on this topic is published by Pekka Jääskeläinen et al [10]. They compared a dual-issue TTA
processor with a multiissue VLIW and a single-issue RISC processors to evaluate the trade-offs between them.
Another important work was done by Yi Fan He et al [11]. They shared power consumption and performance
results of TTA and its RISC counterpart for IDCT (inverse discrete cosine transform), FIR, and histogram
applications. This study also introduces an improved TTA which aims to reduce its drawbacks.
There are a few papers on NTRU-specific processor design in literature. An energy eﬀicient implementation for small devices was done by Kaps [12]. Another low-cost implementation can be found in the work of
Ali Can Atıcı et al [13]. An eﬀicient GPU (graphics processing unit) implementation of NTRU was published
by Jens Hermans et al. by using the CUDA platform [14]. There is also an optimized polynomial arithmetic
library work done by Wei Dai et al [15]. It was introduced for accelerating ring operations on NVidia GPUs
(NVIDIA Cooperation, California, CA, USA).
3. Post-quantum cyrptography and NTRU
Decomposition of a composite integer number into it’s factors is called integer factorization problem. If the
number is large enough, solution of the problem is computationally ineﬀicient. In fact, this is the phenomenon
that enables the creation of today’s public key cryptography algorithms such as ECC and RSA. Research on
this subject has shown that these algorithms are still quite powerful against classical computers [16]. However,
this is not the case for quantum computers. According to the American mathematician Peter Shor, quantum
computers can solve factorization problems faster than the classical computers [17]. Shor’s algorithm shows that
a quantum computer works in polynomial time for a given factorization problem while classical computers works
in subexponential time. This is very threatening for widely used and very popular public key cryptography
algorithms such as RSA and ECC1 . For this reason, we need to design quantum-safe cryptosystems without
1 Tufts University. Computer System Security. [online]. Website http://www.cs.tufts.edu/comp/116/archive/fall2015/zkirsch.pdf
[accessed 24 December 2019]

322

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

being late, as personal and sensitive information which are stored safely today may be deciphered ten years
later by a quantum computer.
The National Institute of Standards and Technology (NIST) started a process and called for submissions
for design, evaluation, and standardization of public key quantum-resistant cryptographic algorithms in 2016
2

. In January 2019, NIST has revealed the second round candidates which consist of 26 algorithms. There are

lattice-based, hash-based, code-based, and multivariate-quadratic-based approaches to the problem 3 .
3.1. NTRU public key cryptosystem
NTRU is a lattice-based approach for public key cryptography and mainly uses polynomial addition and
multiplication. The power of the algorithm comes from the hardness of the shortest vector problem (SVP)
in a lattice [18]. The SVP is to find the Euclidean length of a nonzero vector in a given lattice. Various versions
of the SVP is known to be NP-hard [19]. Besides, similar problems are defined such as closest vector problem
(CVP) and shortest independent ector problem (SIVP) in lattice mathematics [20, 21].
NTRU operations (polynomial addition, multiplication, and multiplicative inverse) are done in a polynomial ring R = Z[x]/(xN − 1). Multiplication of two polynomials in this ring refers to the cyclic convolution
of them [22]. Coeﬀicients of the obtained polynomials are reduced using either modulo q or modulo p and
sometimes are needed to be centerlifted (shifting the coeﬀicients in to a range). In addition, extended Euclidean
algorithm (EEA) is used to compute polynomial inversion operations [23]. More detailed information about the
mathematics of NTRU can be found in [5]. Here we just summarized the algorithm itself below.
The NTRU scheme uses three integer numbers (N, p, q) and three ring polynomials (f, g, r) such that;
• N is prime number and determines the maximum degree of ring polynomials,
• p and q are two relatively prime numbers,
• q must be much larger than p (in general p is taken as 3 ),
• f is a secret polynomial in the ring R with coeﬀicients in (−p/2, p/2)
• g is an initially secret polynomial in the ring R with coeﬀicients in (−p/2, p/2)
• r is a random blinding polynomial in the ring R with reduced coeﬀicients modulo p
NTRU key generation: After choosing f and g polynomials, it must be checked whether the
polynomial f has multiplicative inverses, F p and F q , over the ring R such that;
f ∗ F p = 1(modp)

(1)

f ∗ F q = 1((modq)

(2)

If not, another polynomial f must be selected. The secret keys of the system are determined as f and
F q . Public key polynomial h is calculated as follows;
h = F q ∗ g(modq)

(3)

2 NIST.

Post-Quantum Cryptography. [online] Website https://csrc.nist.gov/Projects/Post-Quantum-Cryptography [accessed 24
December 2019]
3 PQCRYPTO. Post-Quantum Cryptography. [online] Website https://pqcrypto.org/ [accessed 24 December 2019]

323

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

User Alice hides the secret keys but reveals the public key h and parameters N, p, q to everyone.
NTRU encryption: Let’s suppose user Bob wants to send a message m to Alice. He must form the
message m in a ring polynomial with coeﬀicients in (−p/2, p/2). Then, he must use Alice’s public key and
compute the encrypted message as follows;
e = r ∗ h + m(modq)

(4)

where the obtained encrypted message e is again a polynomial in the ring R .
NTRU decryption: Alice decrypts the incoming message e to the original message m as follows;
a = f ∗ e(modq)

(5)

Then, Allice needs to centerlift the coeﬀicients of a into (−q/2, q/2).
d = F p ∗ a(modp)

(6)

Finally, Allice needs to centerlift the coeﬀicients of d in to (−p/2, p/2) which retrieves m;
m=d

(7)

According to literature, NTRU has many advantages over RSA and ECC [5]. Faster key generation
(especially for larger key sizes), faster encryption-decryption operations, and also low memory usage make
NTRU a very appropriate candidate for quantum-age public key cryptography applications.
4. Preferred architectures for comparison
Two different processor architectures are selected and examined for this work: RISC-V and TTA [7, 8]. There
are a few reasons for this choice. First of all, both RISC-V and TTA architectures are open source and royalty
free. The second reason is that the both of the architectures can be easily implemented on an FPGA. In
addition, the processors which have RISC and TTA architectures are customizable at different levels, usually
occupy smaller area and consume less power. We briefly introduce both of them below.
4.1. RISC-V
RISC-V is a new instruction set architecture (ISA) that was originally designed to support computer architecture
research and education, but it is also expected to become a standard, free, and open architecture for industrial
implementations [7].
The RISC-V ISA is available in 32, 64, and 128 bit versions. It includes a small base integer ISA and
optional standard extensions. Besides, it can be extended by other designers. However, base and standard
extensions were frozen by the RISC-V designers to provide the RISC-V compatibility. As a design philosophy,
the ISA avoids particular microarchitectural or implementation technology-dependent features. In addition, it
comes with a free BSD open source license that does not require patent to implement a RISC-V processor.
RISC-V project is maintained by RISC-V Foundation in University of Berkeley, California. The project
has attracted a great deal of attention all over the world. Thus, it is now called ”Linux of the hardware world”.
Therefore, the RISC-V ISA is much more prominent than the other alternative open source RISC architectures
324

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

like OpenRISC 1000 [24]. More information about the RISC-V ISA and its implementations can be found in
oﬀicial web page 4 .
4.2. Transport triggered architecture
TTA is a highly customizable processor design approach in which the moving instructions on transport busses
trigger the functional units (FUs). In this respect, TTA has a similar methodology with the VLIW processors.
However, there are some differences between these two [25]. In the VLIW architecure, the FUs are always
connected to a multiport register file (RF), but in TTA there are multiple register files, and they are connected
to the interconnection network, not directly to the FUs. Therefore, in TTA processors, result of an operation
can directly be moved to another FUs instead of RF. This difference provides extensive register bypassing and
reduces data path complexity. A simple TTA processor structure is shown in Figure 1.
IMMEDIATE UNIT
DATA
MEMORY

GLOBAL CONTROL UNIT
FUNCTIONAL UNIT
REGISTER FILE

LOAD- STORE UNIT

LSU

FU

FU

IMU

RF

GCU

TRANSPORT
BUSES

INSTRUCTION
MEMORY

SOCKETS

Figure 1. General structure of a TTA processor.

There may be different number of transport buses in a TTA processor. Each bus can be connected to
the FUs, RFs, immediate units (IMUs) or global control units (GCUs) as shown in Figure 1. The architecture
is very appropriate for instruction level parallelism (ILP). So, TTA processors can accelerate many applications
significantly. In addition to ILP, one can design custom FUs for a specific application and integrate it to the
processor. Custom FU design ability makes TTA a very good alternative for ASIP design and development.
4.3. TTA-based codesign environment
TTA-based codesign environment (TCE) is an open source tool set for designing TTA processors developed by
Tampere University. By using TCE, one can create a TTA processor, compile a program for this processor,
simulate a code, analyze the performance, and generate HDL implementation of the design [26]. Additionally,
there are many useful tools with well-designed graphical user interfaces (GUIs).
4 RISC-V

Foundation. [online] Website https://riscv.org/ [accessed 24 December 2019]

325

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

TCE takes C, C++, and OpenCL source codes as input. Also, the tools need an architecture definition
file (.adf) that contains the design definitions of the template processor. LLVM compiler is used for compiling
and generating architecture-specific machine codes. A simplified overview of the TCE is shown in Figure 2.
Detailed information about the tool set can be obtained from the oﬀicial document [26].

Figure 2. Overview of TTA codesign environment [26].

5. The NTRU application and prepared processors
The data memory required by the NTRU codes available in the literature is too large for the RISC processors
used in this study. So, we developed a light-weight C application which realize key generation, encryption,
and decryption phases of NTRU public key cryptosystem. At the end of the decryption phase, plain text and
decrypted messages are compared to ensure that the application works correctly. The application is compiled
and debugged with both RISC-V and TCE toolchains. It is portable to any system, as we only used the standard
C libraries.
Eight different processor designs have been prepared for NTRU application. Five different TTA processors
have been designed for comparison with the selected three different RISC-V processors.
326

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

5.1. Selected RISC-V processors
The RISC-V ISA is defined as a base integer ISA, which must be present in any implementation plus optional
extensions to the base ISA [7]. Firstly, we have selected a very simple RISC-V processor that implements a
classical five-stage pipeline. Potato is an open source processor written in VHDL 5 . The processor supports
32-bit RISC-V base instruction set (RV32I).
PULPino is also an open source, configurable, single-core, 32-bit microcontroller system designed by
ETH Zurich and University of Bologna [27]. Riscy version of the core includes four-stage pipeline and has fully
support for RISC-V base, compressed and multiplication instruction sets (RV32IMC). Overall structure of Riscy
core is shown in Figure 3.

Figure 3. Overview of Riscy core [27].

Ibex is a small and eﬀicient, 32-bit, in-order RISC-V core with a two-stage pipeline that implements the
RV32IMC instruction sets. It is based on a simpler version of Riscy core which is called Zero-riscy [28]. Ibex is
an area optimized processor and aimed to achieve low power consumption. The project is further developed by
lowRISC, which is a nonprofit company 6 . The block design of the core is demonstrated in Figure 4.
Although there are more advanced RISC-V candidates in the literature, we chose these three processors
because of small area and low power features. For a fair comparison with TTA designs, peripherals such as
UART, GPIO, timer, and instruction cache are removed from the RISC processors.
5.2. Designed transport triggered architecture processors
We used TCE for designing the TTA processors. First of all, the processor TTA-P1 is designed using processor
designer (ProDe) of the toolset. As seen in Table 1, TTA-P1 includes one bus, one arithmetic unit (ART),
one logic unit (LOG), one shift unit (SHF), which are all some kind of FUs. Additionally, we integrated one
40 x 32-bit and one 2 x 1-bit register files. Global control unit (GCU) and load-store unit (LSU) must be
5 The Potato project. Processor datasheet.[online] Website https://github.com/skordal/potato/blob/master/docs/, [accessed 24
December2019]
6 lowRISC. Ibex User Manual. [online] Website https://ibex-core.readthedocs.io/en/latest/ [accessed 24 December 2019]

327

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

Figure 4. Overview of Zero-riscy core [28].

used for data memory connection and general functionality of the system. The detailed structure of the TTAP1 processor is shown in Figure 5. After the design is complete, the NTRU application is compiled for this
architecture. Then, the TCE simulator (proxim) is used for analyzing total number of cycles (NoC) that mostly
used FUs and RF occupation statistics. Other TTA processors are designed according to the results obtained
from these analyzes. The block design of TTA-P1, TTA-P2, and TTA-P3 processors are demonstrated in
Figure 5. Similarly, configurations of the processors TTA-P4 and TTA-P5 are shown in Figure 6, respectively.
As shown in Table 1, additional FUs like MUL (multiplication), ADD (addition), DIV-MOD (division and
modulo operation) are connected to the processors to improve performance. Besides, more transport buses,
LSUs and RFs are also added to the designs.
The ART unit includes simple arithmetic operations which are addition, subtraction, equality check, and
greatness comparison. The SHF unit is responsible from the shifting operations to right or to left. Basic logical
operations AND, OR, and XOR are done by the LOG unit. The GCU manages jump and call operations of a
running program while the LSU load or store data in varying lengths from 8-bit to 32-bit. The TCE makes it
possible to modify the operations contained in these blocks or to add completely new units. However, all units
are designed to have exactly the same configuration on the processors in our study.
Instruction width (IW) of the proposed designs can be seen in the rightmost column of Table 1. VHDL
implementations of the processors were generated by using the processor generator (ProGe) tool. We made
synthesis and simulation of all designs for the same FPGA (xc7a100t-1) by using Xilinx Vivado [29, 30]. Estimation of power and energy analyzes were performed with Vivado power analyzer by generating postsynthesis
simulation activity file (SAIF). We have evaluated NoC values as a performance indicator. Resource utilization
results are given in terms of number of look-up tables (LUT) and registers (SREG) used in the designs. Also,
instantaneous power values and total energy consumption values are given in Watt and Joule units. We obtained
the results for all RISC-V and TTA processors. All integrated FUs and other architectural units included in
designs can be found in Table 1. Also, we share experimental test results for NTRU application running on the
processors in Table 2.
328

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

LSU

ART

LOG

SHF

RF

RF

GCU

TTA-P1

LSU

ART

ART

LOG

SHF

RF

RF

GCU

LSU

ART

ART

LOG

SHF

RF

RF

RF

TTA-P2

LSU

GCU

TTA-P3

Figure 5. Configuration of TTA-P1, TTA-P2, and TTA-P3 processors.

LSU

LSU

ART

ART

LOG

SHF

ADD

ADD

MUL

RF

RF

GCU

RF

TTA-P4

LSU

LSU

ART

ART

LOG

SHF

ADD

ADD

MUL

DIV_MO
D

RF

RF

RF

GCU

TTA-P5

Figure 6. Configuration of TTA-P4 and TTA-P5 processors.

6. Result and discussion
Comparison of instantaneous power consumption and performance, in terms of run time of the application,
can be seen in Figure 7. Likewise, the overall energy consumption and resource utilization values for all
eight processors are indicated in Figure 8. Although it includes five-stage pipeline, it is obvious that the
RV32I processor exhibits the worst results on performance and energy. It can be seen that these values are
improved considerably with the addition of the multiplication and compressed instruction sets on RV32IMC
processors. However, the integration of these instructions increases the resource utilization values reasonably.
TTA processors offered better results in terms of instantaneous power and resource utilization when compared
to RISC-V processors. The overall energy consumption results are particularly striking. In this sense, TTA
processors are undoubtedly more advantageous. Resource utilization of TTA processors increases as the number
of parallel buses and functional units increase. But, we think that this is acceptable as performance level
329

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

Table 1. Architectural details of compared RISC-V and TTA processors.

Processor
RV32I
RV32IMC
RV32IMC
TTA-P1
TTA-P2
TTA-P3
TTA-P4
TTA-P5

Bus
1
1
1
1
2
4
4
4

Function units
Base instructions, 5-stage
Base, multiplication, compressed instructions, 2-stage
Base, multiplication, compressed instructions, 4-stage
1xART, 1xLOG, 1xSHF
2xART, 1xLOG, 1xSHF
2xART, 1xLOG, 1xSHF
4xART, 1xLOG, 2xSHF, 1xMUL, 2xADD
4xART, 1xLOG, 2xSHF, 1xMUL, 2xADD, 1xDIV-MOD

LSU
1
1
1
1
1
2
2
2

RF
1
1
1
1xRF,
1xRF,
2xRF,
2xRF,
2xRF,

1xBL
1xBL
1xBL
1xBL
1xBL

IW
32
32
32
43
86
176
176
176

Table 2. Experimental results of NTRU application on RISC-V and TTA processors.

Processor
RV32I
RV32IMC
RV32IMC
TTA-P1
TTA-P2
TTA-P3
TTA-P4
TTA-P5

Frequency (MHz)
50
50
40
125
117
100
100
90

NoC
788173585
94398939
52489786
209124976
121912681
85925831
70823327
26690829

Area (LUT, SREG)
2990, 2044
2896, 1925
7521, 2599
1069, 1877
1585, 2084
2943, 2523
3107, 2772
4495, 4875

Power (W)
0.219
0.224
0.238
0.120
0.126
0.135
0.144
0.192

Energy (J)
3,452
0.422
0.312
0,200
0.124
0.103
0.085
0.056

improves significantly and energy consumption decreases. All of these results approve that TTA may be a serious
option to design application specific processors for NTRU based systems. It also appears that TTA processors
may be a particularly good choice for all Lattice-based cryptography applications. However, this should not
be perceived as a conclusive result. Because both architectures offer quite a number of customization facilities.
For instance, RISC-V processors can be designed to deliver higher performance with standard and nonstandard
instruction set extensions. In addition, more stages of pipeline or out-of-order design methodologies can be
implemented. Of course, all of these techniques may lead to increase the required chip area and instantaneous
power. Nevertheless, the energy consumption is expected to be less.
As stated in the previous paragraph, performance increases rapidly as the number of parallel buses
increases for TTA processors. This can be considered as a natural result of instruction-level parallelism. In
the case of our NTRU application, we have experienced that the performance does not change much even if
the number of buses is more than four for TTA-P3, TTA-P4, and TTA-P5 processors. Similarly, as seen in
RISC-V processors, another important factor affecting the performance is the addition of custom FUs. In this
way, TTA-P5 processor reaches the best values in terms of performance. But, this improvement causes to
double the required LUT and slice registers. However, that does not raised the instantaneous power level much.
Furthermore, total energy consumption improves considerably. It can be further enhanced with further analysis
of the processor design.
Of course, custom peripherals, more special FUs or custom instruction set extension methods may be
used for both RISC-V and TTA processors. We plan to make various processor implementations to apply these
330

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

16.000

Performance (s)

15.763

Instantaneous Power (W)
0.25

14.000

0.219

0.238

0.224

0.192

0.2

12.000
10.000

0.15

8.000
6.000

0.121

0.135

0.126

0.144

0.1

4.000

1.888

2.000

1.312

1.673

1.036

0.859

0.708

0.05

0.296

0.000

RV32I RV32IMCRV32IMC TTA-P1 TTA-P2 TTA-P3 TTA-P4 TTA-P5
(Potato) (Ibex) (Pulpino)

0

RV32I RV32IMCRV32IMC TTA-P1 TTA-P2 TTA-P3 TTA-P4 TTA-P5
(Potato) (Ibex) (Pulpino)

Figure 7. Comparison of performance and instantaneous power consumption.

Resource Utilization

Energy Consumption
n (J)
3.5

3.45

7000
6000

3

4495

5000

2.5

4000

2

3000

1.5

2000
1000

1

0.5

7521
7521

8000

4

0.42

0.34

0.2

0.12

0.103

2990
2044

2896

3107

2943

2599
1925

1877
1069

2084

2523

4875

2772

1585

0

0.08

0.056

0

RV32I RV32IMCRV32IMC TTA-P1 TTA-P2 TTA-P3 TTA-P4 TTA-P5
(Potato) (Ibex) (Pulpino)

RV32I RV32IMC RV32IMC TTA-P1
(Potato)
(Ibex) (Pulpino)

LUT

TTA-P2

TTA-P3

TTA-P4

TTA-P5

SREG

Figure 8. Comparison of energy consumption and resource utilization.

options and extend the comparisons in our next study. Another factor, which is likely to affect the results,
is the implementation way of the NTRU algorithm in software. For example, using different multiplication or
division techniques can seriously improve performance while reducing the energy consumption. These kinds of
analyzes are thought to be done in future studies. In addition, other alternative processor architectures should
be compared to determine the best option even if it may not be possible for all criteria.
7. Conclusion
NTRU is one of the most important candidate for quantum-resistant public key cryptography. Thus, the most
eﬀicient architectures for such algorithms should be investigated in order to construct secure communication
systems both today and in the future. In this work, we developed a public key cryptosystem application based on
NTRU algorithm especially suitable for light-weight devices. Also, we designed five TTA processors to compare
them with tree different RISC-V counterparts for the application. We implemented our designs on the same
FPGA and tried to establish an equitable environment to be able to reach consistent findings. The comparison
is made in terms of area, performance, power, and energy consumption. The other purpose of this study is
to analyse the capabilities of TTA processors on lattice-based post-quantum cryptography applications. Based
on the test results given in the previous section, we think that TTA processors offer considerable potential
for lattice-based cryptography algorithms such as NTRU. They seem to be very advantageous compared to
RISC-V alternatives especially in terms of performance and energy consumption. However, we still think that
331

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

more comparisons should be made with different scenarios, and we plan to do so for the future work.
References
[1] Stallings W. Cryptography and Network Security: Principles and Practice. Upper Saddle River, NJ, USA: Pearson,
2017.
[2] Bhanot R, Rahul H. A review and comparative analysis of various encryption algorithms. International Journal of
21 Security and Its Applications 9.4 2015; 9: 289-306.
[3] Bos J, Kaihara M, Kleinjung T, Lenstra A, Montgomery P. On the security of 1024-bit RSA and 160-bit elliptic
curve cryptography. IACR Cryptology ePrint Archive 2009; 389.
[4] Bernstein DJ. Introduction to post-quantum cryptography. In: Bernstein Daniel J, Buchmann J, Dahmen E
(editors). Post-Quantum Cryptography. Heidelberg, Berlin: Springer, 2009, pp. 1-14.
[5] Hoffstein J, Pipher J, Joseph H S. NTRU: A ring-based public key ryptosystem. In: Buhler J.P. (editor) Algorithmic
Number Theory. ANTS. Lecture Notes in Computer Science, vol 1423. Heidelberg, Berlin: Springer, 1998, pp. 267288.
[6] Yan S Y. Integer factorization and discrete logarithms. In: Yan Song Y (editor). Primality Testing and Integer
Factorization in Public-Key Cryptography. Boston, MA, USA: Springer, 2009, pp. 209-285.
[7] Waterman A, Lee Y, Patterson D.A, Asanovi K. The RISC-V Instruction Set Manual, Volume I: Base User-Level
ISA. Department of Electrical Engineering and Computer Sciences University of Berkeley at California, Technical
Report No. UCB/EECS-2014-54. California, USA: 2014.
[8] Corporaal H. Design of transport triggered architectures. In: Proceedings of 4th Great Lakes Symposium on VLSI.
Notre Dame, USA; IEEE, 1994. pp. 130-135. doi: 10.1109/GLSV.1994.289981.
[9] Alexandru N, Joseph A. F. Measuring the parallelism available for very long instruction word architectures. In:
IEEE Transactions on Computers, vol. C-33, no. 11, 1984. 968-976 doi:10.1109/TC.1984.1676371
[10] Jääskeläinen P, Tervo A, Vayá G. P, Viitanen T, Behmann N, et al. Transport-triggered oft cores. In: 2018 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Vancouver, BC; IEEE, 2018.
doi: 10.1109/IPDPSW.2018.00022
[11] Yifan H, She D, Mesman B, Corporaal H. MOVE-Pro: A low power and high code density TTA architecture. In:
2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS.
IEEE, 2011. pp. 294-301. doi: 10.1109/SAMOS.2011.6045474
[12] Kaps J. Cryptography for Ultra-Low Power Devices. PhD, Worcester Polytechnic Institute, MA, USA, May 2006.
[13] Atici A.C, Batina L, Fan J, Verbauwhede I, Yalcin S.B. Low-cost implementations of NTRU for pervasive security.
In: International Conference on Application-Specific Systems, Architectures and Processors. Leuven; IEEE, 2008.
pp. 79-84. doi: 10.1109/ASAP.2008.4580158.
[14] Hermans J, Vercauteren F, Preneel B. Speed records for NTRU. Cryptographers’ track at the RSA conference.
Lecture Notes in Computer Science, vol 5985; Heidelberg, Berlin, Germany: Springer, 2010, pp.73-88.
[15] Dai W, Doröz Y, Sunar B. Accelerating NTRU based homomorphic encryption using GPUs. In: 2014 IEEE High
Performance Extreme Computing Conference (HPEC). Waltham, MA; IEEE, 2014. pp. 1-6.
[16] Kleinjung T, Aoki K, Franke J, Lenstra A.K, Thomé E et al. Factorization of a 768-bit RSA modulus. In: Annual
Cryptology Conference. Heidelberg, Berlin: Springer, 2010. pp. 333-350.
[17] Peter W. S. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer.
SIAM Journal on Computing (5), 1997. 1484-1509. doi:10.1137/s0097539795293172
[18] Micciancio D. On the hardness of the shortest vector problem. PhD, Massachusetts Institute of Technology, MA,
USA, 1998.

332

AKÇAY and ÖRS/Turk J Elec Eng & Comp Sci

[19] Ajtai M. Generating hard instances of lattice problems. In: Proceedings of the Twenty-eighth Annual ACM
Symposium on Theory of Computing. Philadelphia, Pennsylvania, USA, 1996. pp. 99-108.
[20] Micciancio D. The hardness of the closest vector problem with preprocessing. In: IEEE Transactions on Information
Theory 47.3; 2001. pp. 1212-1215.
[21] Chris P. Public-key cryptosystems from the worst-case shortest vector problem. In: Proceedings of the Forty-first
Annual ACM Symposium on Theory of Computing. ACM, 2009.
[22] O’Rourke C, Sunar B. Achieving NTRU with Montgomery multiplication. In: IEEE Transactions on Computers
52.4, 2003. pp. 440-448.
[23] Anton I, Kyurkchiev N, Asen Rahnev. A note on adaptation of the Knuth ′ s extended Euclidean algorithm for
computing multiplicative inverse. International Journal of Pure and Applied Mathematics 118.2, 2018. 281-290.
doi:10.12732/ijpam.v118i2.13
[24] Akcay L, Tukel M, Ors B. Design and implementation of an OpenRISC system-on-chip with an encryption peripheral. In: IEEE European Conference on Circuit Theory and Design (ECCTD); Catania; 2017. pp. 1-4. doi:
10.1109/ECCTD.2017.8093340
[25] Mäntyneva J. Automated Design Space Exploration of Transport Triggered Architectures. PhD, Tampere University
of Technology, Tampere, Finland, 2009.
[26] Jääskeläinen P, Esko O, Kultala H, Guzma V, Salminen E et al. TTA-based Co-design Environment v1.18 User
Manual. Department of Pervasive Computing, Tampere University of Technology, Finland, 2018.
[27] Traber A, Gautschi M. PULPino: Datasheet. ETH Zurich, University of Bologna, 2017.
[28] Schiavone PD, Conti F, Rossi D, Gautschi M, Pullini A et al. Slow and steady wins the race? A comparison
of ultra-low-power risc-v cores for internet-of-things applications. In: 2017 27th International Symposium on
Power and Timing Modeling, Optimization and Simulation (PATMOS); Thessaloniki, Greece; 2017. pp. 1-8. doi:
10.1109/PATMOS.2017.8106976
[29] Przybus B. Xilinx redefines power, performance, and design productivity with three new 28 nm fpga families:
Virtex-7, kintex-7, and artix-7 devices. Xilinx White Paper WP373 (v1.0), 2010.
[30] Tom F. Vivado design suite. Xilinx White Paper WP416 (v1.1), 2012.

333

