RVCoreP : An optimized RISC-V soft processor of five-stage pipelining by Miyazaki, Hiromu et al.
arXiv:2002.03568v1 [cs.AR] 10 Feb 2020
IEICE TRANS. INF. & SYST., VOL.Exx–??, NO.xx XXXX 2020
PAPER Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking
RVCoreP : An optimized RISC-V soft processor of five-stage
pipelining
Hiromu MIYAZAKI†a), Student Member, Takuto KANAMORI†b), Md ASHRAFUL ISLAM†c), Nonmembers,
and Kenji KISE†d), Member
SUMMARY RISC-V is a RISC-based, open and royalty-free instruction
set architecture that has been developed since 2010 and can be used for
cost-effective soft processors on FPGAs. The basic 32-bit integer
instruction set in RISC-V is defined as RV32I, which is sufficient to support
an operating system environment and is suitable for embedded systems.
In this paper, we propose an optimized RV32I soft processor named
RVCoreP that adopts five-stage pipelining. The processor applies three
effective optimization methods to improve the operating frequency:
instruction fetch unit optimization including a pipelined branch
prediction mechanism, ALU optimization, and data alignment and
sign-extension optimization for the data memory output. We implement RVCoreP
in Verilog HDL and verify its behavior using Verilog simulation and an
actual Xilinx Artix-7 FPGA board. We evaluate IPC (instructions per
cycle), operating frequency, hardware resource utilization, and processor
performance. The evaluation results show that RVCoreP achieves a
30.0% performance improvement over VexRiscv, a high-performance
open source RV32I processor selected from related works.
key words: soft processor, FPGA, RISC-V, RV32I, Verilog HDL, five-stage
pipelining
1. Introduction
RISC-V [1] is becoming popular as an open and royalty-free
instruction set architecture (ISA) that has been developed
at the University of California, Berkeley since 2010. Like
MicroBlaze [2] and Nios II [3], it can be used for
cost-effective soft processors on FPGAs.
The RISC-V ISA is defined as a basic integer instruction
set and other extended instruction sets, and we can support
the necessary instruction sets according to the application
requirements [4]. The basic 32-bit integer instruction set is defined
as RV32I. Other typical extended instruction sets are defined
as M for integer multiplication and division instructions, F for
single-precision floating-point ones, D for double-precision
floating-point ones, and A for atomic ones. In addition to
these, a 32-bit general-purpose instruction set is defined as
RV32G, the set of RV32I, M, A, F, and D. This is an
instruction set architecture for a broad range of general-purpose
computing systems. RV64G is the 64-bit version of the
general-purpose instruction set.
Among these instruction sets, we focus on RV32I in
this paper because it is sufficient to support an operating
system environment and is suitable for embedded systems. RV32I
can emulate the M, F, and D extensions, and can be
configured with fewer hardware resources than processors
supporting RV32G. Although several soft processors that
support RV32I have been released [5], they are not highly
optimized for FPGAs.
Manuscript received January 7, 2020.
†The authors are with the School of Computing, Tokyo Institute
of Technology.
a) E-mail: miyazaki@arch.cs.titech.ac.jp
b) E-mail: kanamori@arch.cs.titech.ac.jp
c) E-mail: ashraful@arch.cs.titech.ac.jp
d) E-mail: kise@c.titech.ac.jp
DOI: 10.1587/transinf.E0.D.1
In this paper, we propose an RV32I soft processor named
RVCoreP with five-stage pipelining that is
highly optimized for FPGAs. The main contributions of this
paper are as follows.
• We propose an RV32I soft processor with five-stage
pipelining that is highly optimized for FPGAs. To improve
the operating frequency, it applies instruction
fetch unit optimization including a pipelined branch
prediction mechanism, ALU optimization, and data
alignment and sign-extension optimization for the data memory
output.
• We implement the proposal in Verilog HDL and evaluate
IPC (instructions per cycle), operating frequency,
hardware resource utilization, and processor
performance. The evaluation results show that the
proposed processor achieves 30.0% higher performance
than VexRiscv, a high-performance open
source RV32I processor.
2. Related works
Rocket Core [6] is a RISC-V in-order scalar processor
developed by the University of California, Berkeley. It is a
pipelined processor supporting RV32G and RV64G. It
supports privilege levels and has an MMU (memory
management unit) with virtual memory, a data cache,
and a branch prediction unit. Because of this rich
functionality and the difficulty of customizing it, it is not suitable
for embedded systems.
Rocket Core has another drawback. It is written in
Chisel [7], a domain-specific language based on Scala.
Because Chisel is a relatively new hardware description language,
available since 2012, hardware developers who have
not mastered Chisel may find it difficult to change the design.
According to the work [5], Verilog HDL and SystemVerilog are
the dominant languages used to implement the processors,
and they may be the best choice for easy-to-use processor
implementations. Therefore, we implement our processors
in Verilog HDL, a dominant hardware description language.
Copyright © 2020 The Institute of Electronics, Information and Communication Engineers
Fig. 1 A block diagram of a typical five-stage pipelined processor (baseline).
VexRiscv [8] is a RISC-V pipelined soft processor
supporting RV32I. Integer multiplication and division, other
extensions, and an MMU with instruction cache and data
cache can be added as options. In addition, the branch
prediction scheme, the implementation choice of shift instructions,
the data forwarding paths, and so on can be tuned for each
implementation. VexRiscv is written in a new open source
hardware description language called SpinalHDL [9], and
the corresponding RTL description can be generated as a
Verilog HDL file. Since the generated Verilog HDL code is
not hierarchical, debugging and understanding this generated
code is not easy.
VexRiscv won first place in the highest-performance
implementation category of the RISC-V SoftCPU
Contest in 2018 hosted by the RISC-V
Foundation [10]. Therefore, it is a soft processor optimized for
high performance and, as far as we know, the highest-performance
RV32I soft processor available as open source. We
use VexRiscv as the reference for comparison with
our proposed processors.
There are other RISC-V processors for education, such
as riscv-mini [11] and Sodor Processor [12], both
developed by the University of California, Berkeley, and
Clarvi [13], developed by the University of Cambridge.
These educational RISC-V processors are easy to use, but
their performance is not as high as that of VexRiscv because they are
not highly optimized for high performance.
3. Design of a typical five-stage pipelined processor
We design a typical five-stage pipelined processor with
branch prediction with reference to [14], and this design is used as the
baseline for the proposal.
Fig.1 shows a block diagram of the baseline, which consists of
five stages: the instruction fetch stage (If stage),
instruction decode stage (Id stage), instruction execution
stage (Ex stage), memory access stage (Ma stage), and write
back stage (Wb stage).
The green rectangles are registers that are updated at
the positive clock edge. The yellow rectangles are modules
including the memory which is composed of block RAM
on Xilinx FPGA. The gray rectangle is a register file read
asynchronously consisting of 32 registers. The red mod-
ules are an ALU or adders, and the other blue modules are
combinational circuits.
The baseline has an instruction memory named m_imem
shown at the bottom left of the figure, and a data memory
named m_dmem shown at the right of the figure.
The branch prediction scheme is gshare [15], which
contains a branch history register (BHR) named r_BHR, a
pattern history table (PHT) named m_PHT, and a branch target
buffer (BTB) named m_BTB. To mitigate data hazards, the baseline
has two forwarding paths. The red path from the Ma stage to
the Ex stage provides the register value for the next dependent
instruction. Similarly, the blue path provides the register value
from the Wb stage to the Ex stage.
In the If stage, the instruction is fetched from the instruc-
tion memory using the program counter (PC) as an address.
The register for PC named r_pc is updated in every cycle
with the next PC value named w_npc.
There are four candidates for w_npc, listed in
descending priority order. The highest priority one is the
correct PC value named ExMa_pc_true from the Ma stage. The
second priority one is the current PC value from r_pc in
case of pipeline stalling. The third priority one is the branch
target address named w_btb which is output from the BTB.
The lowest priority one is r_pc+4 for the instruction at the
next address.
MIYAZAKI et al.: RVCOREP : AN OPTIMIZED RISC-V SOFT PROCESSOR OF FIVE-STAGE PIPELINING
There are three control signals to select the proper one
among the four candidates. The first signal is named w_bmis,
which indicates whether a branch misprediction has
occurred. The second one is named w_stall, for pipeline stalling
due to the data dependency on a load instruction. The third
one is named w_btkn, which comes from the branch predictor and gives the
prediction result as predicted taken or not taken.
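The priority order described above can be sketched as a chain of conditional operators. This is a minimal illustration rather than the authors' RTL: the module name next_pc_sel and the port list are hypothetical, while the signal names follow the text.

```verilog
// Hypothetical sketch of the baseline next-PC selection.
// The module name and port list are assumptions; signal names follow the text.
module next_pc_sel (
    input  wire [31:0] r_pc,          // current PC
    input  wire [31:0] ExMa_pc_true,  // correct PC from the Ma stage
    input  wire [31:0] w_btb,         // branch target read from the BTB
    input  wire        w_bmis,        // branch misprediction detected
    input  wire        w_stall,       // stall due to load dependency
    input  wire        w_btkn,        // branch predicted taken
    output wire [31:0] w_npc          // next PC
);
    // Descending priority: misprediction > stall > predicted taken > PC + 4.
    assign w_npc = w_bmis  ? ExMa_pc_true :
                   w_stall ? r_pc         :
                   w_btkn  ? w_btb        :
                             r_pc + 32'd4;
endmodule
```

Because all three control signals feed one selection network in front of r_pc, any late-arriving signal among them lengthens this path.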
In the baseline, the path that determines the next PC
value from the four candidates through a multiplexer using
three control signals is the critical path that determines the
maximum operating frequency. The next critical path is the
data path that stores the result computed in the Ex stage by the
ALU, which uses two data forwarding values. Another slow
path is the alignment and sign-extension of the data read
from the data memory in the Ma stage, which is then stored
into the MaWb pipeline register.
In our proposed processor, the operating frequency is
improved by optimizing these critical paths.
4. Design and implementation of RVCoreP
In this section, we propose an optimized RV32I soft pro-
cessor named RVCoreP (RISC-V core pipelined version).
Firstly, we describe three optimization methods. Then, we
describe the design and implementation of our proposal.
4.1 ALU optimization
The data path to store the executed result in ALU using two
data forwarding values to the ExMa pipeline register is the
critical path in the baseline design. To mitigate the delay of
this critical path, we discuss the ALU optimization scheme.
According to the related work [16], an ALU on an FPGA
becomes faster when exclusive OR is used instead of a multiplexer to select
the operation result.
Therefore, in this design of RVCoreP, exclusive OR is used
to select the 32-bit executed result of the ALU.
As mentioned in the related work [17], one-hot
encoding is used instead of the usual binary encoding for the control
signal generation to select the ALU calculation result.
Because only one bit of the bit vector is 1 and the other bits are 0,
the control decisions are determined by the corresponding
flip-flop bit in parallel. Therefore, the proposal adopts
one-hot encoding for the ALU.
Code 1 is a simplified description of a typical
ALU in the baseline, where some operations of RV32I are
excluded. The register named r_rslt holds the executed result of
the ALU. This value is selected by the 3-bit signal named sel,
which is described in a case statement from line 7 to line 16.
Since this description is mapped to hardware as a multiplexer
that selects one from eight values, the result has to pass
through several LUTs, which adds delay.
Code 2 is a simplified description of the optimized
ALU, equivalent to the description in
Code 1. The executed result of the ALU named rslt is selected
from eight values including 0 using exclusive OR on line 12.
Each selected value is determined in advance using small
multiplexers controlled by an 8-bit one-hot encoded selection signal named
sel. Since this scheme can select a value without
using a large multiplexer, this circuit is faster than the typical
one.
Code 1 The simplified description of a typical ALU.
 1 module ALU (in1, in2, sel, rslt);
 2   input wire [31:0] in1, in2;
 3   input wire [2:0] sel;
 4   output wire [31:0] rslt;
 5   reg [31:0] r_rslt;
 6   always @(*) begin
 7     case(sel)
 8       0 : r_rslt = in1 + in2; // add
 9       1 : r_rslt = in1 - in2; // sub
10       2 : r_rslt = in1 ^ in2; // ex-or
11       3 : r_rslt = in1 | in2; // or
12       4 : r_rslt = in1 & in2; // and
13       5 : r_rslt = in1 << in2[4:0]; // shift left
14       6 : r_rslt = in1 >> in2[4:0]; // shift right
15       default : r_rslt = 0;
16     endcase
17   end
18   assign rslt = r_rslt;
19 endmodule
Code 2 The simplified description of the optimized ALU.
 1 module ALU_opt (in1, in2, sel, rslt);
 2   input wire [31:0] in1, in2;
 3   input wire [7:0] sel;
 4   output wire [31:0] rslt;
 5   wire [31:0] w0 = (sel[0]) ? in1 + in2 : 0; // add
 6   wire [31:0] w1 = (sel[1]) ? in1 - in2 : 0; // sub
 7   wire [31:0] w2 = (sel[2]) ? in1 ^ in2 : 0; // ex-or
 8   wire [31:0] w3 = (sel[3]) ? in1 | in2 : 0; // or
 9   wire [31:0] w4 = (sel[4]) ? in1 & in2 : 0; // and
10   wire [31:0] w5 = (sel[5]) ? in1 << in2[4:0] : 0; // shift left
11   wire [31:0] w6 = (sel[6]) ? in1 >> in2[4:0] : 0; // shift right
12   assign rslt = w0 ^ w1 ^ w2 ^ w3 ^ w4 ^ w5 ^ w6;
13 endmodule
A preliminary evaluation of the operating frequency
of the ALU alone, targeting a Xilinx Artix-7 FPGA, showed
that the frequency of the typical ALU was 230MHz while
the frequency of the optimized ALU was 240MHz. This
optimization is therefore expected to improve the operating frequency
of the ALU by about 10MHz.
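The exclusive-OR selection works only if at most one bit of the selection signal is 1, because then all but one of the gated values are 0 and the XOR reduces to the single enabled value. As a minimal sketch of how such a one-hot signal relates to a binary selector, the hypothetical module below (its name and ports are assumptions, not part of the paper) sets bit i of an 8-bit vector when the 3-bit selector equals i. In a pipelined design this conversion would more plausibly be done in the decode stage and the one-hot vector held in a pipeline register, so that it does not add delay in front of the ALU.

```verilog
// Hypothetical sketch: binary-to-one-hot conversion of an ALU select signal.
// Module and port names are assumptions made for illustration.
module sel_onehot (
    input  wire [2:0] sel_bin,  // binary-encoded select (0 to 7)
    output wire [7:0] sel_oh    // one-hot select: only bit sel_bin is 1
);
    assign sel_oh = 8'b0000_0001 << sel_bin;
endmodule
```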
4.2 Alignment and sign-extension optimization
After applying the ALU optimization and the instruction
fetch unit optimization described later, the critical path is the
data memory access followed by the alignment and sign-extension of
the read data, which is computed by the combinational
circuit named align/extend in the Ma stage in Fig.1.
RV32I has five load instructions: load byte
(LB) to load 8-bit signed data, load byte unsigned (LBU)
to load 8-bit unsigned data, load halfword (LH) to load
16-bit signed data, load halfword unsigned (LHU) to load
16-bit unsigned data, and load word (LW) to load 32-bit data.
Therefore, the align/extend unit has to align the loaded data
by shifting it 8, 16, or 24 bits to the right depending on the memory
address and the operation code of the load instruction. Then,
sign-extension or zero-extension is needed for the load byte and
load halfword instructions. Finally, the unit selects the proper
value using a large multiplexer depending on the operation
code of the instruction.
We optimized the alignment and sign-extension using
an approach similar to the ALU optimization, that is,
one-hot encoding and exclusive OR for value selection.
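The paper does not list the optimized align/extend circuit, so the following is only a sketch of how one-hot control and exclusive-OR selection could be applied to it. The module name, port names, and the assignment of one-hot bits to instructions are assumptions; the sketch also assumes little-endian data and naturally aligned accesses.

```verilog
// Hypothetical sketch of an align/extend unit using one-hot control and XOR.
// Names and the one-hot bit assignment are assumptions, not the authors' RTL.
module align_extend_opt (
    input  wire [31:0] data,  // word read from the data memory
    input  wire [1:0]  addr,  // low two bits of the load address
    input  wire [4:0]  ctrl,  // one-hot: {LW, LHU, LH, LBU, LB}
    output wire [31:0] rslt
);
    // Align: shift right by 0, 8, 16, or 24 bits according to the byte offset.
    wire [31:0] w_d = data >> {addr, 3'b000};
    wire [7:0]  w_b = w_d[7:0];
    wire [15:0] w_h = w_d[15:0];
    // Each candidate is forced to 0 unless its one-hot bit is set.
    wire [31:0] w_lb  = ctrl[0] ? {{24{w_b[7]}},  w_b} : 32'd0;  // sign-extend byte
    wire [31:0] w_lbu = ctrl[1] ? {24'd0,         w_b} : 32'd0;  // zero-extend byte
    wire [31:0] w_lh  = ctrl[2] ? {{16{w_h[15]}}, w_h} : 32'd0;  // sign-extend halfword
    wire [31:0] w_lhu = ctrl[3] ? {16'd0,         w_h} : 32'd0;  // zero-extend halfword
    wire [31:0] w_lw  = ctrl[4] ? data                 : 32'd0;  // word, offset assumed 0
    // At most one term is nonzero, so XOR acts as the selector.
    assign rslt = w_lb ^ w_lbu ^ w_lh ^ w_lhu ^ w_lw;
endmodule
```

As in the optimized ALU, the final selection is a flat XOR of pre-gated values instead of a wide multiplexer indexed by the operation code.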
4.3 Instruction fetch unit optimization
We propose the two-stage pipelining of the branch predictor
to improve the operating frequency of the instruction fetch
unit, which contains the critical path on the baseline proces-
sor.
The related works [18], [19] have shown that pipelining
the branch predictor can improve the operating
frequency of a soft processor when a complex branch
predictor is used. A similar approach is applied to the proposed
branch prediction mechanism including gshare and BTB.
Fig.2 (a) shows a block diagram of a general branch
predictor and BTB in the baseline where the prediction is
made in a single cycle. In gshare branch predictor, the index
to access the PHT named m_PHT is obtained by exclusive
OR of PC and BHR named r_BHR. If the fetched instruction
is predicted as a conditional branch instruction using the
BTB, it updates the BHR speculatively using the branch
prediction in the If stage.
The combinational circuit named join shifts the BHR left
by 1 bit and places the branch prediction result in the least
significant bit of the BHR. If the branch prediction missed,
the BHR is updated with the correct branch history. The
combinational circuit named comb, which receives the value
read from the BTB and the value read from the PHT, generates
the branch prediction named w_btkn.
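The index generation and the join circuit can be sketched as follows. This is an illustration under stated assumptions: the module name, the output names, the choice of PC bits, and a history length equal to the index width are all hypothetical. The default width of 13 bits corresponds to the 8,192-entry PHT reported in the evaluation section.

```verilog
// Hypothetical sketch of the gshare index and the speculative BHR update (join).
// The PC bit selection and a BHR as wide as the PHT index are assumptions.
module gshare_idx #(parameter W = 13) (   // 2^13 = 8,192 PHT entries
    input  wire [31:0]  r_pc,        // program counter
    input  wire [W-1:0] r_BHR,       // branch history register
    input  wire         w_btkn,      // prediction for the fetched instruction
    output wire [W-1:0] w_pht_idx,   // index used to read m_PHT
    output wire [W-1:0] w_bhr_next   // BHR value after the speculative update
);
    // RV32I instructions are 4-byte aligned, so the two low PC bits are dropped.
    assign w_pht_idx  = r_pc[W+1:2] ^ r_BHR;
    // join: shift the history left by one bit and append the new prediction.
    assign w_bhr_next = {r_BHR[W-2:0], w_btkn};
endmodule
```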
The critical path of a general branch prediction
mechanism is the red data path in Fig.2 (a), which includes the
access of the BTB, three LUTs, a multiplexer, wiring delays,
and clock skew. In our preliminary evaluation, the access
of the BTB composed of block RAM on a Xilinx Artix-7
FPGA takes about 2.54ns on the red path. Since the access
of one LUT takes about 0.12ns, the access to the three LUTs
necessary on the red path takes about 0.36ns. Also, the
access to a multiplexer implemented using a hard macro takes
about 0.24ns, and the other wiring delays and clock skew take
about 3ns. Adding these delays, the total delay of the path exceeds
6.1ns (165MHz).
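As a worked check of this estimate, the delays quoted above can be summed directly. The symbols are introduced here only for illustration, and the wiring and skew term is approximate, so the result should be read as a rough figure in line with the roughly 6.1ns and 165MHz quoted above.

```latex
\begin{align*}
t_{\mathrm{path}} &\approx \underbrace{2.54}_{\text{BTB}}
  + \underbrace{3 \times 0.12}_{\text{three LUTs}}
  + \underbrace{0.24}_{\text{multiplexer}}
  + \underbrace{3}_{\text{wiring, skew}}
  = 6.14\ \mathrm{ns}, \\
f_{\max} &\approx \frac{1}{t_{\mathrm{path}}}
  = \frac{1}{6.14\ \mathrm{ns}} \approx 163\ \mathrm{MHz}.
\end{align*}
```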
To improve the operating frequency for the proposed
processor, we split this critical path by two registers.
Fig.2 (b) shows the block diagram of the pipelined
gshare and pipelined BTB for RVCoreP. The red critical
path in Fig.2 (a) is divided into three paths by two inserted
registers named r_btb and r_pcx. The data acquired from the
BTB is stored in the register r_btb, and the register r_pcx is
inserted before exclusive OR to generate the index of PHT.
It takes two cycles to determine the value of the next
PC in the instruction fetch stage. In the first cycle, in the
preIf stage, the BTB is accessed and the exclusive OR
that determines the PHT index is computed. In the second
cycle, in the If stage, the value of the next PC is determined
by using the results from the preIf stage, and the instruction
is fetched from the instruction memory.
Fig. 2 A general configuration and the two-stage pipelined one for the
branch prediction mechanism including gshare, BTB and instruction memory.
(a) General branch prediction mechanism (b) Pipelined branch prediction mechanism
Fig. 3 The pipeline diagram of instruction fetching using the pipelined
branch prediction mechanism in RVCoreP. Note in the figure: when the previous PC + 4 ≠ PC,
the branch prediction is invalid.
Fig.3 shows the pipeline diagram of instruction fetching
using the pipelined branch prediction in RVCoreP. The
rectangles labeled preIf represent the processing of the
preIf stage, and the rectangles labeled If represent the
processing of the If stage. Assume that four instructions
are fetched in the order of Inst A, Inst B, Inst M, and Inst N.
Inst A and Inst N are add and sub instructions, and their
instruction addresses are 0x100 and 0x134, respectively. Inst
B and Inst M are beq (branch if equal) and bne (branch if not
equal) instructions, and their instruction addresses are 0x104
and 0x130, respectively. The next PC of Inst B is 0x130
when the branch is taken.
In clock cycle 1, when the value of PC is 0x100, the
If stage for Inst A and the preIf stage for the next instruction
are executed. In the preIf stage, the BTB value and the PHT
index used in the If stage for the next instruction at 0x104 are
prepared by using the current PC value of 0x100.
Fig. 4 The block diagram of the proposed processor named RVCoreP.
In clock cycle 2, when the value of PC is 0x104, the
If stage for Inst B and the preIf stage for the next instruction
are executed. In the If stage for Inst B, the next PC value is
determined by using the values prepared in the preIf stage one
cycle before. In the preIf stage, the BTB value and the PHT
index used in the If stage for the next instruction at 0x108 are
prepared by using the current PC value of 0x104. Since Inst
B is a conditional branch and we assume that it is predicted as
taken, the next PC value is 0x130.
In the clock cycle 3 when the value of PC is 0x130, the
If stage for Inst M and the preIf stage for the next instruction
are executed. In the If stage for Inst M, the next PC value is
determined as well but branch prediction is invalid, because
the value prepared in the preIf stage one cycle before is
for the instruction whose address is 0x108, and this value
cannot be used in the If stage for Inst M whose address is
0x130. Therefore, if the value obtained by adding 4 to the
previous PC does not match the current PC value, the branch
prediction is invalid.
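The validity condition stated above can be sketched as a comparison between the current PC and the PC of the previous cycle plus 4. The module and the register name r_pc_prev are hypothetical, and pipeline stalls are ignored for simplicity.

```verilog
// Hypothetical sketch of the validity check for the pipelined prediction.
// r_pc_prev and w_valid are assumed names; stalls are not modeled.
module pred_valid (
    input  wire        clk,
    input  wire [31:0] r_pc,     // PC of the instruction now in the If stage
    output wire        w_valid   // values prepared in the preIf stage are usable
);
    reg [31:0] r_pc_prev = 32'd0;  // PC that the preIf stage used one cycle before
    always @(posedge clk) r_pc_prev <= r_pc;
    // The preIf stage prepared values for (previous PC + 4); they apply only if
    // the instruction being fetched is actually at that address.
    assign w_valid = (r_pc_prev + 32'd4 == r_pc);
endmodule
```

With the sequence in Fig.3, the check passes for Inst B at 0x104 (fetched after 0x100) and fails for Inst M at 0x130 (fetched after 0x104).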
From the above, the PC value used for BTB access is
the PC value of one cycle earlier, and the PC value used
for PHT access is the value one cycle before the branch
prediction is output. As a result, gshare outputs a prediction
in 2 cycles. The BTB entry is updated using the value obtained
by subtracting 4 from the PC value of the branch instruction.
When updating a PHT entry, we have to keep the PHT index
value used for the prediction and update the PHT entry
using this index when the actual branch outcome becomes
available.
The prediction accuracy might drop slightly because
this optimization makes the prediction and
updates the index with the PC value of one cycle earlier.
4.4 RVCoreP soft processor
Fig.4 shows the block diagram of RVCoreP, which is a
five-stage pipelined processor including an instruction memory, a
data memory, pipelined gshare, and pipelined BTB. The ALU
optimization, the alignment and sign-extension optimization,
and the instruction fetch unit optimization are applied to the
proposal. The unit named ALU_opt in Fig.4 is the optimized
ALU.
The detection timing of the load-use dependency
between a load instruction and the following instruction using
the loaded data is changed from the Id stage on the baseline
in Fig.1 to the If stage, using the combinational circuit named
Load-use. To support the detection, a part of the instruction
decoder named decoder_if is implemented in the If stage.
decoder_if decodes two source registers and one destination
register for one instruction, and generates the write signals
for the register file and data memory. This partial
decoding of the instruction in the If stage allows us to detect data
dependencies, including the load-use dependency, in advance.
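A load-use check of this kind compares the destination register of a load that is one stage ahead with the source registers decoded in the If stage. The sketch below is an assumption-laden illustration: the module name, the input IfId_load, and the output w_luse are hypothetical, while If_rs1, If_rs2, and IfId_rd follow the labels in Fig.4. It also simplifies by treating both source fields as used by every instruction.

```verilog
// Hypothetical sketch of load-use detection in the If stage.
// IfId_load and w_luse are assumed names; both source fields are treated as used.
module load_use (
    input  wire       IfId_load,  // instruction in the Id stage is a load
    input  wire [4:0] IfId_rd,    // destination register of that load
    input  wire [4:0] If_rs1,     // source registers decoded in the If stage
    input  wire [4:0] If_rs2,
    output wire       w_luse      // 1 when the fetched instruction depends on the load
);
    // x0 is hard-wired to zero, so a load into x0 creates no dependency.
    assign w_luse = IfId_load & (IfId_rd != 5'd0) &
                    ((IfId_rd == If_rs1) | (IfId_rd == If_rs2));
endmodule
```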
Fig.5 shows the pipeline diagrams of the proposed
processor. Fig.5 (a) shows the case where the pipeline is flushed
due to a branch prediction miss. In the branch prediction
mechanism, the branch target address from the BTB is used
when the BTB hits and the branch is predicted to be taken.
The correct branch destination address is calculated and the
branch prediction is checked in
the Ex stage, and the results are stored in the ExMa pipeline register. If the
branch instruction is in the Ma stage and the branch
prediction missed, the instructions in the If stage, Id stage, and Ex
stage are flushed, which incurs a 3-cycle penalty.
Fig.5 (b) shows the case where the pipeline stalls due
to the load-use dependency. In that case, the hazard
is avoided by stalling the instruction following the load
instruction. Using decoder_if to partially decode the
instruction in the If stage helps to detect the dependency between
a load instruction in the Id stage and an instruction in the If stage,
and the detection result is stored in the IdEx pipeline
register. If the load instruction is in the Ex stage and there is a
data dependency on the load instruction, this processor inserts a
bubble into the IdEx pipeline register and stalls the instructions in the
If stage and Id stage, which incurs a one-cycle penalty.
Table 1 The evaluation results of IPC and branch prediction accuracy obtained by Verilog simulation.
(hit, miss: number of prediction hits and prediction misses)
              Dhrystone                            Coremark                             Average
Label         IPC    hit      miss    hit rate     IPC    hit      miss     hit rate    IPC
VR-nobp       0.661  N/A      N/A     N/A          0.591  N/A      N/A      N/A         0.626
VR-bp         0.836  146,180  29,452  0.832        0.766  348,010  109,701  0.760       0.801
RVP-simple    0.946  205,127  12,507  0.943        0.828  366,726  91,247   0.801       0.887
RVP-optALU    0.946  205,127  12,507  0.943        0.828  366,726  91,247   0.801       0.887
RVP-optIF     0.935  201,153  16,481  0.924        0.823  363,439  94,534   0.794       0.879
RVP-optALL    0.935  201,153  16,481  0.924        0.823  363,439  94,534   0.794       0.879
Fig. 5 The pipeline diagrams of the proposed processor.
(a) Pipeline flush by the branch misprediction (b) Pipeline stall by the load dependency
5. Verification and evaluation
5.1 Verification
We verified the implemented RTL code by Verilog
simulation. A RISC-V processor simulator named SimRV, which models a conservative
multi-cycle processor and which we implemented in
C++, is used as the reference model.
SimRV outputs the PC value, the executed instruction,
and the 32 values stored in the register file when a RISC-V
program binary is given. By executing the same binary using
SimRV and Verilog simulation of our designed processors,
log files of the same format can be output. We executed
the two benchmark binaries used in the evaluation described
later and compared the log files. We have confirmed that
the values in the two log files match and the programs
execute correctly.
In addition to the verification through simulations, we
verified the behavior of the designed processor using an
FPGA board. The same RISC-V program binary used for
Verilog simulation is executed on an actual Xilinx Artix-7
FPGA board, and we have confirmed that the ASCII
character output of the execution results via serial
communication matches the correct result, and that
the numbers of execution cycles and executed instructions
also match.
5.2 Evaluation environment
We implement four versions of the proposed processor in
Verilog HDL and evaluate them in terms of IPC, operating
frequency, hardware resource utilization, and processor per-
formance. We also make two configurations for VexRiscv
processor that supports RV32I, and use these configura-
tions for comparative evaluation with the proposals. The
code used for RVCoreP is Ver.0.4.5. The code version
of VexRiscv processor used for the evaluation is Spinal-
HDL/VexRiscv@ca228a3 committed on 26 September 2019
in GitHub page [8].
The four versions of RVCoreP are named as follows.
The version that applies all the optimizations described above
is called RVP-optALL, and the simple version that does not
apply any optimizations is defined as RVP-simple. The ver-
sion that applies only the ALU optimization and the align-
ment and sign-extension optimization is specified as RVP-
optALU, and the version that applies only the instruction
fetch unit optimization is defined as RVP-optIF.
For the VexRiscv, VR-nobp denotes the configuration
without the branch prediction, and VR-bp denotes the con-
figuration with the branch prediction. We set the parameters
of VexRiscv as follows to make the configuration as close
as possible to RVCoreP. They are reading the register file
asynchronously, using shift instruction implemented with a
full barrel shifter that performs in one cycle, and utilizing
the data forwarding path.
RVCoreP has a branch prediction mechanism including
a gshare with a PHT of 8,192 entries and a BTB of 512
entries. If the branch prediction is enabled in VexRiscv,
the prediction option DYNAMIC_TARGET of BranchPlugin
is used, and the number of entries in the direct mapped
prediction cache is set to 512 using the
historyRamSizeLog2 parameter.
To execute a RISC-V program with RVCoreP, we
create a system including the proposed processor. This system
includes the proposed processor RVCoreP as shown in Fig.4,
an instruction memory, a data memory, and the modules
for RS-232C serial communication with a communication
buffer. This system reads the RISC-V program binary and
runs it in Verilog simulation. The same program also
runs on a Nexys 4 DDR board with a Xilinx
Artix-7 FPGA [20], which receives the same binary through the
serial communication module. This system can output the
same characters as the simulation by serial communication.
The system consists of 1,487 lines of code, of which
the processor RVCoreP accounts for 832 lines. This system
is used to evaluate IPC, operating frequency, and hardware
resource utilization. By replacing the processor part of this
system with the VexRiscv processor, VexRiscv is evaluated
in the same environment.
Table 2 The evaluation results of frequency, hardware resource utilization, and performance.
              Operating   Slice   Slice             Increase rate  Average  Processor    Normalized
Label         frequency   LUT     register  Slice   of slice       IPC      performance  performance
              (MHz)
VR-nobp       205         936     562       284     1.000          0.626    128.4        1.000
VR-bp         140         944     611       300     1.056          0.801    112.1        0.873
RVP-simple    160         1,020   715       349     1.229          0.887    141.9        1.105
RVP-optALU    170         1,070   730       375     1.320          0.887    150.8        1.174
RVP-optIF     180         1,044   749       390     1.373          0.879    158.2        1.232
RVP-optALL    190         1,073   764       397     1.398          0.879    167.0        1.300
Fig. 6 The IPC for each configuration obtained by Verilog simulation.
IPC is evaluated by Verilog simulation using Dhry-
stone [21] and Coremark [22] as benchmarks. We used
the Dhrystone source code published in riscv-tests [23] and
NUMBER_OF_RUNS was set to 2000. The number of ex-
ecuted instructions for Dhrystone is 909,443. We used the
Coremark source code [24] released for RISC-V and ITER-
ATIONS was set to 2. The number of executed instruc-
tions for Coremark is 1,481,298. The source codes of each
benchmark are compiled by using the RISC-V RV32I cross
compiler. The RISC-V gcc cross compiler version 8.3.0 has
been used, and the used optimization flag was -O2. For
benchmark program simulation, the size of both instruction
memory and data memory was set as 32KB.
The operating frequency and hardware resource utilization are evaluated targeting the Nexys 4 DDR board [20], which has an xc7a100tcsg324-1 FPGA of the Xilinx Artix-7 family. Xilinx Vivado 2017.2 is used for this evaluation. The Flow_PerfOptimized_high strategy is used for logic synthesis, and the Performance_ExplorePostRoutePhysOpt strategy is used for placement and routing. We performed logic synthesis and placement and routing while changing the clock constraint in 5MHz steps. The highest frequency that satisfies the constraints is used as the operating frequency of the processor. For hardware resource evaluation, we used the result of placement and routing at the maximum operating frequency. The size of the instruction memory and the data memory is fixed at 4KB for the evaluation using the FPGA. To stabilize the operating frequency of the evaluated system, placement and routing are performed using only one clock region of the FPGA.
[Figure 7: bar chart; y-axis: operating frequency (0 to 200); x-axis: VR-nobp, VR-bp, RVP-simple, RVP-optALU, RVP-optIF, RVP-optALL.]
Fig. 7 The maximum operating frequency for each configuration on Artix-7 FPGA.
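The frequency search described above can be sketched as a simple sweep. This is a minimal illustration, not the authors' script: the callback `meets_timing`, the search bounds, and the function names are hypothetical, and the text states only the 5MHz step and that the highest passing frequency is kept.

```python
from typing import Callable, Optional


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def max_passing_frequency(
    meets_timing: Callable[[float], bool],
    start_mhz: int = 100,   # hypothetical lower bound of the sweep
    stop_mhz: int = 250,    # hypothetical upper bound of the sweep
    step_mhz: int = 5,      # step size stated in the text
) -> Optional[int]:
    """Return the highest swept frequency whose constraint is met, or None.

    meets_timing(period) is a hypothetical stand-in for one complete
    synthesis and place-and-route run with that clock period constraint.
    """
    best = None
    for freq in range(start_mhz, stop_mhz + 1, step_mhz):
        if meets_timing(period_ns(freq)):
            best = freq
    return best


# Toy stand-in: pretend the design closes timing for periods of 5.2 ns or more.
print(max_passing_frequency(lambda period: period >= 5.2))  # prints 190
```

The sketch tries every step rather than stopping at the first failure; that is a choice of this example, made so that one failed run does not hide a passing result at a higher frequency.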
The processor performance is calculated by multiplying the average IPC by the operating frequency.
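As a worked example of this definition, the sketch below recomputes the performance column of Table 2 from the frequency and average IPC columns. The dictionary layout and function names are illustrative assumptions. Because the table entries are rounded, a recomputed value can differ from the printed one in the last digit (for VR-nobp, 205 × 0.626 gives 128.3 rather than 128.4).

```python
# (operating frequency in MHz, average IPC) per configuration, from Table 2.
TABLE2 = {
    "VR-nobp": (205, 0.626),
    "VR-bp": (140, 0.801),
    "RVP-simple": (160, 0.887),
    "RVP-optALU": (170, 0.887),
    "RVP-optIF": (180, 0.879),
    "RVP-optALL": (190, 0.879),
}


def performance(freq_mhz: float, ipc: float) -> float:
    """Processor performance: average IPC times operating frequency.

    With the frequency in MHz, the result is in millions of
    instructions per second.
    """
    return freq_mhz * ipc


def normalized_performance(table, baseline="VR-nobp"):
    """Performance of every configuration relative to the baseline."""
    base = performance(*table[baseline])
    return {name: performance(*row) / base for name, row in table.items()}


for name, rate in normalized_performance(TABLE2).items():
    print(f"{name:12s} {performance(*TABLE2[name]):6.1f} {rate:.3f}")
```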
5.3 Evaluation results
Table 1 shows the evaluation results of IPC and branch prediction accuracy obtained by Verilog simulation. For each of the two benchmarks, it lists the IPC, the number of prediction hits and misses, and the prediction hit rate, together with the average IPC of the two benchmarks.
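The quantities reported in Table 1 can be derived from simulation counters as in the following sketch. The instruction counts are those given in Sect. 5.2; the cycle, hit, and miss counts are made-up placeholders, not measurements from this paper, and taking the average IPC as the arithmetic mean of the two benchmarks is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    instructions: int  # executed instructions
    cycles: int        # elapsed clock cycles
    hits: int          # correctly predicted branches
    misses: int        # mispredicted branches

    @property
    def ipc(self) -> float:
        """Instructions per cycle."""
        return self.instructions / self.cycles

    @property
    def hit_rate(self) -> float:
        """Fraction of predicted branches that were predicted correctly."""
        return self.hits / (self.hits + self.misses)


# Instruction counts come from the text; cycles, hits, and misses are
# hypothetical placeholders chosen only to make the example runnable.
dhrystone = RunCounters(instructions=909_443, cycles=1_000_000,
                        hits=90_000, misses=10_000)
coremark = RunCounters(instructions=1_481_298, cycles=2_000_000,
                       hits=160_000, misses=40_000)

# Assumed definition: arithmetic mean of the two benchmark IPCs.
average_ipc = (dhrystone.ipc + coremark.ipc) / 2
print(f"{dhrystone.ipc:.3f} {coremark.ipc:.3f} {average_ipc:.3f}")
```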
For each benchmark, the four versions of RVCoreP outperform the two versions of VexRiscv in IPC and branch prediction hit rate, and they also achieve a higher average IPC. Note that the prediction accuracy of the branch predictor drops in RVP-optIF and RVP-optALL due to the pipelined branch prediction. Therefore, the IPC of RVP-simple and RVP-optALU is higher than the IPC of RVP-optIF and RVP-optALL.
Fig.6 shows the IPC for each configuration obtained by
Verilog simulation. The orange bars are used for Dhrystone,
the yellow bars for Coremark, and the green bars for the
average. As a whole, Dhrystone has a simpler program
structure than Coremark and a higher branch prediction hit
rate. Therefore, Dhrystone tends to have a higher value of
IPC than Coremark. From this figure, we confirm that the
four versions of RVCoreP outperform the two versions of
VexRiscv.
Table 2 summarises the evaluation results of operating frequency, hardware resource utilization, and processor performance for Artix-7 FPGA. As for the slice usage shown in the 5th column, VexRiscv is more resource-saving than RVCoreP as a whole. The slice increase rate shown in the 6th column is normalized with VR-nobp as 1. The slice usage of RVP-optALL is 397, which is a 39.8% increase compared to VR-nobp. Since this slice usage is only 2.50% of the 15,850 slices available on the Artix-7 FPGA used, the increase is negligible. 48 LUTs are used as memory (LUTRAM), which are inferred for the register file of the processor. Every configuration except VR-nobp, which has no branch prediction, uses only one block RAM for the branch prediction tables. In all configurations, the 4KB instruction memory and data memory are implemented with two block RAMs.
[Figure 8: bar chart; y-axis: IPC performance rate (0.0 to 1.4); x-axis: VR-nobp, VR-bp, RVP-simple, RVP-optALU, RVP-optIF, RVP-optALL.]
Fig. 8 The processor performance by IPC assuming that the operating frequency is the same where VR-nobp is normalized as 1.
[Figure 9: bar chart; y-axis: processor performance rate (0.0 to 1.4); x-axis: VR-nobp, VR-bp, RVP-simple, RVP-optALU, RVP-optIF, RVP-optALL.]
Fig. 9 The processor performance on Artix-7 FPGA where VR-nobp is normalized as 1.
Fig.7 shows the maximum operating frequency for each configuration on Artix-7 FPGA. VR-nobp has the highest operating frequency, 205MHz. The operating frequency of RVCoreP improves as each optimization is applied, from 160MHz for RVP-simple to the best value of 190MHz for RVP-optALL. Note that among the configurations with branch prediction, RVP-optALL achieves a much higher operating frequency than VR-bp, which runs at 140MHz.
Fig.8 shows the processor performance by IPC assuming that the operating frequency is the same, where VR-nobp is normalized as 1. RVP-simple and RVP-optALU have the highest performance. RVP-optALL achieves 40.3% performance improvement compared to VR-nobp because VR-nobp does not have branch prediction and has low IPC. RVP-optALL achieves 9.74% performance improvement compared to VR-bp. The other configurations of RVCoreP achieve almost the same IPC performance as RVP-optALL.
Fig.9 shows the processor performance on Artix-7 FPGA, where VR-nobp is normalized as 1. This processor performance takes the operating frequency into account, and each value in the graph is the performance relative to VR-nobp. RVP-optALL achieves 30.0% performance improvement compared to VR-nobp, which is the highest-performance configuration of VexRiscv. The other configurations of RVCoreP achieve performance improvements of 10% or more compared to VR-nobp.
The performance ratio of RVP-optALL to RVP-simple is 1.176. Therefore, we achieve 17.6% performance improvement by using the three proposed optimizations.
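The percentages quoted in this section follow from the rounded entries of Table 2, as the short check below illustrates. The helper name and variable names are illustrative. The 40.3% IPC improvement over VR-nobp is left out because the rounded entries (0.879 and 0.626) do not reproduce it to the last digit, presumably because unrounded IPC values were used.

```python
def improvement_percent(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# Average IPC: RVP-optALL against VR-bp.
ipc_gain = improvement_percent(0.879, 0.801)

# Normalized processor performance: RVP-optALL against VR-nobp,
# and RVP-optALL against RVP-simple.
perf_gain = improvement_percent(1.300, 1.000)
opt_gain = improvement_percent(1.300, 1.105)

# Slice usage: RVP-optALL against VR-nobp, and its share of the
# 15,850 slices available on the device.
slice_gain = improvement_percent(397, 284)
slice_share = 397 / 15_850 * 100.0

print(f"IPC gain over VR-bp:        {ipc_gain:.2f}%")
print(f"Performance over VR-nobp:   {perf_gain:.1f}%")
print(f"Performance over RVP-simple: {opt_gain:.1f}%")
print(f"Slice increase:             {slice_gain:.1f}%")
print(f"Share of device slices:     {slice_share:.2f}%")
```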
6. Conclusion
We propose a RISC-V soft processor adopting five-stage pipelining that is highly optimized for FPGAs. In the proposed processor, the instruction fetch unit optimization, the ALU optimization, and the data alignment and sign-extension optimization are applied as effective methods to improve the operating frequency. We implement the proposed processor in Verilog HDL and evaluate its IPC, operating frequency, hardware resource utilization, and processor performance in comparison with the VexRiscv processor.
From the evaluation results, RVP-optALL, the proposed processor with all optimizations applied, achieves 30.0% improvement in processor performance, which takes the operating frequency into account, compared with VR-nobp, the highest-performance configuration of VexRiscv. In addition, the proposed optimization methods achieve 17.6% performance improvement within RVCoreP.
Acknowledgments
This work is supported by JSPS KAKENHI Grant Number
JP16H02794.
References
[1] RISC-V Foundation, “RISC-V | Instruction Set Architecture (ISA).” https://riscv.org/.
[2] Xilinx, MicroBlaze Processor Reference Guide, v2018.2 ed., June 2018.
[3] Intel, Nios II Processor Reference Guide, April 2018.
[4] A. Waterman, Y. Lee, D.A. Patterson, and K. Asanović, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.1,” Tech. Rep. UCB/EECS-2016-118, EECS Department, University of California, Berkeley, May 2016.
[5] R. Höller, D. Haselberger, D. Ballek, et al., “Open-Source RISC-V Processor IP Cores for FPGAs - Overview and Evaluation,” 2019 8th Mediterranean Conference on Embedded Computing (MECO), pp.1–6, June 2019.
[6] K. Asanović, R. Avizienis, J. Bachrach, et al., “The Rocket Chip Generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, Apr 2016.
[7] J. Bachrach, H. Vo, B. Richards, et al., “Chisel: Constructing hardware in a Scala embedded language,” DAC Design Automation Conference 2012, pp.1212–1221, June 2012.
[8] SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation.” https://github.com/SpinalHDL/VexRiscv.
[9] SpinalHDL, “SpinalHDL: An open source high-level hardware description language.” https://github.com/SpinalHDL/SpinalHDL.
[10] RISC-V Foundation, “RISC-V SoftCPU Contest, October 8, 2018.” https://riscv.org/2018/10/risc-v-contest/.
[11] University of California, Berkeley, “riscv-mini: Simple RISC-V 3-stage Pipeline in Chisel.” https://github.com/ucb-bar/riscv-mini.
[12] University of California, Berkeley, “The Sodor Processor: educational microarchitectures for risc-v isa.” https://github.com/ucb-bar/riscv-sodor.
[13] University of Cambridge, “Clarvi: simple RISC-V processor for teaching.” https://github.com/ucam-comparch/clarvi.
[14] D.A. Patterson and J.L. Hennessy, Computer Organization and Design The Hardware / Software Interface, RISC-V Edition, Morgan Kaufmann, 2018.
[15] S. McFarling, “Combining branch predictors,” tech. rep., Technical Report TN-36, Digital Western Research Laboratory, 1993.
[16] P. Metzgen, “A High Performance 32-bit ALU for Programmable Logic,” Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA ’04, New York, NY, USA, pp.61–70, ACM, 2004.
[17] Xilinx, HDL Synthesis for FPGAs Design Guide, 1995.
[18] D.A. Jimenez, “Reconsidering complex branch predictors,” The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.43–52, Feb 2003.
[19] K. Matsui, M. Ashraful Islam, and K. Kise, “An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA,” 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.108–115, Oct 2019.
[20] Digilent, Inc., Nexys 4 DDR Reference Manual, rev.c ed., 2016.
[21] R.P. Weicker, “Dhrystone: A Synthetic Systems Programming Benchmark,” Commun. ACM, vol.27, no.10, pp.1013–1030, Oct. 1984.
[22] EEMBC, “CoreMark | CPU Benchmark – MCU Benchmark.” https://www.eembc.org/coremark/.
[23] RISC-V Foundation, “riscv-tests.” https://github.com/riscv/riscv-tests.
[24] UC Berkeley Architecture Research, “Setup scripts and files needed to compile CoreMark on RISC-V.” https://github.com/riscv-boom/riscv-coremark.
Hiromu Miyazaki received the B.E. degree from the Department of Computer Science, Tokyo Institute of Technology, Japan, in 2019. He is currently a master's course student at the Graduate School of Computing, Tokyo Institute of Technology, Japan. His research interests are computer architecture and FPGA computing. He is a student member of IEICE.
Takuto Kanamori is currently a bachelor course student at the School of Computing, Tokyo Institute of Technology, Japan. His research interests are computer architecture and FPGA computing.
Md Ashraful Islam graduated from the University of Rajshahi, Bangladesh. He is a first-year doctoral student at the Tokyo Institute of Technology. He has eight years of experience in the semiconductor industry in ASIC and SoC design and verification. His research interests are in computer architecture, especially processor design and memory subsystem design.
Kenji Kise received the B.E. degree from Nagoya University in 1995, and the M.E. and Ph.D. degrees in information engineering from the University of Tokyo in 1997 and 2000, respectively. He is currently an associate professor of the School of Computing, Tokyo Institute of Technology. His research interests include computer architecture and parallel processing. He is a member of ACM, IEEE, IEICE, and IPSJ.
