Cycle-Accurate Evaluation of Software-Hardware Co-Design of Decimal
  Computation in RISC-V Ecosystem by Mian, Riaz-ul-haque et al.
Riaz-ul-haque Mian
Graduate School of Information Science
Nara Institute of Science and Technology
Ikoma, Japan
mian.riaz-ul-haque.mn3@is.naist.jp
Michihiro Shintani and Michiko Inoue
Graduate School of Science and Technology
Nara Institute of Science and Technology
Ikoma, Japan
{shintani,kounoe}@is.naist.jp
Abstract—Software-hardware co-design solutions for decimal
computation can provide several Pareto points for the development
of embedded systems in terms of hardware cost and performance.
This paper demonstrates how to accurately evaluate such co-design
solutions using the RISC-V ecosystem. In a software-hardware
co-design solution, part of the solution requires dedicated
hardware. In our evaluation framework, we develop new
decimal-oriented instructions supported by an accelerator. The
framework enables cycle-accurate performance analysis as well as
evaluation of the hardware overhead of co-design solutions for
decimal computation. The obtained performance result is compared
with an estimation based on dummy functions.
Index Terms—RISC-V, RoCC, Hardware accelerator, Rocket
chip, Decimal arithmetic, Decimal multiplication, Evaluation
framework
I. INTRODUCTION
Decimal arithmetic is widely used in financial and scientific
applications. Thus, IEEE 754 (Standard for floating-point
arithmetic) has been revised to include decimal floating-point
formats and operations [1]. Many software (SW) languages
support decimal arithmetic that is realized with binary hardware
units. However, these may not be satisfactory for very large
applications in terms of performance. Many financial
applications need to maintain the quality of their customer
service concurrently with the back-end computing process, where
computing time matters to the business owner.
Decimal arithmetic can be computed in software (arithmetic
with binary hardware units) [2]–[4], in hardware (dedicated
hardware units for decimal floating-point arithmetic) [5]–[8],
or with a combination of both [9]. Software solutions are
flexible and involve no additional hardware cost. Hardware
solutions offer high performance through dedicated decimal units,
but at high hardware cost. Software-hardware co-design solutions
can co-optimize flexibility, performance, and hardware cost, and
give several Pareto points for the development of embedded
systems. In software-hardware co-design solutions, part of the
solution requires dedicated hardware while the other part can be
executed on standard processors supporting binary arithmetic.
However, evaluation of co-design solutions requires special
evaluation environments.
In [9], four software-hardware co-design methods for decimal
multiplication are proposed. The software part is evaluated by
running it on several software platforms with the hardware part
replaced by dummy functions, while the hardware part is evaluated
by designing the hardware with computer-aided design tools. This
environment can only roughly evaluate the total performance as
the execution time of a software program with dummy functions.
To obtain a more accurate evaluation, an integrated environment
with dedicated hardware design, a software platform, and the
interface between them is required. An open-source processor
such as the UltraSparc T2 architecture [10] from Oracle/Sun (the
first 64-bit microprocessors open-sourced) with the standard
SPARC instruction set architecture [11] can be used for such
evaluation. However, it requires not only adding new decimal
floating-point units and new instructions for them but also
software tools to generate and simulate binary code for the
new architecture. The SPARC V9 architecture provides IMPDEP1 and
IMPDEP2 (Implementation-Dependent Instructions 1 and 2), which
can be used for new custom instructions.
In this work, we develop an evaluation framework for
software-hardware co-design of decimal computation using the
RISC-V ecosystem [12]. The RISC-V ecosystem is an open-source
environment including the RISC-V ISA, Rocket chip (one
hardware implementation of RISC-V), RoCC (Rocket custom
co-processor, the Rocket chip interface to support accelerators),
several languages for software and hardware, and several tools
for verification and evaluation. In the proposed framework, a
co-design solution is realized as software that uses new
decimal-oriented instructions, and the new instructions are
supported by a dedicated accelerator. Cycle-accurate analysis
is given by emulating the RISC-V binary on Rocket chip with the
dedicated accelerator.
The key contributions of this paper are listed below:
1) Development of an evaluation framework for software-
hardware co-design solutions of decimal computation.
arXiv:2003.05315v1 [cs.AR] 11 Mar 2020
[Figure 1: flowchart. Inputs X and Y are checked for special
values; if either is special, Result = special value. Otherwise:
Sign Result = Xs ⊕ Ys; Temp Exp = Xe + Ye; the coefficients Xc and
Yc are converted to BCD; pp[0] = 0, pp[1] = Xc, and
pp[i+1] = pp[i] + pp[1] while i < 10; for each digit k of Yc
(j = number of digits in Yc, counting down to 0), pp[k] is
decimal-left-shifted and accumulated (Result = Result + pp[k]);
rounding is checked and Exp Result = Temp Exp + rounding digits;
finally sign, exponent, and coefficient are combined into the
result.]
Fig. 1: Flow of Method-1 in [9]. White and gray blocks denote
software and hardware, respectively.
2) Evaluation of software-hardware co-design solution of
decimal multiplication.
3) Open-source project of proposed framework available at
www.decimalarith.info.
The organization of the paper is as follows. In Section II,
decimal multiplication using software-hardware co-design is
discussed. In Section III, an overview of the proposed framework
is presented. The evaluation of decimal multiplication using the
framework is described in Section IV, and the evaluation results
are discussed in Section V. Finally, the conclusion is provided
in Section VI.
II. CO-DESIGN FOR DECIMAL MULTIPLICATION
The decimal floating-point (DFP) number system represents
floating-point numbers using base 10. A number is either finite
or a special value. Every finite number has three integer
parameters: sign, coefficient, and exponent.
An IEEE 754-2008 compliant decimal multiplication proceeds as
follows. First, both operands are checked to determine whether
they are finite numbers or special values such as NaN (not a
number) or infinity. If either operand is a special value, the
general rules for special values apply to the operation;
otherwise, the operands are multiplied with the following
basic steps:
TABLE I: Development environment
Component                  Description
Compiler                   GNU GCC for RISC-V as a cross compiler [14]
ISA simulator              SPIKE [15]
Programming language       High-level: Scala, C++, and C
                           Hardware description language: Chisel [16]
                           Assembly: RISC-V assembly
ISA                        RISC-V [12]
Processor core             Rocket [17]
Decimal software library   decNumber C library [2]
Testing                    Arithmetic operations verification database [18]
• The sign and the initial exponent are calculated by an XOR
of the signs and an addition of the exponents of the
multiplier and multiplicand.
• Coefficient multiplication is performed to produce the
coefficient of the result.
• If the result exceeds the precision, a rounding operation
is applied using one of the various rounding algorithms [4].
• Finally, the exponent is adjusted accordingly.
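The steps above can be sketched in C on an unpacked representation. This is only a minimal sketch under stated assumptions, not the decNumber or Method-1 code: the type dfp_t and the function dfp_mul are hypothetical names, the coefficient is held as a plain binary integer, special values are not handled, the precision is assumed to be 16 digits, and rounding is simplified to truncation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked decimal number:
 * value = (-1)^sign * coeff * 10^exp */
typedef struct {
    int      sign;   /* 0 = positive, 1 = negative */
    uint64_t coeff;  /* integer coefficient */
    int      exp;    /* decimal exponent */
} dfp_t;

/* Largest coefficient for an assumed 16-digit precision. */
#define DFP_MAX_COEFF 9999999999999999ULL

dfp_t dfp_mul(dfp_t x, dfp_t y)
{
    dfp_t r;

    /* Sketch-only restriction: keep the binary product within 64 bits. */
    assert(x.coeff <= 0xFFFFFFFFULL && y.coeff <= 0xFFFFFFFFULL);

    r.sign  = x.sign ^ y.sign;     /* sign: XOR of the operand signs   */
    r.exp   = x.exp + y.exp;       /* initial exponent: sum of both    */
    r.coeff = x.coeff * y.coeff;   /* coefficient multiplication       */

    /* If the result exceeds the precision, drop low-order digits
     * (truncation stands in for a real rounding mode) and adjust
     * the exponent by the number of dropped digits. */
    while (r.coeff > DFP_MAX_COEFF) {
        r.coeff /= 10;
        r.exp   += 1;
    }
    return r;
}
```

For instance, multiplying 1.25 (coefficient 125, exponent -2) by -4 (coefficient 4, exponent 0) yields sign 1, coefficient 500, and exponent -2 in this representation.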
The process of such multiplication can be designed using a
combination of software and hardware. Some solutions have
been proposed in [9]. In this paper, we propose an evaluation
framework and evaluate Method-1 of [9] as an example.
Figure 1 shows the overall flow of Method-1, which is one
of the solutions in [9]. The method requires one BCD-CLA
(BCD carry-lookahead adder) as an accelerator to generate
multiplicand multiples and accumulate partial products. In
addition, no binary conversion is required. To obtain the
product of the coefficients of the multiplicand (Xc) and the
multiplier (Yc), these values are first converted in software
from DPD (densely packed decimal) [13] into BCD (binary-coded
decimal). The DPD coefficient encoding is very close to BCD
and can be easily converted to BCD. The hardware BCD adder is
used to generate the multiplicand multiples 1Xc to 9Xc by adding
Xc repeatedly. Then the final sum is calculated by adding and
shifting the multiplicand multiples according to the digits of
Yc. The exponent of the result is finalized by adding the number
of rounded digits.
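The coefficient multiplication described above can be illustrated in plain C. The sketch below is a software model only: bcd_add stands in for the hardware BCD adder, and the names bcd_add and bcd_mul, the 16-digit packed-BCD width, and the requirement that the product fit in 16 digits are assumptions made for illustration, not details taken from [9].

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for the BCD adder: adds two packed-BCD values
 * (16 digits, 4 bits each). A carry out of the top digit is dropped. */
uint64_t bcd_add(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int shift = 0; shift < 64; shift += 4) {
        unsigned d = (unsigned)((a >> shift) & 0xF)
                   + (unsigned)((b >> shift) & 0xF) + carry;
        carry = d > 9;
        if (carry)
            d -= 10;                 /* decimal correction of the digit */
        sum |= (uint64_t)d << shift;
    }
    return sum;
}

/* Multiply packed-BCD xc by packed-BCD yc using only BCD additions
 * and 4-bit shifts. The product is assumed to fit in 16 digits. */
uint64_t bcd_mul(uint64_t xc, uint64_t yc)
{
    uint64_t mm[10];                 /* multiplicand multiples 0*Xc .. 9*Xc */
    mm[0] = 0;
    mm[1] = xc;
    for (int i = 1; i < 9; i++)
        mm[i + 1] = bcd_add(mm[1], mm[i]);   /* add Xc repeatedly */

    uint64_t product = 0;
    /* Walk the digits of Yc from most to least significant. */
    for (int shift = 60; shift >= 0; shift -= 4) {
        unsigned digit = (unsigned)((yc >> shift) & 0xF);
        assert(digit <= 9);          /* input must be valid BCD */
        product <<= 4;               /* decimal left shift by one digit */
        product = bcd_add(product, mm[digit]);
    }
    return product;
}
```

With this model, bcd_mul(0x123, 0x45) returns 0x5535, matching 123 × 45 = 5535. Shifting the running product before each addition keeps the digit weights aligned without a final correction step.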
III. OVERVIEW OF FRAMEWORK
The proposed framework uses the RISC-V ecosystem, a decimal
C library (decNumber), arithmetic verification test cases, and
our developed set of test programs. All major components
and software tools used to integrate the framework are listed
in Table I. Note that they are fully open-source programs.
In a co-design methodology, it is very important to decide
which functions or operations should have dedicated hardware
and which should remain in software, in order to reduce
hardware overhead and increase computing speed under
several tradeoffs. Many parameters, including the encoding
system (BID, DPD), the internal format of a decimal number, the
base, etc., need to be considered for the design.
To design the framework with software-hardware co-design in
mind, we develop a set of hardware components and software
programs. We include some area-efficient hardware components
along with associated software that supports decimal computing.
Hardware components are realized as a dedicated accelerator.
The RISC-V based Rocket core and the Rocket custom coprocessor
(RoCC) are used in the framework.
[Figure 2: block diagram. Software design (programming model and
instruction definition): algorithm design using the accelerator,
with MACROs and new instructions, compiled by the GCC RISC-V
cross compiler (RISC-V GNU toolchain) into a RISC-V binary.
Accelerator support (architecture of the design): new hardware
design for decimal computing, finite state machine design for
the custom instructions (function design), and the interface to
send and receive data from the accelerator, implemented in
Rocket chip with the new hardware. The binary is emulated and
evaluated on Rocket chip with the RISC-V ISA, using the test
program, the IBM decNumber library, and the test and
verification database, to produce the result.]
Fig. 2: Overview of proposed framework. Gray color indicates our contribution.
The software design may adopt an existing process
from [2], [3], replacing some expensive and suitable portions
with hardware as in [9], or a completely new method
with new instructions can be designed. In our design, we
use base billion and DPD encoding, with BCD-8421 in hardware.
However, the design parameters can be flexibly changed by the
framework user.
In addition to the hardware and the corresponding software, we
also develop a test program generator written in C. The
purpose of the generator is to configure the parameters. To use
the generator, we first set up some mandatory and optional
configurations, including the precision format (double or
quad), the input data type (rounding, overflow, normal,
underflow, etc.), the type of arithmetic operation (addition,
subtraction, multiplication, or any other), the number of
repetitions per calculation, and the output pattern (execution
time or number of cycles). Then the test program to evaluate the
target co-design solution is automatically generated.
IV. EVALUATION FRAMEWORK
The architecture (hardware) and programming model (software)
are described in this section. The overall model of the
proposed framework is depicted in Fig. 2. On the hardware
side, the necessary FSM and hardware description for the
accelerator are designed. Rocket chip is then compiled and built
with the newly generated accelerator, and an executable for the
emulator is generated. On the software side, RISC-V in-line
assembly and C source code are compiled by the RISC-V GCC cross
compiler to generate a RISC-V binary. The binaries are then
simulated by the SPIKE ISA simulator for functional verification.
Next, RISC-V machine code is generated to be executed on the
emulator. Finally, the emulator is executed to obtain the
evaluation output and waveforms. A detailed description of the
architecture (hardware) and programming model (software) is
given below.
[Figure 3: instruction fields, from most to least significant
bit: Function7 (7) | rs2 (5) | rs1 (5) | xd (1) | xs1 (1) |
xs2 (1) | rd (5) | Opcode (7), Custom 0-3.]
Fig. 3: RoCC instruction encoding (number of bits).
TABLE II: List of instructions
Function    Function7  Description
WR          0000000    Write a value to a register in Rocket core
RD          0000001    Read a value from a register in Rocket core
LD          0000010    Load a value from a memory
ACCUM       0000011    Accumulate a value into a register in Rocket core
DEC_CNV     0000110    Convert binary number to corresponding BCD
DEC_MUL     0000111    Multiply two BCD numbers
DEC_ADD     0000100    Add two BCD numbers
DEC_ACCUM   0001000    Accumulate BCD numbers stored in internal registers
A. Architectural Design (Hardware)
To embed decimal arithmetic into a RoCC co-processor, two
major parts are required: an interfacing unit and an executing
unit. RoCC has three default interface signals:
1) Core control (CC): for coordination between an accelerator
and Rocket core.
2) Register mode (Core): for the exchange of data between
an accelerator and Rocket core.
3) Memory mode (Mem): for communication between an
accelerator and the L1 data cache.
TABLE III: RoCC instructions (number of bits)
Instruction  Function7  rs2      rs1      xd    xs1   xs2   rd      Opcode
             [31:25]    [24:20]  [19:15]  [14]  [13]  [12]  [11:7]  [6:0]
             (7)        (5)      (5)      (1)   (1)   (1)   (5)     (7)
CLR_ALL      0000101    00000    00000    0     0     0     00000   0010111
RD           0000010    00000    01011    0     0     1     00000   0010111
WR           0000000    00000    01011    1     0     0     00000   0010111
DEC_ADD      0000100    01010    01011    1     1     1     01100   0010111
rs2 and rs1 are the source fields, xd, xs1, and xs2 select the
addressing mode, rd is the destination field, and the opcode is
fixed (Custom-0).
Besides the default interface, there are some extended RoCC
interfaces, such as a floating-point unit interface for sending
data to and receiving data from an FPU, and a few more. With the
default interface signals, the core and the accelerator
communicate through the core and memory by means of command
(cmd) and response (resp) signals.
Figure 3 shows RoCC 32-bit custom instruction encoding
with the bit length of several parts of the instruction. The
opcode is selected from custom-0 to custom-3 reserved for
custom instructions, and each opcode realizes several func-
tions using function7 bits. Each instruction can have at
most one destination register rd and two source registers rs1
and rs2.
Three flags, xd, xs1, and xs2, specify whether registers in
Rocket core are used as the destination or source registers.
That is, these flags indicate the need for synchronization
between Rocket core and the accelerator. If a flag value is 1, a
register in Rocket core is used: the values are transferred with
the instruction when xs1=“1” or xs2=“1”, and Rocket core
waits for the response when xd=“1”. Otherwise, the
corresponding field specifies the address of a register in the
accelerator's register file, or is not used.
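The field layout described above can be made concrete with a small decoder. The following C sketch is illustrative only: the names rocc_fields_t, rocc_decode, and rocc_decode_example are hypothetical, and the bit positions follow the field order and widths of Fig. 3 (Function7, rs2, rs1, xd, xs1, xs2, rd, opcode from bit 31 down to bit 0). The example word 0x08A5F617 is the DEC_ADD encoding quoted in Section IV-B.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of a 32-bit RoCC instruction. */
typedef struct {
    unsigned funct7;  /* bits 31..25: selects the accelerator function */
    unsigned rs2;     /* bits 24..20: second source field              */
    unsigned rs1;     /* bits 19..15: first source field               */
    unsigned xd;      /* bit 14: core waits for a response in rd       */
    unsigned xs1;     /* bit 13: rs1 names a Rocket core register      */
    unsigned xs2;     /* bit 12: rs2 names a Rocket core register      */
    unsigned rd;      /* bits 11..7: destination field                 */
    unsigned opcode;  /* bits 6..0: custom opcode                      */
} rocc_fields_t;

rocc_fields_t rocc_decode(uint32_t inst)
{
    rocc_fields_t f;
    f.funct7 = (inst >> 25) & 0x7F;
    f.rs2    = (inst >> 20) & 0x1F;
    f.rs1    = (inst >> 15) & 0x1F;
    f.xd     = (inst >> 14) & 0x1;
    f.xs1    = (inst >> 13) & 0x1;
    f.xs2    = (inst >> 12) & 0x1;
    f.rd     = (inst >> 7)  & 0x1F;
    f.opcode = inst & 0x7F;
    return f;
}

/* Decode the DEC_ADD word and check it against the DEC_ADD row of Table III. */
void rocc_decode_example(void)
{
    rocc_fields_t f = rocc_decode(0x08A5F617u);
    assert(f.funct7 == 4);                       /* DEC_ADD = 0000100 */
    assert(f.rs2 == 10 && f.rs1 == 11 && f.rd == 12);
    assert(f.xd == 1 && f.xs1 == 1 && f.xs2 == 1);
}
```

Because all three flags are set in this word, both operands come from Rocket core registers and the core waits for the result.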
Table II lists the instructions developed for decimal arithmetic
in the framework. Though we have already developed some
instructions with dedicated hardware, any such hardware
component can be integrated into the design.
The RoCC interface is used to realize communication
between Rocket core and the accelerator or between memory
(cache) and the accelerator through cmd and resp. The
accelerator receives and decodes the commands from Rocket core.
Depending on the value of function7, the corresponding
function is executed. Table III shows some of the new
instructions used in the framework with the corresponding
values, where core internal integer registers 10 and 11 are used
as source registers and 12 is used as the destination register.
Figure 4 shows the high-level architecture of the accelerator,
while Fig. 5 shows the implementation of the finite state
machine required to implement the accelerator.
[Figure 4: block diagram. Rocket Core and the cache connect to
the accelerator through RoCC (cmd, resp, mem req, mem resp). The
accelerator contains a decode and interface FSM, an execution
FSM, control logic, a register set (0, 1, 2, ...), and a BCD ADD
unit with inputs X and Y.]
Fig. 4: Basic block of Method-1 in [9].
[Figure 5: state diagram with states Idle, RD, WR, CLR_ALL,
DEC_ADD, ACCUM, Read Resp, and Write Resp; transitions are
labeled Read, Write, cmd_res, Function_clr, Function_clr_ready,
Function_DECadd, Function_DECADD_ready, and Function_accum.]
Fig. 5: Interface FSM of required function for Method-1 in [9].
For example, when a DEC_ADD instruction is issued, the
accelerator performs BCD (binary-coded decimal) addition of the
two operands and writes the result to the destination register.
The pseudo-code written in Chisel for the instruction DEC_ADD
(function7 = “0000100”) is as follows.
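// funct holds the function7 field of the incoming command
val DEC_ADD = funct === UInt(4)   // function7 = 0000100
// Specific function call: act when a DEC_ADD command is accepted
when (cmd.fire() && DEC_ADD) {
  // Write the BCD sum of X and Y to the register file
  regfile(addr) := CLA(cmd.bits.rs1, cmd.bits.rs2)
}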
When the accelerator receives a DEC_ADD command, it executes
the command by adding the values of the two source registers
rs1 and rs2 and then writes the result to the destination
register rd. Here, CLA (carry-lookahead adder) performs the BCD
add operation. Once the accelerator design is completed, the new
chip is generated with the new accelerator.
B. Programming Model (Software)
A set of software programs is implemented for the main
algorithm, in which dedicated functions are used to call the new
instructions supported by the accelerator.
TABLE IV: Average number of cycles of Method-1 using the framework
                                     SW part  HW part  Total  Speedup
Method-1 [9]                         1013     188      1201   2.73x
Software [2]                         3285     0        3285   -
Method-1 using dummy function [9]    1446     0        1446   2.27x
TABLE V: Evaluation of Method-1 by real implementation
                                     Time (sec)  Speedup
Method-1 using dummy function [9]    589         2.32x
Software [2]                         1367        -
The fragments of the pseudo-code using accelerator are as
follows:
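MM[0] = 0; MM[1] = X;
for (i = 1; i < 9; i = i + 1) {
    // Function call via in-line assembly: MM[i+1] = MM[1] + MM[i]
    DEC_ADD_rocc(MM[i+1], MM[1], MM[i]);
}
product = 0;
for (j = 0; j < 64; j = j + 4) {
    // Assumes MID(Y_64, j, j+4) yields the digits of Y starting
    // from the most significant one
    product <<= 4; // shift one decimal digit before the next add
    DEC_ADD_rocc(product, product, MM[MID(Y_64, j, j+4)]);
}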
The first fragment generates the multiplicand multiples,
and the second fragment accumulates the partial products
to generate the final result (see Fig. 1 in Sec. II).
DEC_ADD_rocc is a function that calls the DEC_ADD instruction.
It takes two BCD inputs and returns the result of the BCD
addition. A set of MACROs and inline assembly programs is used
to define the function DEC_ADD_rocc. In this example code,
MID(A, B, C) is a MACRO that extracts the bits of A within the
range C to B. Both sources are in BCD format, where every four
bits represent one digit; thus each iteration of the for loop
picks one digit of the multiplier and adds the corresponding
multiplicand multiple to build the final product. The function
that calls the DEC_ADD instruction is as follows.
int DEC_ADD_rocc(int a, int b, int c) {
    // Emit the raw 32-bit encoding of the custom DEC_ADD instruction
    asm __volatile__ (".word 0x08A5F617\n");
    return a;
}
In the code, “0x08A5F617” is the hex value of the custom-0
instruction (DEC_ADD), where Rocket internal registers 10
and 11 are used as source registers and 12 is used as the
destination register. In our framework, we also provide a set
of dynamic MACROs to automatically generate the hex value
of the corresponding instruction.
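How such a hex value can be assembled from its fields is sketched below. This is not the framework's actual MACRO set: ROCC_ENCODE, ROCC_OPCODE_TABLE3, and rocc_dec_add_word are hypothetical names, and the opcode constant is simply the bit pattern 0010111 printed in Table III. With the DEC_ADD field values from Table III, the macro reproduces 0x08A5F617.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode bit pattern 0010111 as listed in Table III (assumed here). */
#define ROCC_OPCODE_TABLE3 0x17u

/* Pack the RoCC fields into a 32-bit instruction word:
 * funct7[31:25] rs2[24:20] rs1[19:15] xd[14] xs1[13] xs2[12]
 * rd[11:7] opcode[6:0]. Each field is masked to its width. */
#define ROCC_ENCODE(funct7, rs2, rs1, xd, xs1, xs2, rd, opcode) \
    ((((uint32_t)(funct7) & 0x7Fu) << 25) |                     \
     (((uint32_t)(rs2)    & 0x1Fu) << 20) |                     \
     (((uint32_t)(rs1)    & 0x1Fu) << 15) |                     \
     (((uint32_t)(xd)     & 0x1u)  << 14) |                     \
     (((uint32_t)(xs1)    & 0x1u)  << 13) |                     \
     (((uint32_t)(xs2)    & 0x1u)  << 12) |                     \
     (((uint32_t)(rd)     & 0x1Fu) << 7)  |                     \
     ((uint32_t)(opcode)  & 0x7Fu))

/* DEC_ADD: function7 = 4, sources in registers 10 and 11,
 * destination register 12, all three flags set. */
uint32_t rocc_dec_add_word(void)
{
    uint32_t word = ROCC_ENCODE(4, 10, 11, 1, 1, 1, 12, ROCC_OPCODE_TABLE3);
    assert(word == 0x08A5F617u);
    return word;
}
```

Generating the word from named fields avoids hand-editing hex constants when a different register or function7 value is needed.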
V. EXPERIMENTAL RESULTS
In [9], the performance has been evaluated by replacing the
hardware part with a static function called a dummy function.
The dummy functions have a fixed return type and are designed
according to the algorithm of each method. These dummy functions
are called repeatedly from the software function according to
the method design. This approach is considered to include the
interfacing time between software and hardware. Such an
evaluation process has the following limitations:
1) Computing time is highly dependent on the nature of the
input; for example, a rounding operation takes longer than a
normal operation. However, a dummy function always
returns a fixed value, so the execution may not follow
the expected flow.
2) The cycle time of the processor may not be the same with
and without the new hardware.
TABLE VI: Evaluation of Method-1 using Gem-5 targeting RISC-V ISA
                                     Time (sec)  Speedup
Method-1 using dummy function [9]    0.005443    2.30x
Software [2]                         0.012511    -
Method-1 of [9] with the hardware accelerator is implemented,
and its total number of cycles is evaluated with 8,000 sample
inputs including overflow, underflow, normal, rounding, and
clamping cases. For comparison, the IBM decNumber C library for
double-precision decimal floating-point arithmetic [2] is
examined as a software-based solution, and Method-1 with dummy
functions is also examined. We use the RISC-V RDCYCLE
instruction to count the number of cycles. Table IV summarizes
the result. The result shows that Method-1 with the accelerator
is 2.73 times faster than the software-based solution, while
Method-1 with dummy functions is 2.27 times faster.
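Cycle counting with RDCYCLE can be sketched as follows. This is a minimal sketch rather than the framework's measurement code: read_cycles, average_cycles, and example_workload are hypothetical names, and on non-RISC-V hosts the sketch falls back to the standard clock() function, which reports processor time rather than cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the cycle counter. On a 64-bit RISC-V target this uses the
 * RDCYCLE instruction; elsewhere it falls back to clock(), which is
 * only a stand-in so that the sketch builds on a host machine. */
static inline uint64_t read_cycles(void)
{
#if defined(__riscv) && (__riscv_xlen == 64)
    uint64_t c;
    __asm__ __volatile__("rdcycle %0" : "=r"(c));
    return c;
#else
    return (uint64_t)clock();
#endif
}

/* Average counter difference over reps runs of fn. */
uint64_t average_cycles(void (*fn)(void), unsigned reps)
{
    assert(reps > 0);
    uint64_t total = 0;
    for (unsigned i = 0; i < reps; i++) {
        uint64_t start = read_cycles();
        fn();
        total += read_cycles() - start;
    }
    return total / reps;
}

/* Hypothetical workload standing in for one decimal multiplication. */
void example_workload(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}
```

A call such as average_cycles(example_workload, 8000) would then report an average over 8,000 runs, in the same spirit as the averages in Table IV.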
The result of Table IV is now compared with two other
evaluations that compare the software-based solution (IBM
decNumber C library) with Method-1 using dummy functions.
In Table V, real implementations of these two methods are
executed on Windows 10 (64-bit) on an Intel Core i7 at 2.29 GHz
with 8 GB of RAM. The second evaluation uses the Gem-5 [19]
simulator with AtomicSimpleCPU in system call emulation
(SE) mode. In SE mode, we need to specify the binary file of
Method-1 of [9] to be executed. In this evaluation, we use
RISC-V as the target ISA [12]. Here we also use 8,000 samples,
and the result is summarized in Table VI. From Table IV
through Table VI, the dummy-function-based evaluations in the
three different environments exhibit almost the same speedup
(2.27x, 2.32x, 2.30x). This agreement indicates that our
proposed framework accurately evaluates the cycle counts for the
target program. With the actual accelerator, the proposed
framework reports a 2.73 times improvement in cycle count. This
outcome presents a successful evaluation using the framework.
The proposed framework is designed around a hardware accelerator
attached through the RoCC interface. Such an interface imposes a
latency overhead during data exchange with the CPU because of
the position of the interface in the pipeline. The overhead also
depends on how the core and the interface are handled. Its
impact depends on the frequency of data exchange between the
main core and the accelerator. Also, because of the random cache
replacement policy, the number of cycles measured on Rocket chip
is nondeterministic. However, since the framework is intended
for integrated evaluation with a large number of input samples
and many repetitions, it can show statistically meaningful
results.
VI. CONCLUSION
This paper presents an integrated evaluation framework
using the RISC-V ecosystem, the IBM decNumber library, and a
verification test database together with our developed set of
test programs. The framework is designed to accurately evaluate
software-hardware co-design based decimal computation. A
decimal multiplication based on software-hardware co-design
is implemented in the framework to validate the concept
of combined decimal multiplication with actual results.
The framework can perform both functional and behavioral
evaluation of any such software-hardware co-design of decimal
computation. The framework is an open-source project,
and links to all of the source files are available online at
www.decimalarith.info.
REFERENCES
[1] “IEEE standard for floating-point arithmetic IEEE Std 754-2008,” pp.
1–70, 2008.
[2] “C decNumber Library : access date 2018-01-02,” [Online: http://
speleotrove.com/decimal/decnumber.html].
[3] “IEEE 754-2008 Decimal Floating-Point Compliance
Library,” [Online: https://software.intel.com/en-us/articles/
intel-decimal-floating-point-math-library].
[4] M. Cornea, J. Harrison, C. Anderson, and P. T. P. Tang, “A software
implementation of the IEEE 754R decimal floating-point arithmetic
using the binary encoding format,” IEEE Transactions on Computers,
vol. 58, no. 11, pp. 148–162, 2009.
[5] S. Carlough, A. Collura, S. Mueller, and M. Kroener, “The IBM
zEnterprise-196 decimal floating-point accelerator,” in Proceedings of
IEEE International Symposium on Computer Arithmetic, 2011, pp. 139–
146.
[6] E. M. Schwarz, J. S. Kaepernick, and M. F. Cowlishaw, “The IBM z900
decimal arithmetic unit,” in Asilomar Conference on Signals Systems
and Computers, 2001, pp. 1335–1339.
[7] E. M. Schwarz, J. S. Kapernick, and M. F. Cowlishaw, “Decimal
floating-point support on the IBM system z10 processor,” IBM Journal
of Research and Development, vol. 53, no. 1, pp. 4.1–4.10, 2009.
[8] “Fujitsu’s new generation SPARC64 processor,” [Online:
http://www.fujitsu.com/global/products/computing/servers/unix/
sparc-enterprise/key-reports/featurestory/sparce-feature1209.html].
[9] M. R. ul haque, M. Shintani, and M. Inoue, “Decimal multiplication
using combination of software and hardware,” in Proceedings of IEEE
Asia Pacific Conference on Circuits and Systems, 2018, pp. 239–242.
[10] “OpenSPARC T1,T2 processor design source code, simulation tools,
design verification suites, and other tools under open-source licenses,”
[Online: http://www.opensparc.net].
[11] “Oracle SPARC Architecture,” [Online: https://www.oracle.com/
technetwork/sparc-architecture-2015-2868130.pdf].
[12] “The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA,”
[Online: https://riscv.org/].
[13] M. Cowlishaw, “Densely packed decimal encoding,” IEE Proceedings
- Computers and Digital Techniques, vol. 149, pp. 102–104, 2002.
[14] “GCC, the GNU Compiler Collection,” [Online: https://gcc.gnu.org/].
[15] “Spike RISC-V ISA Simulator,” [Online: https://github.com/riscv/
riscv-isa-sim].
[16] J. Bachrach, H. Vo, B. Richards, Y. Lee, and A. Waterman, “Chisel:
constructing hardware in a scala embedded language,” in Proceedings
of IEEE/ACM Design Automation Conference, 2012, pp. 465–471.
[17] “The Rocket Chip Generator,” [Online: https://riscv.org].
[18] A. A. R. Sayed-Ahmed, H. A. H. Fahmy, and M. Hassan, “Three engines
to solve verification constraints of decimal floating-point operations,” in
Asilomar Conference on Signals Systems and Computers, 2010, pp. 1–4.
[19] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The Gem5
simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2,
pp. 1–7, 2011.
