Improving Performance of Software Implemented Floating Point Addition by Hindborg, Andreas Erik et al.
Improving Performance of Software Implemented Floating Point Addition
Hindborg, Andreas Erik; Passas, Stavros; Karlsson, Sven
Publication date: 2011
Document version: Publisher's PDF, also known as Version of record
Citation (APA):
Hindborg, A. E., Passas, S., & Karlsson, S. (2011). Improving Performance of Software Implemented Floating
Point Addition. Poster session presented at 4th Swedish Workshop on Multicore Computing, Linköping, Sweden.
Improving Performance of Software Implemented
Floating Point Addition
Andreas Erik Hindborg, Stavros Passas and Sven Karlsson
Technical University of Denmark
Motivation
• Multicore processors and systems are often constrained in power and
  hardware resources, so it matters how those resources are spent
• Dedicated hardware for floating point (FP) operations requires valuable
  hardware resources and consumes power
• Accelerators consume valuable chip area and may reduce the overall
  number of cores
• Accelerating FP operations without spending valuable silicon area on
  large accelerators is desirable
Contributions
• We propose simple hardware extensions to an integer processor
  pipeline that enable acceleration of IEEE 754 [4] FP addition operations
• We simulate five core configurations with support for our extensions to
  evaluate their performance behavior
Methodology
• We propose twelve instructions to efficiently implement FP addition
• When executed in sequence, they realize FP addition
• The instructions can be implemented by reusing many of the logic blocks
  found in modern processor cores
• We estimate that only a small amount of additional logic is needed
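For readers unfamiliar with software FP addition, the following is a minimal sketch of IEEE 754 single-precision addition built only from integer operations (shifts, masks, integer add and subtract, compares). It is an illustration under stated assumptions, not the twelve instructions proposed here, which the poster does not enumerate: the stage split (unpack, align, add, normalize, round, pack) is the conventional textbook decomposition, only round-to-nearest-even is handled, and the names soft_fadd, f32_to_bits and bits_to_f32 are hypothetical.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Round a Python float to binary32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def soft_fadd(a: int, b: int) -> int:
    """Add two binary32 bit patterns using integer operations only.

    Illustrative sketch (round-to-nearest-even); not the poster's
    instruction sequence.
    """
    # Stage 1: unpack sign, biased exponent and fraction fields.
    sa, ea, ma = a >> 31, (a >> 23) & 0xFF, a & 0x7FFFFF
    sb, eb, mb = b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

    # Special operands: NaN propagates, inf + (-inf) is NaN.
    if ea == 0xFF or eb == 0xFF:
        if (ea == 0xFF and ma) or (eb == 0xFF and mb):
            return 0x7FC00000
        if ea == eb and sa != sb:
            return 0x7FC00000
        return a if ea == 0xFF else b

    # Make the hidden bit explicit; subnormals use exponent 1 without it.
    if ea:
        ma |= 1 << 23
    else:
        ea = 1
    if eb:
        mb |= 1 << 23
    else:
        eb = 1

    # Append three extra bits (guard, round, sticky) to both significands.
    ma <<= 3
    mb <<= 3

    # Stage 2: align. Keep the larger magnitude in (sa, ea, ma).
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    shift = ea - eb
    sticky = 1 if mb & ((1 << shift) - 1) else 0  # remember lost bits
    mb = (mb >> shift) | sticky

    # Stage 3: add or subtract the aligned significands.
    m = ma + mb if sa == sb else ma - mb
    s, e = sa, ea
    if m == 0:
        return (s << 31) if sa == sb else 0  # exact cancellation gives +0

    # Stage 4: normalize so the leading one sits at bit 26.
    if m >> 27:  # carry out of the addition
        m = (m >> 1) | (m & 1)
        e += 1
    while e > 1 and not (m >> 26):  # cancellation in the subtraction
        m <<= 1
        e -= 1

    # Stage 5: round to nearest, ties to even, dropping the extra bits.
    extra = m & 7
    m >>= 3
    if extra > 4 or (extra == 4 and m & 1):
        m += 1
        if m >> 24:  # rounding overflowed the significand
            m >>= 1
            e += 1

    # Stage 6: pack, handling overflow to infinity and subnormal results.
    if e >= 0xFF:
        return (s << 31) | 0x7F800000
    exp_field = e if m >> 23 else 0
    return (s << 31) | (exp_field << 23) | (m & 0x7FFFFF)
```

For example, bits_to_f32(soft_fadd(f32_to_bits(1.0), f32_to_bits(2.0))) evaluates to 3.0. Each stage maps onto shifters, adders and comparators that an integer pipeline already contains, which is the kind of logic reuse the bullets above refer to.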
Simulation Setup
• We use the cycle-accurate SimpleScalar ARM sim-outorder
  simulator [1]
• For each core configuration we execute the 470.lbm benchmark from
  SPEC CPU2006 [3]
• We simulate the following base configuration:

  Configuration   Superscalar   Memory subsystem
  Config. 1       Yes           Real

• And the following extended configurations:

  Configuration   Description
  Config. 2       Config. 1 with dedicated addressing unit
  Config. 3       Config. 1 with four extended integer units
  Config. 4       Config. 2 with four integer units (one extended)
  Config. 5       Config. 2 with four extended integer units
Simulation Accuracy
• Our approach uses twelve simple instructions to implement a single FP
  addition
• The simulator models these instructions as a single instruction that
  occupies the integer pipeline for twelve cycles
• This captures the effects of resource allocation of the functional unit
• Effects of fetch, decode and commit are not simulated
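The occupancy-based modeling described above can be illustrated with a toy calculation. This is a sketch under loud assumptions and is not the SimpleScalar model used for the results: it assumes a stream of independent FP additions, at most one issue per cycle, and that each addition blocks one extended integer unit for the full twelve cycles. The function name cycles_for_fp_adds and its parameters are hypothetical.

```python
def cycles_for_fp_adds(n_adds: int, extended_units: int, occupancy: int = 12) -> int:
    """Cycles needed to finish n_adds independent FP additions.

    Toy model: each addition occupies one extended integer unit for
    `occupancy` cycles, and at most one addition issues per cycle.
    Fetch, decode and commit effects are ignored.
    """
    free_at = [0] * extended_units  # cycle at which each unit becomes free
    cycle = 0                       # earliest cycle the next add may issue
    for _ in range(n_adds):
        # Pick the unit that becomes free first.
        unit = min(range(extended_units), key=free_at.__getitem__)
        start = max(cycle, free_at[unit])
        free_at[unit] = start + occupancy
        cycle = start + 1
    return max(free_at)
```

Under these assumptions, eight additions take 96 cycles with one extended unit and 27 cycles with four, which shows why the number of extended units matters once every FP addition holds a unit for twelve cycles.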
Results
[Plot: relative slowdown for FP addition (vertical axis, ticks from 2 to 16) for Config. 1 through Config. 5]
Figure: Simulation results for our core configurations. Results shown are relative slowdowns
compared to full hardware support.
• Configuration 1: Superscalar baseline processor with real memories
• Configuration 2: The addition of a dedicated addressing unit increases
  performance by 3.1 %
• Configuration 3: The use of four extended integer units improves
  performance by 13.7 %
• Configuration 4: The addition of four integer units, of which only one is
  extended, increases performance by 0.5 %
• Configuration 5: The use of four extended integer units improves
  performance by 9.9 %
Conclusions
• Our benchmark exhibits a relative slowdown of 3.38 to 15.15 when
  compared to dedicated hardware acceleration
• A pure software implementation leads to relative slowdowns of up to 45.33
• For processors with extra dedicated integer or addressing units,
  performance improves by up to 13.7 % over our base configuration
Future Work
• Develop an actual hardware model for the proposed methods using an HDL
• Validate the cost of the proposed instructions with respect to area and
  power resources
• Extend the current work to include other operations such as division and
  multiplication
Related Work
• Chong and Parameswaran [2] state that when using a pure software FP
  implementation, 90 % of the instructions are FP computations
• Rodolfo et al. [5] show a speedup of 22 when using hardware instead of
  software FP operations
References
[1] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Computer Architecture News, 25:13-25, June 1997.
[2] Y. J. Chong and S. Parameswaran. Custom floating-point unit generation for embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2009.
[3] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Computer Architecture News, 34:1-17, Sept. 2006.
[4] IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2008). IEEE, 3 Park Avenue, New York, NY, USA, Aug. 2008.
[5] T. A. Rodolfo et al. Floating point hardware for embedded processors in FPGAs: Design space exploration for performance and area. In Proceedings of ReConFig, 2009.
DTU Informatics - Technical University of Denmark s062068@student.dtu.dk http://www.imm.dtu.dk
