Abstract-We propose a fast Modular Arithmetic Logic Unit (MALU)
Therefore, in order to minimize the area and the critical path
I. INTRODUCTION
The implementation of fast modular operations is a challenge, especially in Public-Key Cryptography (PKC). Among PKCs, the best-known and most widely used are RSA [1] and ECC [2], [3]. Modular exponentiation in RSA is based on the iterative use of modular multiplications over GF(p). ECC uses modular multiplication and addition over GF(p) and GF(2^m) in its point multiplication. In embedded systems, ECC is regarded as more suitable than RSA because its shorter key lengths give higher performance, lower power consumption, and smaller hardware area. Nevertheless, RSA

The originality of our design is that the MALU can execute modular multiplication and addition/subtraction (MMA/S) in both fields with the same execution time while maintaining high performance. Section II surveys relevant previous work on modular multiplication. Section III describes the specification of the proposed MALU. Section IV introduces a system architecture using the MALU. Section V presents the implementation results on a Xilinx FPGA and compares them with previous work, and Section VI concludes the paper.

The proposed architecture primarily targets a GF(p) Montgomery modular multiplier with digit-serial multiplications (Algorithm I). Four-to-two (4-2) CSAs (Fig. 1-a) or equivalent redundant adders have been used in hardware implementations because they are considered among the best solutions for multi-operand addition, including that of Algorithm I.

Before explaining the general case, the main functionality of the MALU is explained for the case d = 1. In this configuration, each cell is composed of one 4-2 CSA. The 4-2 CSA sums four 1-bit inputs, xy, mn, s, and c, and outputs two bits in the redundant CS-form whose value is 2(C_next) + S_next.
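As an illustration of the d = 1 datapath, the following Python sketch models a row of 4-2 CSA cells at word level and uses it for a bit-serial Montgomery multiplication over GF(p) that keeps the accumulator T in carry-save form. It is a minimal sketch under stated assumptions, not the MALU itself: the function names (csa_3_2, csa_4_2, mont_mul_bit_serial), the decomposition of a 4-2 CSA into two full-adder layers, and the final conditional subtraction are chosen here for illustration and may differ from Algorithm I and Fig. 1.

```python
def csa_3_2(a, b, c):
    """Word-level 3-2 carry-save adder (one full adder per bit position).

    Returns (s, cy) with s + cy == a + b + c; cy is already shifted left.
    """
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def csa_4_2(a, b, c, d):
    """Word-level 4-2 CSA, assumed here to be two cascaded full-adder layers.

    Returns (s, cy) with s + cy == a + b + c + d.
    """
    s1, c1 = csa_3_2(a, b, c)
    return csa_3_2(s1, c1, d)


def mont_mul_bit_serial(x, y, n, k):
    """Bit-serial (d = 1) Montgomery multiplication: x * y * 2^(-k) mod n.

    Assumes n is odd, 0 <= x < 2^k and 0 <= y < n.
    T is held as the carry-save pair (ts, tc), i.e. T = ts + tc.
    """
    ts, tc = 0, 0
    for i in range(k):
        xy = y if (x >> i) & 1 else 0   # partial product x_i * Y
        m = (ts ^ tc ^ xy) & 1          # LSB of T + x_i*Y; n odd, so m*n clears it
        mn = n if m else 0
        s, c = csa_4_2(ts, tc, xy, mn)  # T + x_i*Y + m*N in carry-save form
        # The total is even and c has a zero LSB, so s is even too:
        # shifting both words right by one is an exact division by 2.
        ts, tc = s >> 1, c >> 1
    t = ts + tc                         # single carry-propagate addition at the end
    return t - n if t >= n else t
```

For example, mont_mul_bit_serial(3, 5, 7, 3) evaluates to 1, which matches 3 * 5 * 2^(-3) mod 7. Keeping T as two words avoids a carry-propagate addition in every iteration, which is the usual reason for using redundant adders in such a loop.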
The bit products xy and mn are the main inputs for computing the bit-level step of Montgomery multiplication in Algorithm I, i.e., (T + xy + mn)/2.

The cell, a column of the datapath of the MALU, uses d sets of 4-2 CSAs (Fig. 1-b), i.e., the inputs and outputs of the cell are represented in 2-bit CS-form during the execution. Therefore, the cell needs 2d sets of FAs. If we also want to support GF(2^m), it is necessary to mask the carry signals with a field-
