Abstract. The need to manage and secure a rapidly growing information network has focused increasing attention on smart card technology. Over the past decade, smart cards evolved from offering basic memory to complex systems with chips that incorporate powerful processing units with dedicated peripherals. This evolution enabled a wide range of applications. Smart card applications include financial transactions, e-commerce, physical access control, health and transportation services, and access to such wireless systems as the Global System for Mobile Communications (GSM) and the upcoming Universal Mobile Telecommunications System (UMTS) third-generation mobile phones. Such applications depend on smart cards equipped to perform onboard cryptographic operations such as digital signature, encryption, and authentication. Smart card operating systems (OSs) use these cryptographic features to manage data storage and control access to private information.
Smart Cards as Security Tokens
Essentially, smart cards serve as security tokens by securely storing users' personal data and service providers' private information. The cards interact within a system using special communication interfaces and dedicated protocols. Smart cards provide highly reliable mechanisms for storing, accessing, and using data in nonvolatile memory. Data access control and data management follow a security policy based on cryptographic services and defined for a specific application.
Europe produces most of the smart cards, Asia has adopted them, and US interest in the technology is growing. An inherent symbiosis between hardware and software in smart cards will be central to their continued evolution. Three examples (secure storage, cryptography services, and smart card flexibility) demonstrate this symbiosis.
Secure storage
Sensitive data such as personal information, secret keys, and private application information stored in smart card memory is protected by combined hardware and software mechanisms. The write/store operation is more aggressively protected in smart cards than in any other device. Special onboard security sensors prevent alterations to memory during data storage or reading. In addition, the software includes a backup mechanism in case of card power-down during storage. Access and storage mechanisms can be a combination of typical OS access management and systematic cryptographic verification to authenticate the application or to ensure transaction confidentiality. Smart card hardware features like a hardwired firewall between memory areas can make storage even more secure. Moreover, even if chemical or electrical corruption alters memory, hardware memory integrity checks or a software checksum will detect the alteration. Finally, hardware and software protect against illegal tape-out reading of smart card data. These combined measures offer powerful assurance of data privacy.
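The backup and integrity-check mechanisms described above can be sketched as follows. This is a minimal Python illustration, not an actual card OS routine: the class `SimulatedEeprom`, its one-slot journal, the `fail_midway` flag, and the use of CRC-32 as the checksum are all assumptions made for the example.

```python
import zlib


class SimulatedEeprom:
    """Toy nonvolatile store: a data area plus a one-slot backup journal."""

    def __init__(self):
        self.cells = {}      # address -> (value bytes, CRC-32 of the intended value)
        self.journal = None  # (address, previous record) while a write is pending

    def write(self, address, value, fail_midway=False):
        # Step 1: save the old record so an interrupted write can be rolled back.
        self.journal = (address, self.cells.get(address))
        # Step 2: program the new value. A power cut here leaves a torn record.
        if fail_midway:
            self.cells[address] = (value[: len(value) // 2], zlib.crc32(value))
            return  # simulated power-down: the journal is still set
        self.cells[address] = (value, zlib.crc32(value))
        # Step 3: commit. Clearing the journal marks the write as complete.
        self.journal = None

    def recover(self):
        """Run at power-up: undo any write that never committed."""
        if self.journal is not None:
            address, previous = self.journal
            if previous is None:
                self.cells.pop(address, None)
            else:
                self.cells[address] = previous
            self.journal = None

    def read(self, address):
        value, checksum = self.cells[address]
        # Software integrity check: detects corruption of the stored bytes.
        if zlib.crc32(value) != checksum:
            raise ValueError("memory integrity check failed")
        return value
```

In this sketch a write interrupted between steps 2 and 3 is rolled back by `recover()` at the next power-up, and a record altered after the fact fails the checksum in `read()`.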
Arithmetic computation required for cryptography
To ensure authentication, confidentiality, and integrity through cryptography, smart cards must have enhanced arithmetic computation capabilities. Typical smart cards use secret-key algorithms such as the well-known Data Encryption Standard (DES) [1, 2], the Advanced Encryption Standard (AES) [3], or other proprietary algorithms specified by telecom operators. These algorithms mainly use data substitutions, permutations, compression and table lookups, and Boolean-to-arithmetic conversions. These generally simple operations lead to fast implementations even when performed in a high-level language. The associated keys are short (from 56 to 256 bits) and quite easy to manage. High-end smart cards offer far more powerful cryptographic algorithms, known as public-key algorithms. Examples include RSA [1, 2, 4] for encryption/decryption and digital signatures. These schemes require an arithmetic unit to compute modular multiplication and reduction on large numbers, because the keys are at least 512 bits long and may reach 2,048 bits. With either type of algorithm, chips may have dedicated peripherals such as DES or RSA cryptocoprocessors for efficiency. The OS would use such peripherals, for example, during an authentication scheme, either directly or by adding software to enhance hardware security.
Ciphering a 64-bit message with DES implemented by software takes about 10 ms at 3.57 MHz on a low-end 8-bit smart card CPU, and the code requires at least 1 Kbyte of storage. A hardwired DES implementation takes around 10 µs, with a few bytes for the code. Cryptocoprocessors generally provide faster performance but increase the silicon size and thus affect price. Smart card providers continually seek an efficient balance between speed and security.
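To put the figures above in perspective, the short calculation below converts the quoted timings into clock cycles and a speed ratio. It only restates the numbers given in the text; the constant and function names are illustrative.

```python
CLOCK_HZ = 3_570_000      # external clock quoted in the text (3.57 MHz)
SOFTWARE_DES_US = 10_000  # about 10 ms per 64-bit block in software
HARDWARE_DES_US = 10      # about 10 microseconds per block when hardwired


def cycles(duration_us, clock_hz=CLOCK_HZ):
    """Number of clock cycles elapsed during duration_us microseconds."""
    return duration_us * clock_hz // 1_000_000


software_cycles = cycles(SOFTWARE_DES_US)       # cycles for one 64-bit block
cycles_per_byte = software_cycles // 8          # a block is 8 bytes
speedup = SOFTWARE_DES_US // HARDWARE_DES_US    # wall-clock ratio

print(software_cycles, cycles_per_byte, speedup)  # 35700 4462 1000
```

So the software version spends roughly 35,700 cycles per block, and the hardwired block is about three orders of magnitude faster in elapsed time.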
System flexibility
For 10 years, the chips and the smart card OSs have been proprietary: There is no smart card OS standard. Moreover, constraining standards such as ISO 7816-4 [5], GSM 11.11 [6], or dedicated customer applications drive protocols. It isn't easy to add features or modify existing ones.
The revolution came with open cards, which encompass many applications. Java Cards [7], for example, let a Java virtual machine (JVM) exceed hardware and software limitations. The Java Card supports a subset of the standard Java bytecode (executable code on a JVM). When the card issuer or the card user downloads Java Card applets to a Java Card, the Java Card virtual machine dynamically translates bytecode into native machine language or native machine code sequences. Then the software can run independently of the chip and the card issuer (typically a bank or a telephone company). Because Java security exceeds that of native machine language, Java Cards add flexibility and reliability to applications. Nevertheless, running a JVM requires powerful hardware, a lot of processing time, and a fast internal clock.
Recent Evolution
Over the past five years, research has solved many smart card technical problems. Nevertheless, smart cards operate under the following general constraints:
- a 25-mm² maximum silicon chip size;
- a 3-V to 5-V power supply provided by dedicated contact card readers, and external power supplied by radio waves for noncontact cards;
- a 5-MHz external clock limit (characteristic values are 3.57 MHz or 3.68 MHz); and
- essential tamper resistance that requires the support of various mechanisms.
However, new hardware characteristics and improved OS architectures continue to erode these limitations.
Evolving hardware features
A wide range of smart card chips have reached the market in the past five years. Samsung, Toppan, Toshiba, Emosyn, Datang, and Goldkey offer products that are less expensive and easier to program because they enable fewer features. Principal chip manufacturers such as STMicroelectronics, Infineon, Philips, Hitachi, and NEC, and even newcomers like Atmel and Fujitsu, offer high-end chips with different combinations of memory, technology, and cryptographic and security features. Rather than comparing these chips, we'll focus on how the hardware features evolved to their present state.
Today's cards contain at least 128 Kbytes of ROM, associated with 64 to 128 Kbytes of electrically erasable programmable read-only memory (Eeprom) or flash memory, and 4 to 8 Kbytes of RAM. This compares with 16 Kbytes of ROM, 4 Kbytes of Eeprom, and 256 bytes of RAM offered five years ago. Silicon technology enabled most of this progress by reducing the transistor scale for smart cards from about 1 micron in 1995 to the current 0.18 micron.
Alternatives to traditional contact cards can bypass the power supply and communication limitations. First to appear were contactless chips driven by induction circuitry inside the card's body. Those chips led to applications such as e-ticketing in public transportation [8]. In 2000, contact cards with two special contacts for Universal Serial Bus (USB) ports were demonstrated at the International Forum for Card Technologies and Secure Applications (Cartes 2000). These cards enable high-speed communication with computers and even let smart cards with embedded tamper resistance replace some USB tokens (trusted devices connected to USB ports).
To overcome external clock limitations, chips now run with their own internal clock, independent or not of the external one. Five years ago, chips were using slow external clocks, even for heavy internal computations such as public-key cryptography. Existing smart card terminals weren't able to provide higher frequencies, and modifying all the terminals was inconceivable. This led to a bottleneck. Chips with asynchronous communications provided the solution: The CPU runs its own clock and uses an external clock only for communication. CPUs and their peripherals can now run a 30-MHz internal clock, increasing chip speed by a factor of 6 to 10.
Smart card OSs
Current chips differ considerably among manufacturers. Because dedicated software drivers handle specific hardware mechanisms, OS architecture has become increasingly complex. In addition, time-to-market pressures require portable and more flexible software [5].
Today, smart card OS infrastructures have evolved from chip-dedicated to chipless infrastructures by means of low-level drivers layered by system and then application services. This organization resembles open system interconnect (OSI) layers, as shown in Figure 1. These low-level drivers are usually chip specific but can work with either open or native cards.
As Figure 1 shows, hardware abstraction layers (chip and driver level) come first. Chip drivers of this type are implemented in native machine language. The hardware abstraction layers provide the OS with general low-level services that remain identical regardless of the chip. Usually, chip drivers manage cryptographic peripherals (for example, RSA, DES, or random generator), system peripherals for memory writes/reads/accesses and communication, and security peripherals such as normal-working-condition sensors. Those chip drivers must account for new hardware features dedicated to either security, memory access, or communication. Some features are quite simple. They include
- a universal asynchronous receiver-transmitter (UART) communication driver;
- memory banking, which enhances addressing capacity;
- timers and an interruption handler;
- error correction and scrambling for highly reliable memories; and
- new security sensors.
Other recent hardware features are far more complex because they replace old software services. For example, the memory management unit, or hardwired memory access manager, replaces part of software memory allocation with an added value regarding data access security. This makes the border between chip drivers and the rest of the OS even more difficult to delimit. Designers set up other mechanisms to enhance data storage security and execution efficiency. Examples include memory allocation/deallocation (including garbage collection), memory fragmentation or ciphering, backup mechanisms for power loss, and cache memory to speed up either code execution or data storage. These mechanisms are closely linked to the chip and therefore aren't portable.
The OS for proprietary cards is built over a set of services for data management, access control, and cryptographic services or, for Java Cards, over the virtual machine and a set of application programming interfaces (APIs), as illustrated in Figure 1. Some chips (for example, STMicroelectronics' and Infineon's high-end chips) even contain hardwired Java Card bytecodes. Data management remains OS specific whether the data are objects or files. The OS instantiates its data containers as a dedicated system structure that differs from one card manufacturer to another.
Access management is far more secure in smart cards than in any computer OS. Objects, files, and keys can be protected during read, write, and execution by secret codes or authentication-granted access rights set up by keys used in symmetric cryptography, such as DES, or asymmetric cryptography, such as RSA algorithms.
Because of chip specificity and chip- and OS-dependent mechanisms, a smart card's OS cannot be completely portable even when written in a high-level language. Yet providing interoperability and addressing time-to-market concerns required portable applications. Java Card applets enabled data and service downloads by writing only one code version, regardless of the card provider or the chip manufacturer. This ability distinguishes open cards from native cards. Changes occur mainly in the high-level layers of the OS and in the way the OS manages services during the card's life.
For example, Sun Microsystems enables interoperability by means of a specified set of APIs and bytecode subsets specified for Java Cards. Open cards propose services like any smart card OS but have few commands to direct the card. The Java Card applet enables writing commands that correspond to application requirements. Native cards, on the other hand, propose only a fixed set of commands and allow no other way to access a service.
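The difference between a fixed command set and applet-defined commands can be sketched with a toy dispatcher. The example assumes short-form ISO 7816-4 command APDUs (CLA, INS, P1, P2, optional Lc and data, optional Le); the `NativeCard` and `OpenCard` classes, the `install` method, and the instruction byte 0x50 are hypothetical and do not model the Java Card API.

```python
SW_OK = 0x9000                  # status word: normal processing
SW_INS_NOT_SUPPORTED = 0x6D00   # status word: instruction code not supported


def parse_apdu(raw):
    """Split a short-form command APDU into header fields and command data."""
    if len(raw) < 4:
        raise ValueError("APDU header needs 4 bytes")
    cla, ins, p1, p2 = raw[:4]
    body = raw[4:]
    data = b""
    if len(body) > 1:            # Lc present, followed by data (and maybe Le)
        lc = body[0]
        data = body[1:1 + lc]
        if len(data) != lc:
            raise ValueError("Lc does not match data length")
    return cla, ins, p1, p2, data


class NativeCard:
    """Fixed command set: nothing can be added after issuance."""
    COMMANDS = {0xA4: "SELECT", 0xB0: "READ BINARY"}

    def process(self, raw):
        _, ins, _, _, _ = parse_apdu(raw)
        return SW_OK if ins in self.COMMANDS else SW_INS_NOT_SUPPORTED


class OpenCard(NativeCard):
    """Same base services, plus commands defined by downloaded applets."""

    def __init__(self):
        self.applet_commands = {}

    def install(self, ins, handler):
        self.applet_commands[ins] = handler

    def process(self, raw):
        _, ins, _, _, data = parse_apdu(raw)
        if ins in self.applet_commands:
            return self.applet_commands[ins](data)
        return super().process(raw)
```

Here the native card rejects any instruction outside its table, whereas the open card accepts a new instruction once a handler has been installed for it.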
Since the advent of open cards, smart cards no longer depend on an esoteric community of programmers for application commands. We can even imagine manufacturers providing smart cards with computers, enabling users to easily program or configure their cards as secure, personal, portable objects.
As smart cards assume a growing role in information technology, researchers will increasingly focus on security problems and possible solutions. Technical evolution will continue to blur the border between software and hardware services in the next generation of smart cards.
Future Hardware Architectures
The desire for smart card flexibility highlights an increasing need for fast software implementations. Customers want multiapplication cards, which require secure applet downloads and a VM's ability to secure monitors running applets. Cryptographic algorithms often require extensive computing power, and it's not always possible or desirable to implement them in hardware. Indeed, applications can require onboard implementation of different proprietary cryptographic algorithms specified by various telecom operators. The expense prohibits hardware implementation of all such algorithms. Adapting software countermeasures to thwart potential attacks, rather than designing a new hardware block, would permit greater flexibility.
Flexibility and security demands will certainly lead to improved embedded CPUs, just as market demands resulted in better computer CPUs. Nevertheless, the constraints are far more formidable in an embedded world, where all memories and peripherals must reside on a very small die that needs substantial protection against active and passive attacks.
Using software to obtain speeds comparable to those in the best existing DES implementations, secret-key algorithms [1, 2], or modular multiplication coprocessors (for public-key algorithms) will require additional work. STMicroelectronics' SmartJ is one such effort. SmartJ modifies existing processors by introducing specialized instructions that will speed up smart card software implementations such as cryptographic algorithms, multiapplication OSs, and communications drivers.
32-bit RISC
The European project EP8670, funded by the European Commission through Esprit, introduced 32-bit RISC in a smart card. The project was launched in December 1993 with several European academic and industrial partners. The resulting smart card was based on the ARM7-TDMI RISC core from ARM Ltd., a partner in the project. Critical at that time (1997) was the chip's global die size. The CPU was quite large for the maximum 25-mm² smart card die size. Physical constraints defined in ISO standard 7816 [5] meant that RAM (512 bytes of data) and flash (32 Kbytes for code and data) were limited. These constraints are far less important in 0.18-micron processes. Today's main constraint is overall power consumption for markets like the GSM industry. SIM (subscriber identity module) cards must not consume more than 10 mA at 5 V, 6 mA at 3 V, or 4 mA at 1.8 V, according to GSM specifications [5, 6].
32-bit CPU
Most major manufacturers now offer 32-bit RISC CPUs for state-of-the-art smart cards. These include
- the STMicroelectronics SmartJ ST22, using the company's own 32-bit RISC processor;
- the NEC V-Way family, using a V850 32-bit RISC processor and a SuperMAP cryptocoprocessor; and
- the Infineon SLE88, to be launched in 2002.
Samsung is relatively new to this market but is also introducing a RISC architecture based on the Calm-RISC core, which exists in 16- and 32-bit versions.
Other manufacturers (for example, Philips) haven't migrated directly from 8- to 32-bit-based smart cards. They prefer using their knowledge of 8-bit complex instruction-set computing to create improved 16-bit CPUs like the SmartXA. Atmel uses its own 8-bit RISC CPU and also provides some components containing the ARM 32-bit Thumb core.
The 0.18-micron smart card technology is now very close to that used for standard CPUs. The big challenge for smart card silicon manufacturers is to quickly adapt the analog parts (charge pumps, voltage and current regulators, embedded clock generators, and security sensors) and Eeprom or flash technologies to increasingly smaller digital technologies.
Improved Cryptography
The main cryptographic algorithms already implemented, or nearing implementation, on smart cards include
- Secret-key encryption algorithms like the DES [1, 2] and the AES [3], the new US encryption standard likely to progressively replace the DES.
- Standard public-key cryptography based on modular multiplication, for example, RSA, Diffie-Hellman (DH) [9], or the Digital Signature Standard (DSS) [1].
- Public-key algorithms based on elliptic curves [10, 11]. This is quite novel and not yet extensively used. Two main types of commonly used curves will determine the need for computing power: curves over GF(p) (a Galois field over the prime p) requiring resources similar to those for standard public-key cryptography; and curves over GF(2^n) (a GF over polynomials of size n), for which implementation can be faster than curves over GF(p) because the computations don't require carries (addition/subtraction is an XOR, and multiplication is done without internal carries).
- Hash functions, mainly SHA-1/-2 (Secure Hash Standard) and RIPEMD [1, 2].
- Several secret-key algorithms required by many telecom operators, for which specifications are not public.
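The carry-free arithmetic of GF(2^n) mentioned in the elliptic-curve item above can be illustrated with a few lines of Python. This is a sketch only: field elements are represented as integers whose bits are polynomial coefficients, and the small field GF(2^8) with the AES reduction polynomial 0x11B is used purely so the result can be checked by hand; elliptic-curve fields use a much larger n.

```python
def gf2_add(a, b):
    """Addition (and subtraction) in GF(2^n) is a bitwise XOR: no carries."""
    return a ^ b


def gf2_mul(a, b, modulus, n):
    """Shift-and-XOR multiplication in GF(2^n).

    a and b must be below 2**n; modulus is the degree-n reduction
    polynomial, including its top bit.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # "add" the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a >> n:               # degree reached n: reduce once
            a ^= modulus
    return result


# Check against the worked example in the AES standard (FIPS 197).
print(hex(gf2_mul(0x57, 0x83, 0x11B, 8)))  # 0xc1
```

Every step is a shift or an XOR, which is why such fields map well onto simple hardware or a few dedicated CPU instructions.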
Note that cryptography improvements must be compatible with fast software countermeasures.
There's still room for improving the speed of secret- and public-key cryptographic algorithms on 32-bit CPUs for smart cards.
Secret-key algorithms
In general, secret-key algorithms and the hash functions perform well on 32-bit microprocessors. Speeds range from 100 µs to 2 ms, depending on the algorithm, the CPU, and whether countermeasures are implemented. With countermeasures, speeds are closer to a millisecond. Nevertheless, these implementations can be 10 times slower than hardware coprocessors, so research to improve software implementations continues.
Some interesting secret-key cryptography instructions that improve algorithm speed and code compactness are for
- rotating the content of a register;
- bit permutations, expansions, and substitution, enabled by a configuration register that determines how to place the bits of one register into another register; and
- fast memory accesses required, for example, for DES or AES S-boxes (nonlinear functions mostly implemented in software as lookup tables) randomization.
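The operations listed above are what a software implementation has to emulate when no dedicated instruction exists. The sketch below shows a 32-bit rotation and a table-driven bit permutation or expansion; the `table` argument plays the role of the configuration register, and the function names are illustrative rather than taken from any instruction set.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, r):
    """Rotate a 32-bit word left by r positions."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def permute_bits(x, table):
    """Place source bit table[i] of x into output bit i.

    A table longer than the input width gives an expansion (bits reused);
    a reordering of the same width gives a permutation.
    """
    out = 0
    for dst, src in enumerate(table):
        out |= ((x >> src) & 1) << dst
    return out


print(hex(rotl32(0x80000001, 1)))              # 0x3
print(bin(permute_bits(0b0001, [3, 2, 1, 0])))  # 0b1000 (bit reversal)
print(bin(permute_bits(0b10, [0, 1, 0, 1])))    # 0b1010 (2-bit to 4-bit expansion)
```

In software each output bit costs a shift, a mask, and an OR, which is the overhead a single permutation instruction would remove.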
Public-key algorithms
The challenge in public-key cryptography implementations is to keep security high while reducing execution time and to lower product costs by eliminating the dedicated cryptocoprocessor and using only CPU instructions. The increasing length of RSA keys is also an issue. (For 1995 to 1999 performance comparisons, see Naccache and M'Raïhi [12], and Handschuh and Paillier [13].) Execution time for typical key lengths improved considerably in recent years. Today, a 1,024-bit RSA signature using the Chinese Remainder Theorem [1] takes a maximum of 200 ms for highly secure implementations on high-end smart cards. Onboard key generation for 1,024-bit RSA keys takes less than 10 seconds.
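As an illustration of why the Chinese Remainder Theorem helps, the sketch below signs with two half-size exponentiations and recombines the results. It is a textbook toy with tiny primes and no padding, not a secure or card-grade implementation; the function name is illustrative, and `pow(q, -1, p)` requires Python 3.8 or later.

```python
def rsa_sign_crt(message, p, q, d):
    """Compute message**d mod (p*q) using two half-size exponentiations."""
    dp = d % (p - 1)                 # reduced private exponents
    dq = d % (q - 1)
    q_inv = pow(q, -1, p)            # q^-1 mod p, precomputed on a real card
    sp = pow(message, dp, p)         # exponentiation modulo p
    sq = pow(message, dq, q)         # exponentiation modulo q
    h = (q_inv * (sp - sq)) % p      # Garner recombination
    return sq + h * q


# Toy parameters: n = 61 * 53 = 3233, e = 17, d = 2753.
p, q, e, d = 61, 53, 17, 2753
signature = rsa_sign_crt(65, p, q, d)
assert signature == pow(65, d, p * q)      # same result as the direct method
assert pow(signature, e, p * q) == 65      # verifies with the public exponent
```

Each exponentiation works on operands half the size of the modulus, which is where the speedup over a single full-size exponentiation comes from.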
For greater software flexibility and lower global costs, researchers can eliminate cryptocoprocessors by adapting efficient CPU cores for smart cards, with dedicated instructions for cryptography.
In standard public-key cryptography, fast software implementation is more problematic than for secret-key algorithms. Indeed, public-key algorithms, on average, are at least 100 times slower than secret-key algorithms. We might use software to implement secret-key algorithms on existing low-end 8-bit smart card microprocessors, but not for public-key algorithms, where processing time might be several minutes. Basic operations in secret-key algorithms use simple Boolean and arithmetic instructions available on existing CPUs. However, to improve CPU speed, public-key algorithms require more-specific instructions than those usually available on standard 32-bit RISC chips for smart cards. The "Implementing standard public-key cryptography in smart cards" sidebar explains how we could improve the performance of these processors.
Smart Card Tamper Resistance
Like other security devices, smart cards are subject to various types of security stresses.
Noninvasive intrusion monitors the chip's operation; invasive intrusion physically breaks the smart card chip apart. We mention only the most recent and notable security flaws.
Anderson and Kuhn [14], and Kömmerling and Kuhn [15] provide a more exhaustive list and comprehensive descriptions of attacks on tamper-resistant devices.
Noninvasive intrusion

Several types of noninvasive intrusion have emerged over the past five years.
Side-channel analysis. Retrieving secret information from a smart card through its leakage is called a side-channel analysis. One attack of this type relies on smart card power consumption analysis and may consist of either simple power analysis (SPA) or differential power analysis (DPA) [16]. Another class consists of timing analysis [17].
SPA consists of observing variations in a chip's global power consumption to discover information that can compromise private data. One particular SPA on straightforward RSA cryptographic algorithm implementations looks at the increase in power consumption each time a modular exponentiation (evident even when using a hardware coprocessor) is performed. This lets the intruder deduce bit by bit the secret exponent. The RSA signature, in fact, consists of a modular exponentiation (the exponent being the secret key). In a straightforward square-and-multiply implementation of the exponentiation, each bit of the key determines whether a modular multiplication must be performed. An SPA will generally produce more accurate results if the intruder knows the hardware architecture.
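The leak in a straightforward square-and-multiply exponentiation can be made concrete with a small simulation. In this sketch the string of "S" (square) and "M" (multiply) symbols stands in for what an attacker would read off a power trace; that idealized, noise-free trace and both function names are assumptions made for illustration.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation that also records its operations."""
    result = 1
    trace = []
    for bit in bin(exponent)[2:]:
        result = result * result % modulus      # always square
        trace.append("S")
        if bit == "1":
            result = result * base % modulus    # multiply only when the bit is 1
            trace.append("M")
    return result, "".join(trace)


def recover_exponent(trace):
    """Rebuild the exponent from the operation sequence alone."""
    bits = []
    i = 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == "M":
            bits.append("1")     # square followed by multiply
            i += 2
        else:
            bits.append("0")     # square alone
            i += 1
    return int("".join(bits), 2)


value, trace = square_and_multiply(7, 0b101101, 1009)
print(trace)                     # SMSSMSMSSM
print(recover_exponent(trace))   # 45
```

Because the multiply happens only for 1 bits, the operation sequence is a direct transcript of the secret exponent.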
DPA is more sophisticated though not necessarily more effective than SPA. DPA retrieves information by performing a statistical analysis on power consumption curves for several executions of the same algorithm with different inputs.
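A difference-of-means attack on simulated traces gives a feel for the statistical analysis described above. Everything in this sketch is an assumption for illustration: the leakage model (Hamming weight of an S-box output plus Gaussian noise), the randomly shuffled stand-in S-box, the single-sample "traces," and the function names.

```python
import random


def hamming_weight(x):
    return bin(x).count("1")


def simulate_traces(key, sbox, count, rng):
    """One noisy power sample per execution, for random known plaintext bytes."""
    traces = []
    for _ in range(count):
        plaintext = rng.randrange(256)
        leak = hamming_weight(sbox[plaintext ^ key]) + rng.gauss(0, 1.0)
        traces.append((plaintext, leak))
    return traces


def dpa_recover_key(traces, sbox):
    """Return the key guess whose predicted bit best splits the traces."""
    best_guess, best_score = None, -1.0
    for guess in range(256):
        ones, zeros = [], []
        for plaintext, leak in traces:
            predicted_bit = sbox[plaintext ^ guess] & 1
            (ones if predicted_bit else zeros).append(leak)
        if not ones or not zeros:
            continue
        score = abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
        if score > best_score:
            best_guess, best_score = guess, score
    return best_guess


rng = random.Random(2001)        # fixed seed so the run is repeatable
sbox = list(range(256))
rng.shuffle(sbox)                # stand-in nonlinear table, not a real cipher's
traces = simulate_traces(0x3C, sbox, 2000, rng)
recovered = dpa_recover_key(traces, sbox)
```

For the correct guess the two groups differ by about one unit of power on average, while wrong guesses split the traces almost at random, so the correct key byte stands out even though no single trace reveals it.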
For software implementations developed with an awareness of timing analysis, such intrusions can now be easily avoided. With older implementations, several optimizations implied algorithms with varying timings that depended on the data or the cryptographic keys. Current implementations rely on constant timings or are at least independent of data and secret keys, thereby eliminating the threat of timing analysis.
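One common way to make the operation sequence independent of the key is to perform both the square and the multiply for every bit and select the result arithmetically. The sketch below shows only that structure: Python integers are not constant-time, so this illustrates the idea rather than providing a hardened routine, and the function name and `bit_length` parameter are assumptions.

```python
def square_and_multiply_always(base, exponent, modulus, bit_length):
    """Exponentiation with the same operation count for every exponent."""
    result = 1
    operations = 0
    for i in reversed(range(bit_length)):
        squared = result * result % modulus
        product = squared * base % modulus      # computed even if unused
        bit = (exponent >> i) & 1
        # Arithmetic selection instead of a key-dependent branch.
        result = bit * product + (1 - bit) * squared
        operations += 2
    return result, operations


r1, ops1 = square_and_multiply_always(7, 0b100000, 1009, 6)
r2, ops2 = square_and_multiply_always(7, 0b111111, 1009, 6)
assert ops1 == ops2 == 12        # identical work for sparse and dense exponents
```

The price is the extra multiplication on every 0 bit, which is the kind of performance cost that countermeasures typically carry.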
A new and potentially troublesome side-channel intrusion involves electromagnetic analysis (EMA) [18]. Each part of a silicon chip emits electromagnetic fields indicative of chip activity. EMA can be implemented just like SPA or DPA; only the measured physical quantities differ. EMA has a two-dimensional resolution. Measuring only the local electromagnetic field ignores activity in other parts of the chip, thereby increasing the effectiveness of the intrusion. Power consumption, on the other hand, is global to the smart card. Gaining information only on the chip's global activity makes it more difficult to interpret measurements.
Fault induction. Applying a combination of environmental conditions that cause a smart card chip to produce a wrong computation can result in the leakage of stored or computed confidential information, personal data, or even code. This is called fault induction. Such security stresses may result from unusual values in the power supply, clock frequency and duty cycle, or working temperature. Likewise, ultraviolet lights, lasers, microwaves, ion beams, and so on, may compromise card security.
Nevertheless, because it is better to anticipate than correct errors, security sensors are essential to avoid extensive software and hardware countermeasures.
Invasive attacks
Smart card chips can be physically and irreversibly modified by invasive attacks. Several such intrusions are possible through standard reverse engineering, with particular attention to the location of keys, personal identification numbers, or any other secure information stored in RAM, Eeprom, ROM, or flash memory. Some memories are write protected by burned fuses, which an attack may try to defeat. Invasive attacks can also capture information flowing through buses, registers, the ALU, and any other hardware component. Other intrusions aim at disconnecting or avoiding activation of smart card security sensors.
All these methods apply to more than just smart cards. They can also compromise many tamper-resistant hardware devices that don't necessarily have on-chip protection against intrusions. These are often multichip devices protected only by resin and external detectors or sensors. Smart cards are probably already far more protected than these other tamper-resistant devices.
Software Countermeasures
Currently, software countermeasures in smart cards can protect against most known side-channel analysis and fault induction approaches. Nevertheless, the cost in reduced performance is high. Full protection against fault analysis can require multiple executions of identical computations, thereby doubling performance loss.
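The repeated-execution countermeasure mentioned above can be sketched as a wrapper that runs a computation twice and releases the result only if both runs agree. The helper names (`run_twice`, `make_glitchy`, `FaultDetected`) and the toy checksum are hypothetical; a real card would apply this to a cryptographic operation and might also verify a signature with the public key instead.

```python
class FaultDetected(Exception):
    """Raised when two executions of the same computation disagree."""


def run_twice(compute, *args):
    first = compute(*args)
    second = compute(*args)          # the redundant run doubles the cost
    if first != second:
        raise FaultDetected("results differ; output withheld")
    return first


def make_glitchy(compute, faulty_call):
    """Wrap compute so that its Nth call returns a result with one bit flipped."""
    calls = {"count": 0}

    def wrapper(*args):
        calls["count"] += 1
        result = compute(*args)
        if calls["count"] == faulty_call:
            result ^= 1              # simulated fault induction
        return result

    return wrapper


def checksum(data):
    return sum(data) % 251


assert run_twice(checksum, b"abc") == checksum(b"abc")
```

With `make_glitchy(checksum, 2)` the second run returns a corrupted value, `run_twice` raises `FaultDetected`, and the faulty output never leaves the card.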
With SPA and DPA, we differentiate between protection for public-key and for secret-key algorithms. Boolean masking can efficiently protect secret-key algorithms. A random value (the mask) keeps the message and key hidden not only while they're stored in memory but also during processing by the cryptographic algorithm itself. During the algorithm's nonlinear phases, it is necessary to use randomized S-boxes with random masks [19]. Since the masks must change regularly, it's necessary to recompute new S-boxes on the fly from an initial value in ROM. This operation slows execution, potentially by a factor of three to five.
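A minimal sketch of Boolean masking with a recomputed S-box follows. It assumes an 8-bit S-box and fresh random input and output masks for every call; the function names and the stand-in table are illustrative, and a real implementation would mask the whole cipher state rather than a single lookup.

```python
import secrets


def masked_sbox_table(sbox, mask_in, mask_out):
    """Build S' such that S'[x ^ mask_in] == S[x] ^ mask_out for all x."""
    table = [0] * len(sbox)
    for x in range(len(sbox)):
        table[x ^ mask_in] = sbox[x] ^ mask_out
    return table


def masked_lookup(sbox, plaintext, key):
    """Compute S[plaintext ^ key] without ever handling plaintext ^ key unmasked."""
    mask_in = secrets.randbelow(256)
    mask_out = secrets.randbelow(256)
    table = masked_sbox_table(sbox, mask_in, mask_out)   # recomputed on the fly
    masked_input = (plaintext ^ mask_in) ^ key           # mask applied before the key
    masked_output = table[masked_input]
    return masked_output ^ mask_out                      # unmask only at the very end


# Stand-in 8-bit table (the "initial value in ROM"); not a real cipher's S-box.
ROM_SBOX = [(7 * x + 3) % 256 for x in range(256)]
assert masked_lookup(ROM_SBOX, 0x41, 0x5A) == ROM_SBOX[0x41 ^ 0x5A]
```

Rebuilding the 256-entry table for each new mask is the step that accounts for most of the slowdown.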
For standard public-key cryptography, implementing efficient software countermeasures can be more complex. Indeed, randomizing the message and private key may not be enough to protect against SPA. A careful implementation allowing for the hardware particularities, to reduce detectable power consumption variations, is sometimes necessary.
Hardware Countermeasures
The two types of hardware countermeasures are detectors and active components.
Detectors
Sensors or detectors work continuously in the background when the power supply activates the smart card. They are often implemented to sense changes in normal and ultraviolet light, temperature, voltage, and external clock frequency. Upon detecting abnormal conditions, they often perform hardware reset and flag activations.
Active components
Active protection can be implemented as
- memory encryption so that data is never stored in plaintext;
- a metal grid layer on top of the chip that detects probing and visual analysis;
- an internal voltage and current regulator;
- an internal clock generator with random clock jittering (unstable internal oscillator) to thwart the synchronization of power consumption curves even with a processor working under identical working conditions with the same program and with identical data; or
- additional random processor dummy instructions, cycles, and interruptions that make DPA more difficult by obscuring synchronization between power curves.
Furthermore, scrambled placing and routing can increase the difficulty of reverse engineering and avoid simple line (bus) probing and trivial localization and deactivation of sensors. This means that except for memory, all buses and overall logic (including the CPU) are spread randomly over the chip. However, such placing and routing is difficult to achieve and prohibits deep space optimization.
Adding checksums to memory and providing critical-part redundancy can also deter fault inductions. It may be possible to build detectors able to react to certain fault inductions on parts of the chip. In addition to memory encryption, scrambling or encrypting transmitted data on buses may reduce information leakage.
One of the most promising ways to reduce side-channel analysis is to implement all or part of the smart card chip in dual rail/logic (digital logic in which each bit is coded using two complementary bits), thereby keeping overall power consumption constant to defeat SPA and DPA.
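The property that dual-rail coding relies on can be checked in a few lines: if every logical bit is stored as a complementary pair, every encoded word contains the same number of ones. The encoding chosen here (1 as `10`, 0 as `01`) and the function names are assumptions for illustration; the sketch models data values only, not the transitions of a real dual-rail circuit.

```python
def dual_rail_encode(value, width=8):
    """Encode each bit as a complementary pair: 1 -> 10, 0 -> 01."""
    encoded = 0
    for i in range(width):
        bit = (value >> i) & 1
        pair = 0b10 if bit else 0b01
        encoded |= pair << (2 * i)
    return encoded


def dual_rail_decode(encoded, width=8):
    """Recover the value; pairs 00 and 11 are invalid and signal an error."""
    value = 0
    for i in range(width):
        pair = (encoded >> (2 * i)) & 0b11
        if pair == 0b10:
            value |= 1 << i
        elif pair != 0b01:
            raise ValueError("invalid dual-rail pair")
    return value


weights = {bin(dual_rail_encode(v)).count("1") for v in range(256)}
print(weights)   # {8}: every byte encodes to exactly eight 1 bits
```

Since the Hamming weight no longer depends on the data, a leakage model based on it gives the attacker nothing to correlate against.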
Some smart card manufacturers are investigating asynchronous design technology as an alternative for building smart card chips that resist power analysis. Synchronizing DPA curves on the rising and falling clock edges will be more difficult when any notion of a clock has disappeared. We can expect to see such technology in future smart cards.
Conclusion
The smart card industry is now mature, integrating performance and security features into highly reliable products. The symbiosis between chip and OS makes the card one of the smartest and least expensive of security tokens, enabling it to secure the e-world by protecting users from malicious systems and systems from malicious users. Over the next few years, we will probably see an increasing number of CPUs developed specifically for smart cards, with dedicated, highly secure buses and secure memories.
and a member of the IEEE Computer Society, the IEEE Solid-State Circuits Society, and the International Association for Cryptologic Research.
Nathalie Feyt is a smart card engineer at Gemplus in the Card Security Group, where she also leads the public-key algorithms development team. Her research interests include security and cryptography. She developed the first Gemplus public-key card, included in public-key infrastructure solutions. Feyt has an MS in electronics and computer science from the University of Bordeaux, Ecole Nationale Supérieure d'Electronique et d'Informatique de Bordeaux, France. She is a member of the Java Card Forum.
Direct questions and comments about this article to Jean-François Dhem, Gemplus, R&D-Card Security Group, Parc d'Activites de Gemenos-BP 100, 13881 Gemenos Cedex, France; jean-francois.dhem@gemplus.com.
Implementing Standard Public-Key Cryptography in Smart Cards
Improved performance of standard 32-bit RISC processors for smart cards when performing public-key cryptography software implementations is possible with some additional or modified processor instructions. The basic primitive of standard public-key cryptographic algorithms like RSA, the Digital Signature Standard (DSS), and Diffie-Hellman (DH) is the modular multiplication of large integers (typically 512 to 2,048 bits).
Mathematically, the equation is
where A, B, and N are large integers. To work on 32-bit architectures, these integers must be decomposed into 32-bit blocks (see Figure A), and the full-size computations should be performed by emulation using 32-bit numbers. We can always write integer A as a sum of elementary blocks of size t (of t-bit CPU registers), as shown in Equation 2. The size (in bits) of A (often noted |A|) is equal to tn, where n is the number of t-bit blocks needed to represent A.
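This decomposition provides a way to store these very large numbers in microprocessor memory as an array of n elements of t bits each. Decomposing the modular multiplication in Equation 1 as shown, we obtain

$$A \cdot B \bmod N = \Bigl(\sum_{i=0}^{n-1} A_i \, 2^{t i}\Bigr) B \bmod N \tag{3}$$

$$A \cdot B \bmod N = \Bigl(\bigl(\cdots\bigl((A_{n-1} B \bmod N)\,2^{t} + A_{n-2} B\bigr) \bmod N \cdots\bigr)\,2^{t} + A_0 B\Bigr) \bmod N \tag{4}$$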
The decomposition of Equation 4 isn't the only possibility for implementing the computation. We could, for example, first multiply A by B and then reduce by N. However, smart card RAM is limited in both size and access time, so researchers need ways to minimize memory use without slowing down the computations. Equation 4 minimizes the RAM needed for intermediate computations.
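A minimal sketch of the interleaved approach follows, assuming Python's arbitrary-precision integers stand in for the block arrays and the `%` operator stands in for the quotient-based reduction. The function name `modmul_interleaved` and the `peak_bits` bookkeeping are illustrative additions, not part of the article.

```python
def modmul_interleaved(a: int, b: int, n: int, t: int = 32):
    """Compute (a * b) mod n one t-bit block of a at a time.

    Returns (result, peak_bits), where peak_bits is the bit length of the
    largest unreduced intermediate value seen during the computation.
    """
    if n <= 0:
        raise ValueError("modulus must be positive")
    mask = (1 << t) - 1
    num_blocks = max(1, (a.bit_length() + t - 1) // t)
    temp = 0
    peak_bits = 0
    # Walk the blocks of a from most significant to least significant.
    for i in reversed(range(num_blocks)):
        a_i = (a >> (t * i)) & mask
        unreduced = (temp << t) + a_i * b   # Temp * 2^t + A_i * B
        peak_bits = max(peak_bits, unreduced.bit_length())
        temp = unreduced % n                # reduce before the next step
    return temp, peak_bits
```

With b < n, every unreduced intermediate stays below 2^(t+1) · n, so it needs roughly one block more than the modulus. Multiplying first and reducing afterward would instead require room for the full double-length product.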
Moreover, the equation shows that interleaving the multiplication and reduction phases at each step minimizes memory accesses. Indeed, if we implement this equation iteratively, the basic operation at step i is $\mathrm{Temp} \cdot 2^{t} + (A_i B) \bmod N$. Since the whole computation is done modulo N, we can also write it as $(\mathrm{Temp} \cdot 2^{t} + A_i B) \bmod N$, where we immediately see that it requires computations on at most n + 1 blocks.

We can also represent these intermediary computations graphically, as shown in Figure B, where the computation of $A_i B$ is represented as a sum of products $A_i B_j$. In Figure B, we add the aligned blocks. As Equation 1 shows, we can also represent the modular reduction graphically by a subtraction of QN, decomposed in the same way as $A_i B$. We can replace the subtraction with an addition by replacing N with its two's complement (noted $\bar{N}$). We can evaluate the intermedi...

We can conclude, then, that efficiency requires at least a hardware multiplier. Hardware multipliers on embedded 32-bit RISC processors were initially designed for digital signal processing (DSP) applications. This means the processor has a multiplier with a 64-bit accumulator. In terms of input bits, we can express this as a multiplier of 32 × 32 + 64 bits with a result of 64 bits. A primary disadvantage of this adder, when emulating computations on larger integers, is that we don't have access to the accumulator's carry bit.
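The substitution of an addition for the subtraction of QN, mentioned above, can be checked numerically. The sketch below assumes a fixed register width of `width_bits` bits and discards any overflow beyond that width; the function name and parameters are hypothetical.

```python
def subtract_qn_by_addition(x: int, q: int, n: int, width_bits: int) -> int:
    """Return (x - q*n) modulo 2^width_bits using only additions and multiplications.

    n_bar is the two's complement of n for the given width, so adding
    q * n_bar has the same effect as subtracting q * n once the overflow
    beyond width_bits is discarded.
    """
    modulus = 1 << width_bits
    n_bar = (modulus - n) % modulus   # two's complement of n
    return (x + q * n_bar) % modulus
```

For example, with x = 1000, q = 7, n = 97, and a 16-bit width, the function returns 321, which is exactly 1000 − 7 · 97.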
There are several ways to handle the carries when doing the additions shown in Figure B. Figure C shows the two most relevant ways.

Figure C1 shows in more detail what happens locally in Figure B when adding the intermediary computations with a DSP-like multiplier. In Figure C1 (light gray boxes), $A_i B_j$ is first computed and added to $T_{j+1} 2^{t} + R0 + \mathrm{Carry}^{0}_{j-2}$. Then $Q N_j$ is computed and added at the same time to the result of the previous computation added to $\mathrm{Carry}^{1}_{j-2}$. These two computations generate new values for the carries and a new value of R0 (hashed block in Figure C1) that must be added to the next computations. The 32 × 32 + 64-bit multipliers of embedded RISC processors may not be able to handle these carries.

Figure C2 shows a more suitable rearrangement of the computation. $A_i B_j + R0 + U_j$ (in light gray boxes) is computed first; then $Q N_j + R1$ is computed and added simultaneously to the lower part of the previous computation. The upper parts of the two computations are stored back in R0 and R1. These computations require a multiplier of 32 × 32 + 32 + 32 bits. The method in Figure C2 is better because it doesn't need a carry bit for the multiplier, so the algorithm runs more smoothly without the carry handling that the method in Figure C1 requires. Theorem 1 shows why there is no carry in the method in Figure C2.

Modifying a standard DSP multiplier to meet such a requirement isn't difficult. It doesn't even require an additional large accumulator with larger delays, as in a DSP-like version. In other words, the internal delays of the proposed modified multiplier will be no worse than for an equivalent multiplier of 32 × 32 + 64 bits.
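The carry-free property of the 32 × 32 + 32 + 32 arrangement can be illustrated with a worst-case calculation. This is not the article's Theorem 1, only a standard bound: with every operand at its maximum, (2^t − 1)^2 + 2(2^t − 1) = 2^(2t) − 1, which still fits in 2t bits. The function names below are hypothetical.

```python
T = 32  # register width in bits, as in the article's 32-bit RISC setting


def worst_case_mul_add_add(t: int = T) -> int:
    """Largest value of x*y + c + d when x, y, c, d are all t-bit words."""
    word_max = (1 << t) - 1
    return word_max * word_max + word_max + word_max


def worst_case_mul_accumulate(t: int = T) -> int:
    """Largest value of x*y + acc when x, y are t-bit and acc is 2t-bit."""
    word_max = (1 << t) - 1
    acc_max = (1 << (2 * t)) - 1
    return word_max * word_max + acc_max


# The first form tops out at exactly 2^64 - 1, so no carry bit is needed.
# The second form can need 65 bits, so a carry out of the accumulator occurs.
fits_without_carry = worst_case_mul_add_add() == (1 << 64) - 1
needs_carry = worst_case_mul_accumulate().bit_length() > 64
```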
In addition to the multiplication computations themselves, memory accesses also occur to read and store back the intermediary results in $R_{\mathrm{Temp}}$ and to read $R_A$, $R_B$, and $R_N$. Figure D shows a typical pseudocode implementation of the resulting inner loop of a modular multiplication using the described multiplier of 32 × 32 + 32 + 32 bits, where the R with a subscript represents a CPU internal register.
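Since the pseudocode of Figure D is not reproduced here, the following is only a sketch of what such an inner loop might look like, following the Figure C2 description: one multiply-add-add for $A_i B_j + R0 + U_j$, then one for $Q N_j + R1$ plus the lower part of the first result. The names `mac32`, `inner_loop`, and `words_to_int` are hypothetical, the quotient word `q` is assumed to be supplied by the caller, and the block shift by 2^t and the two's complement handling are left out.

```python
MASK32 = 0xFFFFFFFF


def mac32(x, y, c, d):
    """Model a 32 x 32 + 32 + 32 multiplier: return (upper, lower) words of x*y + c + d."""
    total = x * y + c + d
    assert total <= 0xFFFFFFFFFFFFFFFF  # always holds for 32-bit inputs
    return total >> 32, total & MASK32


def inner_loop(temp, a_i, b, q, m):
    """Return the words of temp + a_i * b + q * m, least significant word first.

    temp, b and m are equal-length lists of 32-bit words; a_i and q are
    single 32-bit words. The result has two more words than the inputs.
    """
    assert len(temp) == len(b) == len(m)
    r0 = r1 = 0                                   # upper parts carried between steps
    out = []
    for j in range(len(b)):
        r0, u = mac32(a_i, b[j], r0, temp[j])     # A_i * B_j + R0 + U_j
        r1, low = mac32(q, m[j], r1, u)           # Q * N_j + R1 + lower part
        out.append(low)
    top = r0 + r1                                 # flush both upper parts
    out.append(top & MASK32)
    out.append(top >> 32)
    return out


def words_to_int(words):
    """Recombine 32-bit words (least significant first) into one integer."""
    return sum(w << (32 * j) for j, w in enumerate(words))
```

Each loop iteration reads one word each of the three arrays, issues two multiplier operations, and writes one result word, with no separate carry handling between them.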
Ordinary compiler optimization techniques apply to the pseudocode in Figure D. One example is loop unrolling, whereby special instructions handle (in a minimum number of clock cycles) a counter increment and a conditional branch on the counter's value. Such instructions occur infrequently on 32-bit RISC smart card processors, but they're common in many other CPUs, for example, Intel's 8051, Motorola's 68020, and IBM/Motorola's PowerPC.
Mixing multiplication with memory accesses lets us optimize use of the multiplier, which can then be implemented in a multicycle version, thereby reducing die size.
Schemes similar to these have already been successfully implemented in dedicated hardwired coprocessors for smart cards [A5]. Nevertheless, handling modular multiplication in software, with the help of dedicated instructions, improves flexibility and reduces smart card chip die size and power consumption, eliminating the need for cryptocoprocessors.
Thus, with only a few modifications in 32-bit RISC processors (adding some simple instructions when necessary and slightly modifying the DSP-like multipliers), we can reach cryptocoprocessor speed while allowing greater flexibility and lowering cost.
