Abstract-'Encrypted computing' is an approach to preventing insider attacks by the privileged operator against the unprivileged user on a computing system. It requires a processor that works natively on encrypted data in user mode, and the security barrier that protects the user is hardware-based encryption, not access. We report on progress and practical experience with our superscalar RISC class prototype processor for encrypted computing and supporting software infrastructure. This paper aims to alert the secure hardware community that encrypted computing is possibly practical, as well as theoretically plausible. It has been shown formally impossible for operator mode to read (or write to order) the plaintext form of data originating from or being operated on in the user mode of this class of processor, given that the encryption is independently secure. Now we report standard Dhrystone benchmarks for the prototype, showing performance with AES-128 like a 433 MHz classic Pentium (1 GHz base clock), thousands of times faster than other approaches.
Introduction
If its arithmetic is modified appropriately, then a processor continues to operate correctly, but all its states become encrypted [1]. That means that encrypted data is read and written at encrypted addresses, and both data and addresses pass through the internal registers of the machine in encrypted form; that will be called here encrypted computing. Since the processor's outputs to, as well as its inputs from, memory (RAM) and other peripherals are encrypted, memory content is encrypted too.
When the processor runs an appropriate machine code instruction set that has provision for encrypted constants, it turns out to be formally impossible for the operator to infer either statistically or logically or even by experiment what the plaintext form of the encrypted data in such a processor is, despite having continuous read and write access to it and the program code (see Section 6). Those recent theoretical results have put on a firm footing the security of designs for encrypted computing that depend on a modified arithmetic, while it was always probable from an engineering point of view that such processors would run fast. That is because in principle only one piece of stateless logic, the arithmetic logic unit (ALU), need be changed and the rest may be appropriated from current design practice. This paper provides experimental data from our 'KPU' class prototype to support that view. If the reader is to take away one thing from this paper, it should be the understanding that in operator mode this kind of processor runs unencrypted, while in user mode it runs encrypted. Encryption, not access, is the security barrier that protects the user against the operator and thus all system 'insiders', whether the attacker is a subverted operating system or a bribed system administrator. Operator mode is a part of all our computing systems and it cannot be done without: it is the mode your processor switched to just now to take from disk the bytes that you are reading in your PDF viewer. It is still all-powerful in the class of processor discussed here, except that it 'runs unencrypted', with plaintext inputs, outputs and intermediate states, while user mode 'runs encrypted'. It is not claimed that user mode running cannot be interfered with, because it can. The operator can wipe memory, for example.

* Correspondence to Zhiming Liu, RISE, Southwest University, 2 Tiansheng Rd, Beibei, 400715 China.
The formal result is that the operator cannot know the plaintext form of user data (Section 6). That also turns out to imply it cannot be rewritten to order to a value that is defined independently, such as π, or the key for the encryption.
The approach is simple and its aim apart from fast running is to permit a clear security analysis. That has come out well [2] , [3] in the context of an appropriately designed processor instruction set and a compiler that leverages the instruction set to smooth out statistical biases (see Section 6). Without those two extra components, there would be known plaintext attacks (KPAs) [4] based on the principle that x−x=0 however encrypted, or on the statistic that human programmers use 0, 1 more than other values.
The medium-term goal here is a secure platform for remote computing in the cloud [5] , perhaps also for embedded systems such as automobiles or uranium centrifuges. Experience with the evolving system guides our understanding of what may be possible. Remote ('batch') computing is an initial target because predetermined program codes with a fixed number and depth of iterations are involved -'encrypted matrix multiplication' is a fashionable example [6] . Batch working allows the encryption to be changed between runs, which makes attacks harder, and it is more difficult to change the code or the encryption or its key in a continuously running and critical environment such as a driving automobile, but we now believe it will be possible. For batch mode, the remote user compiles and encrypts the program, sends it for remote execution along with encrypted inputs, and receives back encrypted outputs. The key is either already in the machine or loaded in public view via secure hardware, but that is not the concern of this paper on computation (see Section 2.11 for a discussion).
The studies here are based on a sequence of behavioural models, beginning in 2009 with a demonstration that dropping a changed ALU into a model in Java of a pipelined processor (http://sf.net/p/jmips/) gave rise to encrypted running (http://sf.net/p/kpu/). The confirming theory was not published until 2013 [1]. From 2014 to 2016, the or1ksim simulator (http://opencores.org/or1k/Or1ksim) for the OpenRISC (http://openrisc.io) processor architecture was modified to 64-bit and now 128-bit operation and cycle-accurate simulation covering the full processor pipeline. The aim in that has been to (i) demonstrate the KPU principle to engineers who may not understand or accept mathematical proofs and formally-oriented computer science, and (ii) explore its limits. With respect to (ii), it was unknown beforehand if conventional instruction sets and processor architectures would be compatible with the idea, and that may now be taken as confirmed in great part. It has also become clearer, however, that not every program can run encrypted in this context - compilers and programs that arithmetically transform the addresses of program instructions (as distinct from addresses of program data) must run unencrypted because program addresses are unencrypted. That design prevents KPAs on what would be encrypted but predictable address sequences. The largest application suite³ ported so far is 22,000 lines of C, but it and every other application ported (now about fifty) have worked well, to the authors' surprise.
Our models have provided a good haul of metrics via the burgeoning software infrastructure and the measures are reported here. The standard Dhrystone [7] v2.1 benchmark shows 104-140 MIPS running encrypted, matching a 433-583 MHz classic Pentium. But this paper's particular objective is to summarise the state of knowledge in a security engineering forum and to convince the reader that the approach does work, encouraging the community's increased focus and scrutiny.
The organisation of this article is as follows. Section 2 encapsulates the processor design and working in bullet points for the reader. The aim is to address early on what experience tells us are common misconceptions based on false analogies with known security devices. In particular, encrypted memory (also 'oblivious RAM', 'ORAM' [8]) has nothing to do with this - memory is not part of a processor. Nor does Intel's 'SGX'™ range of machines, which use keys to control access to different (encrypted) memory regions but do not run encrypted, have to do with it. Nor is there particular relevance for key-management here - why is discussed in 2.11 - and this paper does not discuss it. Security engineers may later design key management as they wish. The closest contemporary related experimental processor architectures are discussed in Section 3. Security engineering considerations in putting the principles into workable practice are described in Section 4 and this is the most relatable section for an engineer, but those details are not crucial as they can and will be changed to suit newer technologies. In particular two 'tricks' of implementation are described (first written down in 2016 in [9] and [10]) that restore good performance in this context to what is intentionally an old-fashioned processor architecture, but the idea is that specialists in computer architecture will do better than us by applying the same principles to more modern designs. Further hardware optimisations are described in Section 5. Section 6 sets out the modified RISC [11] instruction set that makes encrypted computation secure, in combination with an 'obfuscating' compiler, briefly described.

3. IEEE floating point test suite at http://jhauser.us/arithmetic/TestFloat.html.
Summary of design and working
This section summarises the class prototype's design and working in bullet points for the reader to refer back to.

2.1 Architecture. The prototype's organisation, described in [9], follows the classic single pipelined RISC design of [11], clocked at a nominal 1 GHz with 3 ns internal cache and 13.5 ns external RAM. Register layout and functionality conforms to OpenRISC v1.1 (see http://openrisc.io) with 32 general purpose registers (GPRs) and $2^{16}$ special purpose registers (SPRs). Some SPRs' control/monitor functions and access are modified for security as described in Section 4. Registers and buses are 64 or 128 bits wide (it differs per model) for encrypted 32-bit or unencrypted 64/128-bit data.
The prototypes have all had speculative branch execution and prediction, and data forwarding along the pipeline in the same clock (bypassing registers). Successive iterations have gained features such as on-the-fly instruction reordering. 2.2 Modes. The processor operates in the two classic modes: user and operator mode, as per the OpenRISC specification. User mode works encrypted on data that is 32-bit beneath the encryption and operator mode works unencrypted on 32/64/128-bit data. Operator mode has unrestricted access to all memory and registers (both GPRs and SPRs) and instructions. User mode has access to all GPRs and most SPRs following convention and the OpenRISC specification with slight additional 'off-limits' as specified in Section 4 (e.g., the SPR containing the processor version number is off-limits or else randomised for user mode). Memory access is not more restricted for user mode than is conventional (usually the top half of address space). Instruction semantics in the two modes is as specified by OpenRISC with the proviso that the data and data address are encrypted in user mode. However, program (not data) addresses in user mode remain unencrypted, and program instructions are wholly unencrypted except for embedded constants.
As is conventional, operator mode can switch to user mode (and not vice versa) by writing the status SPR.
2.3 Adversaries. The operator is the adversary who tries to read the user's data, and/or rewrite it. The notion of 'operator' is conflated with the operator mode of operation of the processor, in which instructions have access to every register and memory location. The idea is that, as the most privileged user, 'operator' stands in for all, in that user data that is secure from the operator is secure from all. 2.4 Successful attacks are where the operator reads the value of user data beneath the encryption, or writes to order an encrypted value containing an intended plaintext. It is also success to merely know with some probability above random chance (i.e., 1/2) what a specified bit of user data is beneath the encryption, and similarly for writing. The operator can do anything within the power of programmed instructions (the 'ABI'; application binary interface) to do. That might consist, for example, of writing debug system calls between instructions of the user program and rerunning it one step at a time on the same or different inputs. 2.5 Simulation. The open source OpenRISC 'Or1ksim' simulator, available from http://opencores.org/or1k/Or1ksim, has been modified to run the processor models. It is now a cycle-accurate pipeline simulator, 800,000 lines of C code having been written over 2 years real time and 25 years estimated software engineering effort, through 8 processor prototypes. The source code archive and development history is available at http://sf.net/p/or1ksim64kpu. 2.6 Instruction set. In user mode, the processor runs the 32-bit OpenRISC instruction set modified for encrypted operation. Opcodes and register indices are not encrypted, but a prefix instruction has been introduced to hold spillover from encrypted constants that do not fit in one 32-bit instruction. In operator mode, the (32-bit long) OpenRISC instructions for 64/128-bit arithmetic on unencrypted data are available. 2.7 Security of computation. 
Adapting all the standard OpenRISC instruction set for encrypted working has confirmed that it is possible to write (unencrypted, operator mode) operating system support for user programs (running encrypted). The operating system generally does not need the decryption of a user datum to do what is required (e.g., output it, encrypted, as is). But the experience has clarified that conventional instruction sets are inherently insecure with respect to the operator as adversary, who may steal an (encrypted) user datum x and put it through the machine's division instruction to get x/x, which is an encrypted 1. Then any encrypted y may be constructed to order by repeatedly applying the machine's addition instruction. By comparing the encrypted 1, 2, 4, etc. obtained with an encrypted z using the instruction set's comparator instructions (testing $2^{31} \le z$, $2^{30} \le z$, ... in turn and subtracting whenever it succeeds), the value of z can be efficiently deduced. This is a chosen instruction attack (CIA) [12]. Part of the novel contribution of this paper is a 'FxA' instruction set for encrypted RISC against which every attack logically fails, in that it provably does no better than guessing (see Section 6).

2.8 Encryption. The prototype models have been tested fitted with Rijndael-64 and -128 symmetric encryption (the latter is the US advanced encryption standard (AES) [13]), RC2-64 [14] and Paillier-72 [15]. The last is an additively homomorphic cipher that runs without keys in the processor. In principle any 'reasonable' block cipher with a block size that fits in the machine word may be integrated in the pipeline. The symmetric encryptions are supported by en-/decryption hardware fitted as several stages of the processor pipeline. For homomorphic encryptions a multistage arithmetic unit is used instead. All encryptions are one-to-many. For symmetric encryptions, pseudo-random padding under the encryption is generated by hashing operands.
For Paillier, 'blinding' multipliers are generated instead.⁶ The choice of encryptions has been dictated by the development path. The open source Or1ksim simulator had to be expanded from 32 bits to 64 (as well as made cycle-accurate and pipelined) and at that point 64-bit ciphers could be handled. The OpenRISC instructions require two 32-bit prefixes per instruction for 64 bits of encrypted data. Two prefixes are also sufficient for 72 bits of encrypted data, so Paillier-72 could be accommodated without further toolchain changes, but it meant doubling processor path widths from 64 to 128 bits to hold 72-bit data. AES-128 then became possible, requiring four 32-bit prefixes per instruction.
Paillier-72 is insecure in practical terms but has served to investigate use of a homomorphic encryption in this setting. Paillier does not become as secure as AES-128 until 2048 bits, but 2048-bit Paillier arithmetic would use too many pipeline stages for practicality. Nevertheless, the closest competing design is HEROIC [17] (see Section 3), a stack machine running encrypted with a 'one instruction' machine code and 2048-bit words encrypting 16 bits of data. It does 2048-bit Paillier arithmetic in hardware, so it is possible (HEROIC runs 4000 cycles of 200 MHz hardware per arithmetic operation). 2.9 Toolchain. The existing GNU gcc v4.9.1 compiler (github.com/openrisc/or1k-gcc) and gas v2.24.51 assembler (github.com/openrisc/or1k-src/gas) ports for OpenRISC v1.1 have been adapted for the encrypted instruction set. Executables are standard ELF format. The source codes are at sf.net/p/or1k64kpu-gcc and sf.net/p/or1k64kpu-binutils. Only the assembler needs to know the encryption key.
For more control over internals, we are writing our own obfuscating, encrypting, compiler toolchain from scratch (http://sf.net/p/obfusc). The toolchain comprises C compiler, assembler, linker, virtual machine and object code reader. ANSI C and most GNU extensions are currently supported, apart from union types and computed gotos. The major limitation is that pointers must be declared together with a memory zone into which they point and are confined to. 2.10 Limits. Word width (i.e., encryption block size) up to 2048 bits is contemplated with current technology. Memory paths would need to be appropriately broadened.
2.11 Key management.
There is no means to read keys once they have been embedded in the processor, where they configure the hardware functions. As the design nears production, a decision will be taken to embed keys at manufacture, as with Smart Cards [18], or use a Diffie-Hellman circuit [19] that safely loads the key in public view.
Note there is no direct consequence of running with the wrong key because if user A runs with user B's key, then user A's program will produce rubbish, as the processor arithmetic will be meaningless; if user A runs user B's program while user B's key is in the machine, then the output will be encrypted for user B's key, and the input will need to be encrypted in user B's key, and user A can neither supply nor understand that. Security in this context depends not on access but on whether A, who may be the operator, can leverage observations of B's computations to learn about the encryption, and that is answered in Section 6 - negatively, for the instruction set architecture described.

2.12 Hardware-level security. A low-level hardware protocol described in [10] in 2016 (see 4.3) is proved in [10] to guarantee that data originating in user mode can never be seen in unencrypted form in operator mode, and conversely.

Stack machines are different from conventional von Neumann architectures but there have been hardware prototypes [20] aimed at Java. HEROIC works by substituting 16-bit addition by multiplication of 2048-bit encrypted numbers modulo a 2048-bit modulus m. Multiplying above the Paillier encryption E is the same as adding beneath the encryption:

$$E[x] \cdot E[y] \bmod m = E[x + y] \qquad (+)$$
A difficulty is that the addition on the right is not mod $2^{16}$, so the sum has to be renormalised mod $2^{16}$ under the encryption, which accounts for half the cycles taken. It is done by subtracting $2^{16}$ via (+) and looking up a 'table of signs' for encrypted numbers to see if the result is negative or positive. To facilitate that, HEROIC encryption is one-to-one, not one-to-many,⁶ or the table would be too large. It is already 16M bytes in size ($2^{16} \times 2048$ bits). The same table is also used for comparison operations (less than, etc).
That technique is also used with Paillier in our KPU models, except that the table of signs is too large to site locally with current technology (at $2^{32} \times 72$ bits times the number of aliases per encryption), so signs are calculated outside the simulation and cached.

6. Paillier may embed 'blinding factors' in the encryption. Those are multipliers $r^n \bmod m$, where $n = pq$ and $m = n^2$ is the public modulus. Paillier decryption involves raising to the power of the order $\phi = (p-1)(q-1)$ of the multiplicative group mod $n$, so $r^n$ becomes $r^{\phi n} = (1 + kn)^n = 1 + kn^2 + \ldots = 1 \bmod n^2$ and does not affect the decrypted value. HEROIC's one-to-one encryption does not use variable blinding.
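The relation (+) and the renormalisation step can be illustrated with a toy Paillier instance. The following is a minimal sketch only, not HEROIC's or the KPU's implementation: the primes, the generator choice 1+n, and the function names are all assumptions made for illustration, and the 'table of signs' is replaced by a stand-in that simply decrypts.

```python
import math

# Toy parameters (assumed for illustration; far too small for real security).
P, Q = 1009, 1013
N = P * Q                    # n = pq
M = N * N                    # public modulus m = n^2
PHI = (P - 1) * (Q - 1)      # order of the multiplicative group mod n
PHI_INV = pow(PHI, -1, N)    # modular inverse; needs Python 3.8+
WORD = 1 << 16               # 16-bit arithmetic beneath the encryption

def encrypt(x, r=1):
    """E[x] = (1+n)^x * r^n mod m. With r = 1 the encryption is one-to-one."""
    assert math.gcd(r, N) == 1
    return pow(1 + N, x % N, M) * pow(r, N, M) % M

def decrypt(c):
    """Raising to the power phi removes the blinding factor r^n."""
    u = pow(c, PHI, M)       # equals 1 + (x * phi mod n) * n
    return (u - 1) // N * PHI_INV % N

def is_negative(c):
    """Stand-in for the 'table of signs'; here it cheats by decrypting."""
    return decrypt(c) > N // 2

MINUS_WORD = encrypt(N - WORD)   # E[-2^16], negative numbers taken mod n

def add16(cx, cy):
    """Encrypted 16-bit addition: multiply above, then renormalise mod 2^16."""
    s = cx * cy % M              # E[x + y], not yet reduced mod 2^16
    t = s * MINUS_WORD % M       # E[x + y - 2^16]
    return s if is_negative(t) else t
```

For example, `decrypt(add16(encrypt(65535), encrypt(2)))` yields 1, because the plain sum 65537 is brought back into 16 bits by the subtraction of $2^{16}$. Passing different values of `r` to `encrypt` gives different ciphertexts for the same plaintext, which is the one-to-many behaviour that blinding provides.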
Encrypted multiplication and other operations are subroutines under Paillier. The 'selling point' is that (+) means that the modified arithmetic in the processor needs no keys. Despite the headline, the table of signs amounts to a key and it must be changed per user. 3.1 Ascend [21] protects instructions and data from the operator by both cryptographic and physical means. Code on the way to the processor is encrypted, data I/O is encrypted and the processor runs in 'Fort-Knox'-like isolation, matching pre-defined statistics on observables. Communication with RAM is encrypted.
Physical isolation plus encrypted memory has emerged many times as an idea (e.g., [22]) and success means doing it as well as Ascend does. Otherwise side-channels such as cache-hit statistics [23] and power drain [24] can leak information. Ascend runs RISC MIPS instructions [25] and slows down by 12-13.5× in encrypted mode with AES-128 (absolute speeds are not given in [21]), as compared to 10-50% slowdown for our models (Section 5).

3.2 Intel's SGX™ ('Software Guard eXtensions') processor technology [26] is often cited in relation to secure computation in the cloud, because it enforces separations between users. However, the mechanism is key management to restrict users to memory 'enclaves'. While the enclaves may be encrypted because there are encryption/decryption units on the memory path, that is encrypted and partitioned storage, a venerable idea [27], not encrypted computing.
SGX machines are used [28] by cloud service providers where assurance of safety is a marketing point. But that is founded in customer belief in electronics designers 'getting it right' rather than mathematical analysis and proof, as for our and HEROIC's technologies (see Section 6). Engineering may leak secrets via timing variations and power use and SGX has recently fallen victim [29].

3.3 IBM's efforts at making practical encrypted computation using very long integer lattice-based fully homomorphic encryptions (FHEs) based on Gentry's 2009 cipher [30] deserve mention. An FHE E extends the Paillier equation (+) to multiplication on the right. But it is 1-bit, not 16- or 32-bit arithmetic under the encryption. The 1-bit logic operations take of the order of 1 s [31] on customised vector mainframes with a million-bit word, about equivalent to a 0.003 Hz Pentium, but it may be that newer FHEs based on matrix addition and multiplication [32] will be faster. The obstacle to computational completeness is that which HEROIC overcomes with its 'table of signs': encrypted comparison with plain 1/0 output is needed, as well as the encrypted addition (and multiplication), but HEROIC's solution is not feasible for a million-bit encryption.

3.4 Moat electronics. Classically, information may leak indirectly via processing time and power consumption, and 'moat technology' [33] to mask those channels has been developed for conventional processors. The protections may be applied here too, but there is really nothing to protect in terms of encryption as encrypted arithmetic is done in hardware, always taking the same time and power. There are separate user- and operator-mode caches in our models, and statistics are not available to the other mode, so side-channel attacks based on cache hits [23] are not available.

3.5 Oblivious RAM (ORAM) [8] and its evolutions [34] is often cited as a defense against dynamic memory snooping.
That is in contrast to static snooping, so-called 'cold boot' attacks [35] -physically freezing RAM chips to retain the contents when power is removed -against which HEROIC, SGX and our technology defend because memory content is encrypted. Also, in our technology, data addresses are encrypted and vary during running. ORAM extends that by continuously remapping the logical to physical address translation, taking care of aliasing, so access patterns are masked. It also hides programmed accesses among randomly generated accesses. But it is no defense against an attacker with a debugger, who does not care where the data is stored. It does not defend against the operator and operating system, as the technology here does.
KPU hardware security and engineering
Two 'tricks' of KPU design for good performance were described in [9] in 2016 and are summarised in 4.1, 4.2. 4.1 Dual pipeline configuration. There are two configurations of the pipeline, 'A' and 'B', for encrypted running with symmetric encryption (Fig. 2) . There is only space for one (multi-stage) encryption/decryption unit and some instructions need encryption after the execute stage (A), some need decryption before (B), and the pipeline is flipped to match. A variant A config. is used for Paillier (Fig. 2 top) . 4.2 The arithmetic logic unit (ALU) operation is extended in the time dimension to cover a series of consecutive (encrypted) arithmetic operations in user mode. The first of a series is associated with a decryption event and the last with an encryption event (note that by 'arithmetic operations' is meant the arithmetic stages of individual instructions, not the whole instructions). That reduces the frequency with which the encryption/decryption unit is used. 4.3 ALU operation as per 4.2 is supported by a low-level hardware protocol described in [10] in 2016. Shadow sets of registers and caches (Fig. 1) . Two extra register bits track data origins. 4.5 Security. In [10] , principles (a-c) plus (*) are shown to guarantee (c.f. 2.12) operator mode never sees in unencrypted form data originating in user mode and vice versa. Further guarantees obtain at higher level as described in Section 6. 4.6 Multiuser. Changing the encryption key signals a change of user and empties the shadow registers, so one user cannot gain access to another's unencrypted data in registers, but in any case the argument in 2.11 says that access is not an issue in itself and the instruction set is the actual danger (the fixed instruction set is given in Section 6). 4.7 Further modifications to conventional design include an address translation look-aside buffer (TLB) in two parts. 
The conventional TLB is now a back-end (it remaps addresses 4096 at a time) and a new front-end maps individual encrypted addresses to a backed range in first-come, first-served order. As data that will be accessed together tends to be first accessed close together, this re-establishes the effectiveness of cache readahead though encrypted addresses are spread randomly over the whole cipherspace. The TLB front-end will eventually be limiting, but it does not affect well-designed programs whose footprint fits in cache. It has no knowledge of the plaintext address, and remaps on writes.
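The first-come, first-served mapping performed by the TLB front-end can be pictured with a small model. This is a sketch under assumptions, not the hardware design: the class name, the word-granular slots and the fixed capacity are hypothetical, and the remapping on writes mentioned above is not modelled.

```python
class FrontEndTLB:
    """Toy model: map encrypted addresses to consecutive slots of a backed range."""

    def __init__(self, base, slots):
        self.base = base      # first address of the backed range (assumed)
        self.slots = slots    # capacity of the front-end (assumed fixed)
        self.table = {}       # encrypted address -> address in backed range

    def translate(self, enc_addr):
        # The encrypted address is treated as an opaque tag; the plaintext
        # address beneath the encryption is never consulted.
        if enc_addr not in self.table:
            if len(self.table) >= self.slots:
                raise MemoryError("front-end TLB full")
            # First come, first served: the next free slot in order of arrival.
            self.table[enc_addr] = self.base + len(self.table)
        return self.table[enc_addr]
```

Two encrypted addresses that are first touched one after the other land in adjacent slots, whatever their numerical distance in the cipherspace, which is what lets cache readahead become effective again.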
Performance
The original Or1ksim OpenRISC test suite codes (written mostly in assembler) established benchmarks for early prototypes, when no or very rudimentary code (C) compilation was available. Most modern performance suites still cannot be compiled because they rely on support such as linear programming and math floating point libraries, as well as system support such as 'printf'. If those could be ported in good time, debugging would take months (the original OpenRISC gcc compiler has bugs, such as sometimes not doing switch statements right, sometimes not initialising arrays right, etc.). In particular the well-known 'spec' benchmark suite is unavailable because its source code is commercially protected. Some less evolved benchmarks are running, in particular Dhrystone v2.1. Table 1 shows baseline performance (left; L) in the instruction set add test of the suite, with RC2 64-bit symmetric encryption, repeating the 2016 test in [10] so progress since can be seen. The 64:16:20 mix for arithmetic:load/store:control instructions (no-ops and prefixes ignored) is close to the 60:28:12 mix in the standard textbook [36] . At the time of the 2016 test, the program spent 54.8% of the time in user mode, and 52.7% now, which is 2.1/54.8 = 4% better encrypted running. Pipeline occupation is now 1−20.7/52.7 = 60.7% in encrypted mode, for 607Kips (instructions per second) with the 1 GHz clock.
The top right subtable shows that individual branch records (hits) gain little (44/10) over aggregated data (misses; 35/9). The middle right subtable shows all data is write-before-read (read hits 100%) and near all (99.7%) writes are repeats to a few (0.3%) locations. The crypto table shows that most en-/decryptions are elided via writeback caching. The raw numbers would be 2,942 (store) and 25,995 (load+immed Paillier arithmetic takes the length of the pipeline to complete and that stalls instructions behind that need the result until the instruction ahead has finished, leaving the pipeline mostly empty, and that accounts for what is observed. The disparity with symmetric encryption is as great on all tests and since at 72 bits the Paillier implementation can only be proof-of-concept, the remainder of this text will concentrate on performance with symmetric encryptions. Performance with them is sensitive to data-forwarding along the pipeline.
Turning off forwarding and instruction reordering shows 33% of processor speed is due to forwarding, while reordering gives another 3% (Table 3 ; top left entry from Table 1) : Since the 2016 account in [9] three solutions tailored to the architecture and its bottlenecks have been implemented: (a) instructions with trivial functionality in the execute phase (e.g., cmov, the 'conditional move' of one register's data to another) but stalled in read stage now speculatively proceed on the assumption that they will be able to pick up the data via forwarding later during their progress through the pipe; 7 (b) the fetch stage has been doubled to get two instructions per cycle and adjoin the prefix information to the prefixed instruction instead of taking up pipeline slots in its own right; (c) a second pipeline has been introduced to speculatively execute both sides of a branch. A branch in a branch falls back to predictive speculation in one pipeline. Statistics changes are made only on branch confirmation.
'Flexible staging' (a) drops the count from 296368 to 259349 cycles and then (b), (c) contribute as in Table 4 : Branching both ways (c) was ineffective because only 3717 branches mis-predicted, but harder-to-predict code benefits.
Those RC2-64 tables provide approximate numbers for AES-128 via the classic Dhrystone v2.1 benchmark in Table 5 : By the measure, the AES-128 prototype is running as a 433 MHz classic Pentium, or 200 MHz Pentium M. 8 The results are compiler-sensitive, as shown by variation through optimisation levels O0-O6 for the Pentium M in Table 5 , and our compiler is rudimentary. The slowdown for 128-bit AES over 64-bit RC2 is due to the 4, not 2, prefixes for an immediate constant in each instruction carrying immediate data. It emphasises that compilers for encrypted instruction sets must avoid inline data in instructions. The RC2 prototype equates to a 583 MHz classic Pentium, 266 MHz Pentium M.
The results may be extrapolated as required to more pipeline stages: Fig. 3 shows that each stage costs 3.1% in the baseline, but 1.7% with hardware optimisation.

7. The 'assumption' is logically impeccable: the data needed is supplied by an instruction ahead, which will finish before this instruction does and therefore furnish the data while this one is still moving through the pipeline.

8. See Dhrystones table at http://www.roylongbottom.org.uk/dhrystone%20results.htm.

[Fig. 3 caption, partly recovered: Table 1 (section at vertical line) against number of stages occupied by the encryption/decryption unit.]

[Table 6 columns: fields, semantics. Legend: r is a register index, k is a 32-bit integer, j is an instruction address increment, '←' is assignment.]
FxA Instruction Set
Standard instruction sets are insecure for encrypted working (recall the chosen instruction attack of 2.7), but the 'one instruction' HEROIC instruction set turns out to be safe.
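The chosen instruction attack of 2.7 can be made concrete with a toy model. This is a sketch under stated assumptions: `ToyMachine` is hypothetical, its affine 'cipher' is only a stand-in for a real block cipher, and the comparison is assumed to return an outcome the operator can observe (as a conventional branch would reveal). The attacker function uses only the machine's instructions and never the key.

```python
class ToyMachine:
    """Conventional instruction set over encrypted 32-bit data (toy model)."""
    _MOD = 1 << 32
    _A, _B = 0x9E3779B1, 0x7F4A7C15      # secret toy key; _A is odd

    def enc(self, x):                    # user side only; not used by the attack
        return (x * self._A + self._B) % self._MOD

    def _dec(self, c):                   # internal to the hardware
        return (c - self._B) * pow(self._A, -1, self._MOD) % self._MOD

    def div(self, a, b):
        return self.enc(self._dec(a) // self._dec(b))

    def add(self, a, b):
        return self.enc((self._dec(a) + self._dec(b)) % self._MOD)

    def sub(self, a, b):
        return self.enc((self._dec(a) - self._dec(b)) % self._MOD)

    def le(self, a, b):                  # unsigned a <= b, outcome observable
        return self._dec(a) <= self._dec(b)


def chosen_instruction_attack(machine, z, x):
    """Recover the plaintext of z using any stolen nonzero encrypted datum x."""
    powers = [machine.div(x, x)]                 # x/x is an encrypted 1
    for _ in range(31):                          # encrypted 2, 4, ..., 2^31
        powers.append(machine.add(powers[-1], powers[-1]))
    value = 0
    for i in reversed(range(32)):                # test 2^31 <= z, 2^30 <= z, ...
        if machine.le(powers[i], z):
            z = machine.sub(z, powers[i])        # subtract whenever it succeeds
            value |= 1 << i
    return value
```

The attack takes one division, 31 additions and at most 32 comparisons and subtractions, so it is efficient regardless of how strong the underlying cipher is. That is the sense in which the danger lies in the instruction set rather than in the encryption.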
Denote by a fused anything and add (FxA) instruction set one where arithmetic instructions always add embedded constants $-k_1$, $-k_2$ to operands $x_1$, $x_2$ and an embedded constant $k_0$ to the result. So FxA multiplication does

$$x_0 \leftarrow (x_1 - k_1) \times (x_2 - k_2) + k_0$$

An FxA instruction set for encrypted working is shown in Table 6. Instructions such as addition need only one constant, as

$$(x_1 - k_1) + (x_2 - k_2) + k_0 = x_1 + x_2 + k, \qquad k = k_0 - k_1 - k_2$$
HEROIC's instructions are a subset. The processor hardware, in addition to what is stated in 4.5, enforces the following conditions described in [3] on interactions via the ABI: (1) Each instruction's action is a black box; (2) each user mode instruction is observed to read and write data in encrypted form; (3) instructions support FxA semantics as described above; (4) The supporting argument in [3] depends on the operator not being able to interpret anything from changes in encrypted data or instruction constants. However, HEROIC's one-to-one encryption maps collisions to equalities underneath the encryption, invalidating the assumption. The objection is met by an obfuscating compiler described in [3] that causes the data under the encryption to vary after recompilation. 'The (obfuscating) compiler did it' is a valid cover for runtime cipherspace collisions. The compiler uses the FxA instructions to vary the runtime data at location l, line p by a different offset each time the source code is recompiled. The proviso is because the variations where control paths meet must be equal, or computation would not work.
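How the embedded constants let a compiler shift runtime data by arbitrary offsets can be seen in a plaintext model of the FxA semantics. This is a sketch only: the function names and the 32-bit wraparound are assumptions for illustration, the arithmetic is shown beneath the encryption (no cipher is modelled), and the offsets `dx`, `dy`, `dz` are hypothetical names for the per-location variations a compiler might choose.

```python
import random

MASK = 0xFFFFFFFF   # 32-bit arithmetic beneath the encryption (assumed)

def fxa_mul(x1, x2, k0, k1, k2):
    """FxA multiplication: subtract k1, k2 from the operands, add k0 to the result."""
    return ((x1 - k1) * (x2 - k2) + k0) & MASK

def fxa_add(x1, x2, k):
    """FxA addition needs one constant, k = k0 - k1 - k2."""
    return (x1 + x2 + k) & MASK

def compile_mul(dx, dy, dz):
    """Choose constants so operands offset by dx, dy give a result offset by dz."""
    return dict(k0=dz, k1=dx, k2=dy)

# A hypothetical compilation: the true values x, y never appear at runtime,
# only the offset values x + dx and y + dy.
x, y = 6, 7
dx, dy, dz = (random.getrandbits(32) for _ in range(3))
stored_x, stored_y = (x + dx) & MASK, (y + dy) & MASK
stored_z = fxa_mul(stored_x, stored_y, **compile_mul(dx, dy, dz))
true_z = (stored_z - dz) & MASK   # only the compiler's owner knows dz
```

Here `true_z` equals 42 whatever offsets were drawn, while `stored_z` differs on each recompilation. An observer who sees only the runtime data and instruction constants, in encrypted form, has no fixed value to anchor a guess on.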
For example, the paradigmatic Ackermann function [37] compiles to FxA code that runs with the trace shown in Table 7 for arguments (3, 1) . Although the source contains only the constants 0, 1, the trace illustrates well that instructions have been emitted with random embedded constants (the decrypted form is shown in the table, with E[-] indicating encryption). The trace shows that in consequence (encrypted) 'random' data values are written to registers before the return value (encrypted) 13 is written. Fact 2 formally implies semantic security of runtime data from the operator [38] . I.e., no attack does better than guessing.
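For reference, the return value in the trace can be checked against a plain, unobfuscated definition of the function. This sketch uses the common two-argument (Ackermann-Péter) form, which is assumed to be the one compiled; it says nothing about the FxA code actually emitted.

```python
def ackermann(m, n):
    """Two-argument Ackermann function, written directly from its recurrence."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

result = ackermann(3, 1)   # the arguments used for the trace
```

The source uses only the constants 0 and 1, and `result` is 13, in agreement with the (encrypted) return value reported for the trace.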
It is planned for the KPU pipeline to split FxA instructions into an internal stream of OpenRISC μ-instructions.
Conclusion
This paper aims to communicate to the secure hardware community that encrypted working in near conventional processor designs is a real possibility. Our simple superscalar pipelined 32-bit OpenRISC KPU prototype architecture is described, but computer engineers should be able to apply the design principle more generally: it is that an appropriately modified arithmetic generates encrypted working.
AES-encrypted computing benchmarks like a 433 MHz Pentium with our prototype running a 1 GHz clock. An 'FxA' modified RISC instruction set has been introduced in which every program and trace may be interpreted arbitrarily. In conjunction with an 'obfuscating' compiler as briefly described, it formally makes encrypted computing as safe mathematically as the encryption key is physically.
