Abstract: Root-of-Trust (RoT) establishment ensures either that the state of an untrusted system contains all and only content chosen by a trusted local verifier and the system code begins execution in that state, or that the verifier discovers the existence of unaccounted-for content. This ensures that programs boot into system states that are free of persistent malware. An adversary can no longer retain undetected control of one's local system. We establish RoT unconditionally; i.e., without secrets, trusted hardware modules and instructions, or bounds on the adversary's computational power. The specification of a system's chipset and device controllers, and an external source of true random numbers, such as a commercially available quantum RNG, is all that is needed. Our system specifications are those of a concrete Word Random Access Machine (cWRAM) model, the closest computation model to a real system with a large instruction set.
I. INTRODUCTION
Suppose a user has a trustworthy program, such as a formally verified micro-kernel [37] or a micro-hypervisor [73] , and attempts to boot it into a specific system state. The system state comprises the contents of all processor and I/O registers and random access memories of a chipset and peripheral device controllers at a particular time; e.g., before boot. If any malicious software (malware) can execute instructions anywhere in the system state, the user wants to discover the presence of such malware with high probability.
This goal has not been achieved to date. System components that are not directly addressable by CPU instructions or by trusted hardware modules enable malware to survive in non-volatile memories despite repeated power cycles, secure- and trusted-boot operations [56]; i.e., malware becomes persistent. For example, persistent malware has been found in the firmware of peripheral controllers [15], [43], [67], network interface cards [16], [17], disk controllers [5], [48], [60], [77], USB controllers [2], as well as routers and firewalls [5]. Naturally, persistent malware can infect the rest of the system state, and thus a remote adversary can retain long-term undetected control of a user's local system.

Now suppose that the user attempts to initialize the local system state to content that s/he chooses; e.g., malware-free code, or I/O register values indicating that the system is disconnected from the Internet. Then, the user wants to verify that the system state, which may have been infected by malware and hence is untrusted, has been initialized to the chosen content.
Root of trust (RoT) establishment on an untrusted system ensures that a system state comprises all and only content chosen by the user, and the user's code begins execution in that state. All implies that no content is missing, and only that no extra content exists. If a system state is initialized to content that satisfies security invariants and RoT establishment succeeds, a user's code begins execution in a secure initial state. Then trustworthy OS programs booted in a secure initial state can extend this state to include secondary storage and temporarily attached (e.g., USB) controllers. If RoT establishment fails, unaccounted for content, such as malware, exists. Hence, RoT establishment is sufficient for (stronger than) ensuring malware freedom and necessary for all software that needs secure initial states, such as access control and cryptographic software. However, as with secure and trusted boot, the trustworthiness of the software booted in secure initial states is not a RoT establishment concern.
Unconditional Security. In this work we establish RoT unconditionally; i.e., without secrets, trusted hardware modules and special instructions (e.g., TPMs [71] , ROMs [18] , [31] , SGX [14] ), or polynomial bounds on an adversary's computing power. By definition, a solution to a security or cryptography problem is unconditional if it depends only on the existence of physical randomness [10] and the ability to harvest it [30] , [59] . Unconditional security solutions have several fundamental advantages over conditional ones. For example:
• they are independent of any security mechanism, protocol, or external party whose trustworthiness is uncertain; e.g., a mechanism that uses a secret key installed in hardware by a third party depends on the unknowable ability and interest of that party to protect key secrecy.
• they limit any adversary's chance of success to provably low probabilities determined by the defender; i.e., they give a defender undeniable advantage over the adversary.
• they are independent of the adversary's computing power and technology used; e.g., they are useful in post-quantum computing.
In unconditional RoT establishment all the user needs is an external source of non-secret physical randomness, such as one of the many commercially available quantum random number generators, and correct system specifications. That correct system specifications are indispensable for solving any security and cryptography problem has been recognized for a long time.
As security folklore paraphrases a well-known quote [76]: "a system without specifications cannot be (in)secure: it can only be surprising." For RoT establishment, specifications are necessarily low-level: we need a concrete Word Random Access Machine (cWRAM) model of computation (viz., Appendix A), which is the closest model to a real computer system. It has a constant word length, up to two operands per instruction, and a general instruction-set architecture (ISA) that includes I/O operations and multiple addressing modes. It also supports multiprocessors, caches, and virtual memory.

Contributions and Roadmap. We know of no other protocols that establish RoT provably and unconditionally. Nor do we know any other software security problem that has been solved unconditionally in any realistic computational model. This paper is organized as follows.
Requirements Definition (Section II). We define the requirements for RoT establishment, and provide the intuition for how to jointly satisfy them to establish malware-free states and then RoT. In Section VIII we show that these requirements differ from those of past attestation protocols; i.e., some are stronger and others weaker than in past software-based [7] , [39] , [63] , [64] , [66] , cryptographic-based [8] , [18] , [21] , [31] , [38] , [53] , and hybrid [43] , [78] attestation protocols.
New Primitive for establishing malware-free states (Section IV). To support establishment of malware-free system states, we introduce a new computation primitive with optimal space (m)-time (t) bounds in adversarial evaluation on cWRAM, where the bounds can scale to larger values. The new primitive is a randomized polynomial, which has k-independent uniform coefficients in a prime order field. It also has stronger collision properties than a k-independent (almost) universal hash function when evaluated on cWRAM. We use randomized polynomials in a new verifier protocol that assures deterministic time measurement in practice (Section VI). Preliminary measurements (Section VII) show that their performance is practical on commodity hardware even for very large k; i.e., k = 64.
RoT establishment (Section V). Given malware-free states, we provably establish RoT and provide secure initial states for all software. This requirement has not been satisfied since its identification nearly three decades ago; e.g., see the NSA's Trusted Recovery Guideline [51] , p. 19, of the TCSEC [50] .
Optimal evaluation of polynomials (Section III). We use Horner's rule to prove concrete optimal bounds of randomized polynomials in the cWRAM. To do this, we prove that a Horner-rule program is uniquely optimal whenever the cWRAM execution space and time are simultaneously minimized. This result is of independent complexity interest since Horner's rule is uniquely optimal only in infinite fields [9] but is not optimal in finite fields [35] .
II. REQUIREMENTS DEFINITION
To define the requirements for RoT establishment we use a simple untrusted system connected to a trusted local verifier.
Suppose that the system has a processor with register set R and a random access memory M. The verifier asks the system to initialize M and R to chosen content. Then the verifier sends a random nonce, which selects $C_{nonce}$ from a family of computations $C_{m,t}(M, R)$ with space and time bounds m and t, and challenges the system to execute computation $C_{nonce}$ on input (M, R) in m words and time t. Suppose that $C_{m,t}$ is space-time (i.e., m-t) optimal, result $C_{nonce}(M, R)$ is unpredictable by an adversary, and $C_{nonce}$ is non-interruptible. If $C_{m,t}$ is also second pre-image free and the system outputs result $C_{nonce}(M, R)$ in time t, then after accounting for the local communication delay, the verifier concludes that the system state (M, R) contains all and only the chosen content. Intuitively, second pre-image freedom and m-t optimality can jointly prevent an adversary from using fewer than m words or less time than t, or both, and hence from leaving unaccounted-for content (e.g., malware) or executing arbitrary code in the system.
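The challenge-response exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `c_nonce` and `verify`, the modulus `P`, and the use of a simple polynomial hash as a stand-in for $C_{nonce}$ are hypothetical, and the `secrets` module stands in for the external source of true random numbers. The sketch makes no claim of space-time optimality; it shows only the two checks the verifier performs (result value and timeliness).

```python
import secrets
import time

P = 2**31 - 1  # a prime modulus, chosen here only for illustration


def c_nonce(nonce, memory, registers):
    """Stand-in for C_nonce(M, R): a polynomial hash of the state at point `nonce`."""
    acc = 0
    for word in list(registers) + list(memory):
        acc = (acc * nonce + word) % P
    return acc


def verify(device_respond, expected_memory, expected_registers, time_bound_s):
    """Send a fresh nonce, time the reply, then check result value and timeliness."""
    nonce = secrets.randbelow(P - 1) + 1  # stand-in for a true random number
    start = time.perf_counter()
    result = device_respond(nonce)
    elapsed = time.perf_counter() - start
    expected = c_nonce(nonce, expected_memory, expected_registers)
    return result == expected and elapsed <= time_bound_s
```

A device whose state matches the chosen content passes both checks, whereas a device with even one altered word returns a different value for this stand-in computation.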
When applied to multiple device controllers, the verifier's protocol must ensure that a controller cannot help another undetectably circumvent its bounds by executing some part of the latter's computation; e.g., act as an on-board proxy [43] .
A. Adversary
Our adversary can exercise all known attacks that insert persistent malware into a computer system, including having brief access to that system to corrupt software and firmware; e.g., an extensible firmware interface (EFI) attack [52] by an "evil maid." Also, it can control malware remotely and extract all software secrets stored in the system via a network channel. Malware can read and write the verifier's local I/O channel, but does not have access to the verifier's device and external source of true random numbers.
For unconditional security, we assume that the adversary can break all complexity-based cryptography but cannot predict the true random numbers received from the verifier. Also, the adversary's malware can optimize $C_{m,t}$'s code on-the-fly and at no cost; e.g., without being detected by the verifier. Furthermore, the adversary can output the result of a different computation that lowers t or m, or both, while attempting to return a correct $C_{nonce}(M, R)$ result.
B. Code Optimality in Adversary Execution
Concrete-Optimality Background. Recall that a computation's upper time and space bounds are given by an algorithm for that computation whereas the lower bounds are given by a proof that holds for all possible algorithms for it. An algorithm is space-time optimal if its bounds match the space and time lower bounds of its computation.
Note that a verifier can use neither $C_{m,t}$ computations that have asymptotic lower bounds nor ones that have only theoretical ones; i.e., bounds that cannot be matched by any program, as illustrated below. If $C_{m,t}$'s lower bounds are asymptotic, a verifier can never prove that an adversary is unable to find an algorithm with better concrete bounds, by improving the constants hidden in the asymptotic characterizations. If the verifier measures the computation time against a theoretical lower bound, it returns 100% false positives and renders verification useless. If it measures time against a value that exceeds the theoretical lower bound, it can never prove that an adversary's code couldn't execute faster than the measured time, which renders verification meaningless. If the memory lower bound is theoretical and the adversary can exercise space-time (m-t) trade-offs, a time measurement dilemma may arise again: if m is scaled up to a practical value, t may drop to a theoretical one.
A verifier needs $C_{m,t}$ algorithms with concrete (i.e., non-asymptotic) space-time optimal bounds in realistic models of computers; e.g., models of general ISAs, caches and virtual memory, and instruction execution that accounts for I/O and interrupts, multiprocessors, pipelining. If such algorithms are available, the only verifier challenge is to achieve precise space-time measurements, which is an engineering, rather than a basic computation complexity, problem; viz., Section VI. In practice, finding such $C_{m,t}$ algorithms is far from a simple matter. For example, in Word Random Access Machine (WRAM) models, which are closest to real computers (e.g., Appendix A), the lower bounds of even simple computations such as static dictionaries are asymptotic even if tight [1], [49]. For more complex problems, such as polynomial evaluation, lower bounds in WRAM have been purely theoretical. That is, they have been proved in Yao's cell (bit) probe model [75], where only references to memory cells are counted, but not instruction execution time. Hence, a WRAM program can never match these lower bounds¹; see Related Work, Section VIII. Concretely optimal algorithms exist for some classic problems in computation models that are limited to very few operations; e.g., Horner's rule for polynomial evaluation. However, lower bounds in such models do not hold in a WRAM model with a general ISA or a real processor. For instance, lower bounds for integer gcd programs obtained using integer division (or exact division and mod [47]) can be lowered in modern processors where an integer division by a known constant can be performed much faster by integer multiplication [27], [32]; also a right shift can replace division by a power of two. Furthermore, a $C_{m,t}$ program must retain its optimality when composed with system code; e.g., initialization and I/O code. Its lower bounds must not be invalidated by the composition.
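The remark about division by a known constant can be made concrete with a small Python sketch. It replaces unsigned 32-bit division by 3 with one multiplication and one shift, and division by 8 with a right shift. The divisor 3, the 32-bit operand width, and the names `MAGIC_DIV3`, `div3`, and `div8` are assumptions chosen for this illustration; they do not come from the paper.

```python
MAGIC_DIV3 = 0xAAAAAAAB  # ceil(2**33 / 3), valid for 32-bit unsigned operands


def div3(n):
    """Divide a 32-bit unsigned integer by 3 using multiply and shift only."""
    return (n * MAGIC_DIV3) >> 33


def div8(n):
    """Divide an unsigned integer by 8 (a power of two) using a right shift."""
    return n >> 3
```

This is why a lower bound that counts division instructions need not survive on a processor with a general ISA: the same result is reachable through cheaper instructions.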
Adversary execution. Most optimality results assume honest execution of $C_{m,t}$ code. An execution is honest if the $C_{m,t}$ code is fixed before it reads any variables or input nonce, and returns correct results for all inputs. Unfortunately, the optimality in honest execution does not necessarily hold in adversarial execution since an adversary can change $C_{m,t}$'s code both before and after receiving the nonce, or simply guess the $C_{nonce}(M, R)$ result without executing any instructions. For example, the adversary can encode a small nonce into immediate address fields of instructions to save register space and instruction execution. More insidiously, an adversary can change $C_{m,t}$'s code and nonce to that of $C'_{m',t'}$ and $nonce'$ where $(C'_{nonce'}, M', R') \neq (C_{nonce}, M, R)$, such that $C'_{nonce'}(M', R') = C_{nonce}(M, R)$ and $t' < t, m' = m$ or $t' = t, m' < m$ or $t' < t, m' < m$. If the adversary can output correct result $C'_{nonce'}(M', R')$ with only low probability over the choices of nonce, we say that result $C_{nonce}(M, R)$ is unpredictable. Otherwise, the adversary wins.
The adversary can also take advantage of the optimal code composition with initialization and I/O programs. For instance, if the input of $C_{m,t}$'s variables and nonce requires multiple packets, the adversary can pre-process input in early packet arrivals and circumvent the lower time and/or space bounds; viz., end of Section III for an example. Also, in a multi-device system, a device can perform part of the computation of another device and help the latter undetectably circumvent its optimal bounds, as illustrated below.
C. Verifier Protocol Atomicity in Adversary Execution
The verifier's protocol begins with the input into the system and ends when the verifier checks the system's output; i.e., result-value correctness and timeliness. Protocol atomicity requires integrity of control flow across the instructions of the verifier's protocol with each system device; i.e., each device controller and the (multi)processor(s) of the chipset. Asynchronous events, such as future-posted interrupts, hardware breakpoints on instruction execution or operand access [39], and inter-processor communication, can violate control-flow integrity outside of $C_{m,t}$'s code execution. For instance, malware instructions in initialization code can post a future interrupt before the verifier's protocol begins execution. The interrupt could trigger after the correct and timely $C_{nonce}(M, R)$ result is sent to the verifier, and its handler could undetectably corrupt the system state [42]. Clearly, optimality of $C_{m,t}$ code is insufficient for control-flow integrity. Nevertheless, it is necessary: otherwise, a predictable $C_{nonce}(M, R)$ result would allow time and space for an interrupt-enabling instruction to be executed undetectably.
Verifiable control flow. Instructions that disable asynchronous events must be executed before the $C_{m,t}$ code. Their execution inside $C_{m,t}$ code would violate optimality bounds, and after $C_{m,t}$ would be ineffective: asynchronous events could trigger during the execution of the last instruction. However, verification that an instruction is located before $C_{m,t}$ code in memory (e.g., via computing digital signatures/MACs over the code) does not guarantee the instruction's execution. The adversary code could simply skip it before executing $C_{m,t}$'s code. Hence, verification must address the apparent cyclic dependency: on the one hand, the execution of the event-disabling instructions before $C_{m,t}$ code requires control-flow integrity, and on the other, control-flow integrity requires the execution of those instructions before $C_{m,t}$ code.
Concurrent-transaction order and duration. Let a system comprise c connected devices, where device i has random access memory $M_i$ and processor register set $R_i$. Assume for the moment that space-time optimal $C_{m_1,t_1}, \ldots, C_{m_c,t_c}$ programs exist and that the control-flow integrity of the verifier's protocol is individually verifiable for each device i. Then the verifier protocol must be transactional: either all $C_{nonce_i}(M_i, R_i)$ result checks pass or the verification fails. In addition, it must prevent two security problems.
First, the protocol must prevent a time gap between the end of $C_{m_i,t_i}$'s execution and the beginning of $C_{m_j,t_j}$'s, $j \neq i$. Otherwise, a time-of-check-to-time-of-use (TOCTTOU) problem arises. A malicious yet-to-be-attested device controller can perform an unmediated peer-to-peer I/O transfer [44], [45] to the registers of an already verified controller, corrupt system state, and then erase its I/O instruction from memory before its attestation begins². This implies that $C_{m_1,t_1}, \ldots, C_{m_c,t_c}$ must execute concurrently³.

Second, the protocol must assure correct execution order and duration of $C_{m_i,t_i}$ programs. That is, the difference between the start times and/or end times of any two programs $C_{m_i,t_i}$ and $C_{m_j,t_j}$ must be small enough so that neither device i nor j can undetectably perform any computation for the other, enabling it to lower its bounds and circumvent attestation. For instance, if the verifier challenges fast device i to start $C_{m_i,t_i}$ a lot later than slower device j to start $C_{m_j,t_j}$, device i can execute some of $C_{m_j,t_j}$'s instructions faster, or even act as an on-board proxy [43], for j. Then device i can undetectably restore its correct $C_{m_i,t_i}$ code before its challenge arrives. Or, if $C_{m_i,t_i}$ ends well before $C_{m_j,t_j}$ ends, malicious device j can act as the verifier and fool attested device i into completing $C_{m_j,t_j}$'s execution faster. (Recall that, even if attested, devices cannot securely authenticate and distinguish unattested-device requests from verifier's requests and deny them.) Note that slower devices can also help faster ones lower their bounds. Nevertheless, the faster $C_{m_i,t_i}$ to slower $C_{m_j,t_j}$ execution order [43] helps ensure that start-time and end-time differences are small enough.
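The two conditions above (all-or-nothing result checks, and small start-time and end-time differences across devices) can be sketched as a single verifier-side predicate. The Python below is a hypothetical illustration only: the record type `DeviceRun`, the function `transaction_ok`, and the tolerance parameter `max_skew` are names introduced here, and the paper does not prescribe how the tolerance is chosen.

```python
from dataclasses import dataclass


@dataclass
class DeviceRun:
    name: str        # device label (illustrative)
    start: float     # time at which the device's challenge started
    end: float       # time at which its result was received
    value_ok: bool   # result-value check passed
    on_time: bool    # result-timeliness check passed


def transaction_ok(runs, max_skew):
    """Accept only if every device passes and start/end times are closely aligned."""
    all_pass = all(r.value_ok and r.on_time for r in runs)
    start_skew = max(r.start for r in runs) - min(r.start for r in runs)
    end_skew = max(r.end for r in runs) - min(r.end for r in runs)
    return all_pass and start_skew <= max_skew and end_skew <= max_skew
```

A single failed check, or one device that starts or ends far from the others, causes the whole transaction to be rejected in this sketch.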
Scalable bounds. Given an optimal $C_{m,t}$ program, one must be able to obtain other optimal $C_{m_i,t_i}$ programs from it, where $m_i > m$, $t_i > t$. Furthermore, given an optimal $C_{m_i,t_i}$ program for a fast device i, one must be able to obtain an optimal program $C_{m_i,t'_i}$ for it, where time bound $t'_i \geq \max(t_j)$, $j = 1, \ldots, c$, independent of $m_i$; e.g., to prevent the on-board proxy attacks above. Neither scaling is obvious.
For example, an intuitive scaling of $C_{m,t}$ to $C_{m_i,t_i}$ might copy $C_{m,t}$ code $k \geq m_i/m$ times in $M_i$ and then challenge the k optimal copies sequentially; and the scaling from
Neither achieves optimal bounds in adversary execution. Consider the second scaling; the first has similar drawbacks. The $k'$ executions $C_{nonce_0}(M_i, R_i), \ldots, C_{nonce_{k'-1}}(M_i, R_i)$ must be linked to avoid exploitable time gaps, as noted above. If linking is done by the verifier, $C_{nonce_j}(M_i, R_i)$'s code cannot end its execution until it inputs the next nonce, $nonce_{j+1}$, from the verifier [43]. Then $C_{m_i,t_i}$ can no longer be optimal, since the variable input-synchronization delays in $C_{m_i,t_i}$ invalidate the optimal $t_i$⁴. If the synchronization buffers $nonce_{j+1}$, optimal $m_i$ also becomes invalid. The alternate linking whereby $nonce_{j+1} = C_{nonce_j}(M_i, R_i)$ is inadequate since nonces are no longer random, or even pseudo-random [39], [63].

Figure 1 summarizes the relationships among requirements for RoT establishment on an untrusted system. The engineering requirements for time-measurement security and a new mechanism that satisfies them are presented in Section VI.
D. Satisfying the requirements -Solution Overview
Individually, the two properties presented below are necessary but insufficient to satisfy the $C_{m,t}$ requirements. However, jointly they do satisfy all of them.

1. k-independent (almost) universal hash functions. The soundness of the verifier's result-value check requires that $C_{m,t}$ is second pre-image free in a one-time evaluation. That is, no adversary can find memory or register words whose contents differ from the verifier's choice and pass its check, except with probability close to a random guess over nonces. Also, inputting the $C_{m,t}$ variables and nonce into an untrusted device must use a small constant amount of storage. k-independent (almost) universal hash functions based on polynomials satisfy both requirements. Their memory size is constant for constant k [12], [54] and they are second pre-image free. We introduce the notion of randomized polynomials to construct such functions for inputs of $d + 1$ words of $\log p$ bits each, independent of k; i.e., degree d polynomials over $\mathbb{Z}_p$ with k-independent, uniformly distributed coefficients; see the Corollary in Section IV-D.

2. Optimal polynomial evaluation. The soundness of the verifier's result-timeliness check requires a stronger property than second pre-image freedom. That is, no computation $C'_{m',t'}$ and $nonce'$ exists such that $C'_{nonce'}(M', R') = C_{nonce}(M, R)$ and either one of its bounds, or both, are lower than $C_{m,t}$'s in a one-time cWRAM evaluation, except with probability close to a random guess over nonces. Concrete space-time optimality of randomized polynomials in adversary evaluation on cWRAM yields this property; viz., Section IV-D. Its proof is ultimately based on a condition under which a Horner-rule program for polynomial evaluation is uniquely optimal in an honest one-time cWRAM evaluation; see Theorem 1 below.
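The intuition behind second pre-image freedom of polynomial-based hash functions is the standard fact that two distinct polynomials of degree at most d over a prime field agree on at most d points, so a uniformly chosen evaluation point exposes a difference except with probability at most d/p. The Python sketch below checks this by exhaustive search over a small field. The prime 101, the coefficient vectors, and the function names are illustrative assumptions, and the sketch uses plain (not randomized) polynomials.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial with coefficients [a_d, ..., a_0] at x, modulo p."""
    acc = 0
    for a in coeffs:
        acc = (acc * x + a) % p
    return acc


def collision_points(c1, c2, p):
    """Return every x in Z_p at which the two polynomials evaluate to the same value."""
    return [x for x in range(p) if poly_eval(c1, x, p) == poly_eval(c2, x, p)]
```

For two distinct degree-3 polynomials over a field of 101 elements, `collision_points` returns at most 3 points, so a random evaluation point yields a collision with probability at most 3/101.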
Why are these combined properties sufficient for RoT establishment? Randomized polynomials enable a verifier to check the integrity of control flow in the code it initializes on an untrusted cWRAM device (Theorem 6). In turn, this helps implement time-measurement security; viz., Section VI. They also assure bounds scalability⁵, which enables the verifier to satisfy the transaction order and duration requirement and leads to the establishment of malware-free states on a multi-device system (Theorem 7). Finally, the verifier uses ordinary universal hash functions to establish RoT in malware-free states (Theorem 8).
III. FOUNDATION: OPTIMAL POLYNOMIAL EVALUATION
In this section we provide the condition under which a Horner-rule program for polynomial-evaluation is uniquely optimal in the concrete WRAM (cWRAM) model, which we use for proving the optimality of randomized-polynomial evaluation in Section IV. We begin with a brief overview of the cWRAM model and illustrate the challenges of proving optimality of universal hash functions in it. A detailed description of cWRAM is in Appendix A.
A. Overview of the cWRAM model
The cWRAM model is a concrete variant of Miltersen's practical RAM model [49]; i.e., it has a constant word length and at most two operands per instruction. It also extends the practical RAM with higher-complexity instructions (e.g., mod, multiplication), as well as I/O instructions, special registers (e.g., for interrupt and device status), and an execution model that accounts for interrupts. The cWRAM includes all known register-to-register, register-to-memory, and branching instructions of real system ISAs, as well as all integer, logic, and shift/rotate computation instructions. In fact, any computation function implemented by a cWRAM instruction is a finite-state transducer; see Appendix A. (The limit of two operands per instruction is convenient, not fundamental: instructions with higher operand arity only complicate optimality proofs.) All cWRAM instructions execute in unit time. However, floating-point instructions are not in cWRAM because, for the same data size, they are typically twice as slow as the corresponding integer instructions in latency-bound computations; i.e., when one instruction depends on the results of the previous one, as in the Horner-rule step below. Thus they cannot lower the concrete space-time bounds of our integer computations. Likewise, non-computation instructions left out of cWRAM are irrelevant for our application.
Like all real processors, the cWRAM has a fixed number of registers with distinguished names and a memory that comprises a finite sequence of words indexed by an integer. Operand addressing in memory is immediate, direct and indirect, and operands comprise words and bit fields.
B. Proving optimality of universal hash functions in cWRAM
The immediate consequence of the constant word length and limit of two single-word operands per instruction is that any instruction-complexity hierarchy based on variable circuit fan-in/fan-out and depth collapses. Hence, lower bounds established in WRAM models with variable word length and number of input operands [1], [49], [54] and in branching-program models [46] are irrelevant in cWRAM. For example, lower bounds for universal hash functions show the necessity of executing multiplication instructions [1], [46]. Not only is this result unusable in cWRAM, but proving the necessity of any instruction is made harder by the requirement of unit-time execution for all instructions.
In contrast, concrete space-time lower bounds of cryptographic hash functions built using circuits with constant fan-in, fan-out, and depth [3], [4] would be relevant to cWRAM computations. However, these bounds would have to hold in adversary execution, which is a significant challenge, as seen in Section II-B. Even if such bounds are eventually found, these constructions allow only bounded adversaries and hence would not satisfy our goal of unconditional security.
Since we use polynomials to construct k-independent (almost) universal hash functions, we must prove their concrete optimality in cWRAM evaluations. However, all concrete optimality results for polynomial evaluation are known only over infinite (e.g., rational) fields [9], and the gap between these bounds and the lower bounds over finite fields (e.g., $\mathbb{Z}_p$) is very large [35]. Furthermore, optimality is obtained using only two operations (i.e., +, ×) and cannot hold in computation models with large instruction sets like the cWRAM and real processors. We address these problems by adopting a complexity measure based on function locality [49], which enables us to distinguish between classes of unit-time computation instructions, and by providing an evaluation condition that extends the unique optimality of Horner's rule to cWRAM.
C. Unique optimality of Horner's rule in cWRAM
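Horner's Rule. Let p be a prime. Polynomial
$$P_d(x) = a_d \times x^d + a_{d-1} \times x^{d-1} + \cdots + a_1 \times x + a_0 \pmod{p}$$
is evaluated by Horner's rule in a finite field of order p as
$$P_d(x) = (\cdots((a_d \times x + a_{d-1}) \times x + a_{d-2}) \times x + \cdots + a_1) \times x + a_0 \pmod{p}.$$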
Horner-rule step and programs. A program that evaluates $a_i \times x + a_{i-1} \pmod{p}$ as a sequence of four instructions integer multiplication (·), mod p, integer addition (add), mod p in cWRAM, or $mod(add(mod(\cdot(a_i, x), p), a_{i-1}), p)$ in infix notation, is called the Horner-rule step. If arithmetic is in mod $2^{w-1}$ where w − 1 bits represent an unsigned integer value of a w-bit word, the Horner-rule step⁶ simplifies to the multiply-add sequence; i.e., $add(\cdot(a_i, x), a_{i-1})$.
A cWRAM loop that executes a Horner-rule step d times to evaluate $P_d(x)$ is a Horner-rule program. Note that there may be multiple encodings of a Horner-rule program that evaluate $P_d(x)$ in the same space and time.
Initialization. If $p < 2^{w-1}$, the initialization of a Horner-rule program is a cWRAM structure of d + 11 storage words comprising 6 instructions (i.e., 4 for a Horner-rule step and 2 for loop control) and d + 5 data words; i.e., d, $a_i$ ($0 \leq i \leq d$), x, p, and output z. If arithmetic is mod $2^{w-1}$, the structure has d + 8 storage words; i.e., 4 instructions and d + 4 words.
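The Horner-rule program and its accounting can be mimicked in Python. The sketch below is not a cWRAM implementation: it simply executes the four-instruction Horner-rule step d times, charges one time unit per instruction plus two units of loop control per iteration (the six-instruction layout of the initialization above), and reports the d + 11 storage-word count for the mod-p case. The function names `horner_program` and `storage_words` are hypothetical.

```python
def horner_program(coeffs, x, p):
    """Evaluate P_d(x) mod p for coeffs = [a_d, ..., a_0]; return (value, time_units)."""
    d = len(coeffs) - 1
    z = coeffs[0]
    time_units = 0
    for i in range(1, d + 1):
        z = z * x           # integer multiplication
        z = z % p           # mod p
        z = z + coeffs[i]   # integer addition
        z = z % p           # mod p
        time_units += 4     # the four-instruction Horner-rule step
        time_units += 2     # two loop-control instructions
    return z, time_units


def storage_words(d):
    """Storage for the mod-p case: 6 instruction words plus d + 5 data words."""
    return 6 + (d + 5)
```

For instance, `horner_program([2, 3, 5], 7, 11)` evaluates $2 \times 7^2 + 3 \times 7 + 5 \pmod{11}$ in 12 time units, matching 6d for d = 2.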
One-time, honest evaluation. Polynomial $P_d(x)$ is evaluated one time if nothing is known about coefficients $a_i$ and input x before the evaluation starts; i.e., $a_i$ and x are variables. The evaluation of $P_d(x)$ is honest if its program code is fixed before constants are assigned to variables $a_i$ and x (i.e., before constants are input and initialized in cWRAM memory) and returns correct results for all inputs. In a dishonest (e.g., adversarial) evaluation, program code can be changed after x or any $a_i$ becomes known; e.g., if x = 0, $P_d(0) = a_0$ can be output without code execution.

Theorem 1. Let w > 3 be an integer, $2 < p < 2^{w-1}$ a prime, and
$$P_d(x) = \sum_{i=0}^{d} a_i \times x^i \pmod{p}$$
a degree-d polynomial over $\mathbb{Z}_p$.
The honest one-time evaluation of $P_d(x)$ by a Horner-rule program is uniquely space-time optimal whenever the cWRAM execution time and memory are simultaneously minimized; i.e., no other programs can use fewer than both d + 11 storage words and 6d time units after initialization.
The proofs of this theorem and of all others are in Appendix B of the technical report [26]. Briefly, since a Horner-rule program provides the upper bounds, we only need to prove the lower bounds that match them in cWRAM. To prove the lower bounds, we use finite field properties, linear polynomials over $\mathbb{Z}_p$, locality-based cWRAM instruction complexity, and the two-operand per instruction limit. First, we show that a four-instruction Horner-rule step is optimal when the cWRAM evaluation space and time for linear polynomials are simultaneously minimized. Then, we use the facts that the evaluation is one-time and honest to show that a Horner-rule step is uniquely optimal. Finally, we define a permutation polynomial of degree d as a special composition of linear polynomials, and show that its evaluation requires a unique two-instruction loop-control sequence that must iterate d times over the Horner-rule step.
A similar proof holds over $\mathbb{F}_q$ when q > 2 is a prime power. To illustrate, we outline it for the important case $q = 2^{w-1}$. Here the Horner-rule program needs only d + 8 words and 4d time units after initialization.
Theorem 1 answers A. M. Ostrowski's 1954 questions regarding the optimality of Horner's rule [9] in a realistic model of computation. However, both bounds t = 6d and m = d + 11 depend on d, and thus t cannot scale independently of m. If t needs to be large, d becomes large. Hence not all d + 1 coefficients of $P_d$ could always be input at the same time; e.g., in one packet. This would enable an adversary's code to pre-process the coefficients that arrive early and circumvent the optimal bounds; e.g., with pre-processing, the lower bound for
IV. RANDOMIZED POLYNOMIALS AND MALWARE FREEDOM
In this section we define a family of randomized polynomials, prove their space-time optimality in adversary evaluation on cWRAM (Theorem 5), and show that they have stronger collision-freedom properties than k-independent (almost) universal hash functions in cWRAM (Corollary). These properties enable the verifier to establish control-flow integrity on a single device (Theorem 6), and scale bounds for correct transaction order and duration in a multi-device untrusted system. This helps establish malware-free states (Theorem 7).
A. Randomized Polynomials -Definition
Let p be prime and d > 0, k > 1 integers. A degree-d polynomial over Z p with k-independent (e.g., [12] ), uniformly distributed coefficients
If v d , . . . , v 0 ∈ Z p are constants independent of s i and x, and ⊕ is the bitwise exclusive-or operation, then polynomial
is called the padded 8 randomized polynomial.
Each padding constant v i will be used to represent the least significant log p bits of a memory word i or of a special processor-state register, whereas the k random numbers (which generate the s i ) will fill the least significant log p bits of all general-purpose processor registers; e.g., see the device initialization in Section IV-E1 below. Theorem 2 below shows that H d,k (·) is second pre-image free, has uniform output, and is k-independent. Everywhere below, $ ← − denotes a uniform random sample.
7 Our notion of randomized polynomial differs from Tarui's [69] as we cannot input variable numbers (i.e., d + 1) of random coefficients.
8 Of course, other padding schemes not based on the ⊕ operation exist, which preserve the k-wise independence and uniform distribution of the padded coefficients.
Below we define the k-independent uniform elements s i for a family of randomized polynomials H in the traditional way [12] , [74] . We use family H in the rest of this paper.
Family H. Let p > 2 be a prime and r j , x 
where v i ⊕ s i is represented by a mod p integer.
Note that H d,k,x (·) ∈ H has properties 1 and 2 of H d,k (·) in Theorem 2 in a one-time evaluation on x $ ← − Z p . The proof of its k-independence is similar to that of part 3.
Notation. For the balance of this paper, p is the largest prime less than 2 w−1 , w > 4. The choices made for the
B. Code optimality in honest evaluation
In this section, we prove the optimal space-time bounds in an honest one-time evaluation of H d,k,x (·). The only reason we do this is to set the bounds an adversary must aim to beat.
Let Horner(H d,k,x (·)) denote a Horner-rule program for the honest one-time evaluation of H d,k,x (·) ∈ H on input string v. That is, Horner(H d,k,x (·)) is implemented by a nested cWRAM loop using the recursive formula
Upper bounds. We show that the upper bounds are m = k + 22 storage words and t = (6k − 4)6d execution time units in cWRAM, after variable initialization. By Theorem 1, the inner and the outer loops of Horner(H d,k,x (·)) can be implemented by 6 instructions each. For each coefficient, v i ⊕ s i , 2 instructions are sufficient whenever word indexing in v is sequential; i.e., an addition for indexing in v and an exclusive-or. The addition is sufficient when d + 1 ≤ |v|, where |v| is the number of words comprising memory M and the special processor registers.
By the definition of family H, the operands of these instructions are evident; i.e., k + 8 data words comprising the
and v i 's word index in v. Thus k + 8 data words and 14 instruction words, or k+22 (general-purpose processor register and memory) words, is Horner(H d,k,x (·))'s space bound.
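The upper bounds stated above reduce to simple arithmetic on k and d. The sketch below, with hypothetical function names, only restates m = (k + 8) + 14 = k + 22 and t = (6k − 4)6d so that concrete values can be computed for a given register count and polynomial degree.

```python
def horner_space_bound(k):
    """Space bound in words: k + 8 data words plus 14 instruction words."""
    data_words = k + 8
    instruction_words = 14
    return data_words + instruction_words  # equals k + 22


def horner_time_bound(k, d):
    """Time bound in cWRAM time units after initialization: (6k - 4) * 6 * d."""
    return (6 * k - 4) * 6 * d
```

For instance, with k = 16 registers the space bound is 38 words, and the time bound grows linearly in d with slope (6·16 − 4)·6 = 552 time units per unit of degree.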
Lower bounds. The upper space-time bounds of H d,k,x (·) are unaffected by the excess memory and register space required by the programs for processor-state (i.e., special processor register) initialization, I/O, and general-purpose register initialization (Init) in cWRAM; see Section IV-E1. However, excess space prevents us from using Theorem 1 to prove the lower bounds since the execution space is no longer minimized. To avoid this technical problem, we assume these programs are space-optimal and memory M contains only the additional k + 22 words. We also take advantage of the fact that an honest program does not surreptitiously modify the settings of the special processor registers after its code is committed. The above assumption is only used to simplify the concrete-optimality proof for the honest evaluation of H d,k,x (·). It is unnecessary for the optimality proof of Horner(H d,k,x (·)) code in adversarial evaluation; see Section IV-C. There we use the collision-freedom properties of H d,k,x (·) in cWRAM (e.g., Corollary, Section IV-D) and its uniform distribution of output, which we can avoid here thanks to the assumption made.
Theorem 3 (Optimality in Honest Evaluation). Let M comprise space-optimal processor-state initialization, I/O, and Init code, and k + 22 words. The honest one-time evaluation of H d,k,x (·) on v by Horner(H d,k,x (·)) is optimal whenever the cWRAM execution time and memory are simultaneously minimized; i.e., no other programs can use both fewer than k + 22 storage words and (6k − 4)6d time units after initialization.
The proof of this theorem follows from Theorem 1, k-independence, and honest one-time evaluation.
Fig. 2: Adversary goal and strategy space. (Figure not reproduced; it relates Horner(H d,k,x (v)), with space = k + 22 and time = (6k − 4)6d, to the adversary's goal (m, t) < (k + 22, (6k − 4)6d) and to the adversary's options: predicts, evaluates, guesses.)
C. Code optimality in adversary evaluation
Adversary Goal. By Theorem 3, the adversary's goal is to output H d,k,x (v) using only m words of storage and time t such that at least one of the lower bounds is lowered; i.e., m < k + 22 and t = (6k − 4)6d, or m = k + 22 and t < (6k − 4)6d, or m < k + 22 and t < (6k − 4)6d. We denote this goal by (m, t) < (k + 22, (6k − 4)6d).
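The relation (m, t) < (k + 22, (6k − 4)6d) is a partial-order comparison rather than a lexicographic one: neither component may exceed its bound, and at least one must be strictly smaller. A minimal Python sketch of this predicate follows; the function name is hypothetical and the code only restates the three cases listed in the adversary goal.

```python
def beats_bounds(m, t, k, d):
    """Return True iff (m, t) < (k + 22, (6k - 4)6d) in the sense of the adversary goal."""
    m_opt = k + 22
    t_opt = (6 * k - 4) * 6 * d
    within = m <= m_opt and t <= t_opt      # neither bound is exceeded
    strictly = m < m_opt or t < t_opt       # at least one bound is lowered
    return within and strictly
```

Under this reading, saving one word of storage at the cost of even one extra time unit does not meet the adversary's goal.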
Strategy Space. We partition the adversary's strategy space into mutually exclusive cases 1 -3 below, which s/he can select at no cost, and bound the probability of success in each case. These cases are summarized in Figure 2 .
← − H and v; i.e., the prediction is a constant relative to the random choices made in
Hence, the probability of adversary's success in a one-time evaluation within bounds (m, t) is is predictable: either the adversary executes at least one Horner-rule step (i.e., at least one outer loop execution) and then outputs the result prediction or s/he executes an entirely different instruction sequence in (m, t).
2. Horner(H d,k,x (v)) is unpredictable. In this case, the adversary does not execute any Horner-rule step. Instead, s/he chooses a sequence of cWRAM instructions which inputs at least a y ∈ Z p that depends on nonce H d,k,x $ ← − H, executes the sequence, and outputs its result in Z p . That is, the chosen sequence evaluates a function f H (·) : Z p → Z p on an input y and outputs f H (y) in (m, t). Its instructions may read and write multiple values in Z p ; e.g., they may read and modify the values of the general-purpose processor registers, and/or those of v. Since H d,k,x (v) is unknown before f H (y) is output, the adversary's success depends on whether f H (y) = H d,k,x (v).
Note that the execution of any instruction sequence with input and output in Z p represents the evaluation of a unique polynomial Q d (·) of degree d ≤ p − 1 on some input y over Z p . This follows from a well-known fact that establishes the one-to-one correspondence between functions and polynomials Q d (·) in finite fields 9 . Hence, the adversary can always find a pair (Q d (·), y) ≠ (H d,k,x (·), v) whose cWRAM evaluation has desired bounds (m, t) < (k + 22, (6k − 4)6d). To upper bound the probability of adversary's success, we write
are the same values used to generate
denote the adversary's choice of polynomial, input v , and evaluation result output in (m, t). We denote event [S,
. Lemma 4 bounds the adversary's probability of success, P r[S : 
D. Collision Freedom of H in cWRAM
The corollary below shows not only that H is a family of k-independent (almost) universal hash functions, but also that an adversary is unable to find a function in Z p whose one-time cWRAM evaluation on an input y collides with H d,k,x (v) within bounds (m, t) < (k + 22, (6k − 4)6d).
Corollary.
1. H is a k-independent (almost) universal hash function family. (m, t) < (k + 22, (6k − 4)6d) . For a given one-time
Let
Part 1 follows by a proof similar to that of Lemma 4, and the k-independence follows along the same lines as the proof of part 3 of Theorem 2. Part 2 follows directly from Theorem 5.
E. Device Initialization and Atomicity of Verifier's Protocol
1) Device Initialization: Upon system boot, the verifier requests each device's boot loader (e.g., akin to U-Boot in Section VII) to initialize the device memory with the verifier's chosen content, as described in steps (i) - (v) below, and then transfer control to the first instruction of the processor-state initialization program. The boot loaders may not contain all and only the verifier's chosen code, and hence are untrusted.
i) Processor-state initialization. This is a straight-line program that accesses special processor registers to:
• disable all asynchronous events; e.g., interrupts, traps, breakpoints;
• disable/clear caches, disable virtual memory, TLBs 10 , and power off/disable stateless devices;
• set all remaining state registers to chosen values; e.g., clock frequency, I/O registers.
When execution ends, the Input program follows in straight line.
ii) Input/Output programs. The Input program busy-waits on the verifier's channel device for input. Once nonce H d,k,x arrives, the Init program follows in straight line. The Output program sends result H d,k,x (v) to the verifier after which it returns to busy-waiting in the boot loader for further verifier input.
iii) Init program. This is a straight-line program that loads the k random values of nonce H d,k,x into the general-purpose processor registers so that no register is left unused; e.g., if 16 registers are available, k = 16. Its execution time, t 0 , is constant since k is constant. When execution ends, the Horner(H d,k,x (·)) program follows in straight line.
iv) Horner(H d,k,x (·)) program. This comprises 14 instructions whenever the address space is linear in physical memory. When execution ends, the Output program follows in straight line and outputs H d,k,x (v).
v) Unused-memory initialization. After the initialization steps (i) -(iv) are performed, the rest of the memory M is filled with verifier's choice of constants.
2) Control Flow Integrity: Recall that the verifier's protocol begins with the input of nonce H d,k,x into a device and ends when the verifier checks the device's output.
3) Concurrent-transaction order and duration: Let a system comprise c devices with the smallest word size of w bits and p < 2^{w−1}. Let the verifier request an untrusted boot loader to initialize device i with chosen content v i as described in Section IV-E1. Then Init i initializes the k i general-purpose registers on device i in constant time t 0i . If t 0j + t j is the slowest Horner(H dj,kj,xj (·)) execution time on any device's v j , then the verifier selects values of d i , k i for the other device nonces H di,ki,xi such that t i = t 0i + (6k i − 4)6d i equals t 0j + t j , or exceeds it by a very small amount, to satisfy the duration requirement. Then the verifier's chosen concurrent-transaction order can assure that the start times and end times do not allow malicious devices to circumvent lower bounds.
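The duration requirement can be illustrated by a small calculation: given the slowest device's total time t 0j + t j as a target, the verifier picks for each other device the smallest degree d i whose total time t 0i + (6k i − 4)6d i reaches the target. The Python sketch below makes the simplifying assumption that all times are expressed in a common time unit; the function names and sample values are hypothetical.

```python
def total_time(k, d, t0):
    """Init time t0 plus the Horner-rule time (6k - 4) * 6 * d."""
    return t0 + (6 * k - 4) * 6 * d


def degree_for_duration(k, t0, target):
    """Smallest d >= 1 such that total_time(k, d, t0) >= target."""
    step = (6 * k - 4) * 6              # time added per unit of degree
    remaining = target - t0
    d = -(-remaining // step)           # integer ceiling division
    return max(1, d)
```

With k = 16, t0 = 100 and a target of 10,000 time units, this selects d = 18, giving a total of 10,036 time units, which exceeds the target by less than one Horner-rule step.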
Theorem 7 (Malware-free state). Let a verifier initialize an untrusted c-device system to
, where c is small; i.e., 10c << p. Then the verifier challenges the devices concurrently in transaction order, with device i receiving nonce H di,ki,xi whose t i satisfies the duration requirement. If the verifier receives result H di,ki,xi (v i ) at time t i for all i, the probability that malware survives in a system state is at most 9c/p. If the verifier runs the protocol n times, the malware-survival probability becomes negligible in n; i.e., (n) = [
The proof follows directly from the concurrent transaction order and duration property of the verifier's protocol, Theorem 6, and Lemma 4.
Example. For w = 32 and w = 64 bits, the largest primes p < 2^{w−1} are 2^{31} − 1 and 2^{63} − 25. In practice c ≤ 16 as we rarely encounter commodity computer systems configured with more than eight CPU cores and eight peripheral-device controllers whose non-volatile memories can be re-flashed with code 11 . For w = 32 (w = 64), the probability of malware survival for n = 1 is less than 2^{−23} (2^{−55}), for n = 2 is less than 2^{−46} (2^{−110}), etc. Hence, n ≤ 2 is sufficient, in practice.
11 Although GPUs have many cores, GPU malware cannot persist, as it cannot survive GPU power-offs/reboots [62] by processor-state initialization.
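The figures in this example can be checked numerically from the bound 9c/p of Theorem 7, raised to the n-th power for n protocol runs. The following Python sketch, with a hypothetical function name, returns the base-2 logarithm of that bound so it can be compared directly with the exponents quoted above.

```python
import math


def survival_bound_log2(c, p, n=1):
    """log2 of the malware-survival bound (9c/p)^n for c devices and n protocol runs."""
    return n * math.log2(9 * c / p)


P32 = 2**31 - 1    # largest prime below 2^31
P64 = 2**63 - 25   # largest prime below 2^63
```

For c = 16, the values are roughly −23.8 and −55.8 for n = 1, which is consistent with the stated bounds of 2^{−23} and 2^{−55}.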
V. UNCONDITIONAL ROOT OF TRUST ESTABLISHMENT
Theorem 7 establishes a malware-free, multi-device system state. However, this is insufficient to establish RoT. While the general-purpose registers contain w-bit representations of the k random numbers, the memory and special processor registers of a device comprise w-bit words, rather than the log p-bit fields of the words v^i_{d_i}, . . . , v^i_0, where p < 2^{w−1} is the largest prime. Hence, a sliver of unaccounted-for content exists.
To establish RoT, the verifier can load a word-oriented (almost) universal hash function in each malware-free device memory and verify the results they return after application to memory and special processor register content. Note that space-time optimality of these hash functions and verifier's protocol atomicity are unnecessary, since malware-freedom is already established. A pairwise protocol between the verifier and device i that checks device memory and special register content is sufficient. Let H w be such a family and V comprise the set of w-bit words of a device's memory and special processor registers.
Fact (e.g., Exercise 4.4 [72] ). Let q > 2 w be a prime, |V | = q/2 w , and a, b, c $ ← − Z q be the function index of family H w , where
is a family of almost universal hash functions, with collision probability of 2^{−(w−1)}. The probability is computed over the
Theorem 8 (RoT Establishment). Let a verifier establish a malware-free state of a c-device system in n protocol runs, as specified in Theorem 7. Then let the verifier load H ai,bi,ci (·) $ ← − H w on device i and check each result H ai,bi,ci (M i ) received. If all checks pass, the verifier establishes RoT with probability at least
n ; e.g., higher than 1 − 10c/p for n = 1.
The proof is immediate by Theorem 7 and the Fact above.
Implementation considerations of the cWRAM model in real processors for suitable choices of prime p are discussed in Appendix C of the technical report [26] .
Secure Initial State. After the verifier establishes RoT, it can load a trustworthy program in the system's primary memory. That program sets the contents of all secondary storage to the verifier's choice of content; i.e., content that satisfies whatever security invariants are deemed necessary. This extends the notion of the secure initial state to all system objects.
VI. TIME-MEASUREMENT SECURITY
Past software-based attestation designs fail to assure that a verifier's time measurements cannot be bypassed by an adversary. For example, to account for cache, TLB, and clock jitter caused primarily by pseudo-random memory traversals by C m,t (·) computations and large t, typical verifiers' measurements build in some slack time; e.g., 0.2% to 2.6% of t [39] , [42] , [43] , [63] . An adversary can easily exploit the slack time to undetectably corrupt C m,t (·)'s memory [39] , [42] . In this section we show how to counter these threats.
A. Verifier Channel
The verifier's local channel must satisfy two common-sense requirements. First, the channel connection to any device must not pass through a peripheral device controller that requires RoT establishment. Otherwise, malware on the controller could pre-process some of the computation steps for the verifier's protocol with that device and help it to circumvent the time measurements. Second, the channel's delay and its variation must be small enough so that the verifier's time measurements can reliably detect all surreptitious untrusted-system communication with external devices and prevent both memory-copy [42] and remote-proxy [43] attacks.
We envision a verifier device to be attached to the main system bus via a DMA interface, similar in spirit to that of Intel's Manageability Engine or AMD's Platform Security Processor, but without flaws that would enable an attacker to plant malware in it [52] . These processors can operate independently of all other system components; e.g., even when all other components are powered down [67] . The external verifier could also run on a co-processor connected to the main system bus, similar in spirit to Ki-Mon ARM [41] . In both cases, the verifier would have direct access to all components of the system state. An advantage of such verifiers is that their communication latency and variation of the local channel are imperceptible in contrast with the adversary's network channel.
B. Eliminating Cache and TLB jitter
To perform deterministic time measurement, it is necessary to eliminate cache/TLB jitter and interprocessor interference, and avoid clock jitter in long-latency computations.
Preventing Cache, Virtual Memory, and TLB use. In contrast with traditional software-based attestation checksums (e.g., [39] , [42] , [63] , [64] ), the execution-time measurement of Horner(H d,k,x (v)) is deterministic. Most modern processors, such as the TI DM3730 ARM Cortex-A8 [6] , include cache and virtual-memory disabling instructions. Hence, processor-state initialization can disable caches, virtual memory, and the TLB verifiably (by Theorem 6). In addition, the Horner-rule step is inherently sequential and hence unaffected by pipelining or SIMD execution. The only instructions whose execution could be overlapped with Horner-rule steps are the two loop control instructions, and the corresponding timing is easily accounted for in the verifier's timing check.
Preventing Cache pre-fetching. In systems where caches cannot be disabled, the inherent sequentiality of Horner(H d,k,x (v)) code and the known locality of the instruction and operand references help assure that its execution-time measurements are deterministic. However, the adversary's untrusted boot loader could perform undetected cache pre-fetches before the verifier's protocol starts, by selectively referencing memory areas, and obtain better timing measurements than the verifier's; viz., Section VII. To prevent pre-fetching attacks, the processor-state initialization can clear caches verifiably (by Theorem 6), so that Init and Horner(H d,k,x (v)) code can commence execution with clean caches. Hence, cache jitter can be prevented.
Alternately, the verifier's processor-state initialization could warm up caches [63] by verifiable pre-fetching. Nevertheless, verifiable cache clearing is often required; e.g., in ARM processors instruction and data caches are not hardware synchronized, and hence they have to be cleared to avoid malware attacks [42] . Furthermore, cache anomalies may occur for some computations where a cache miss may result in a shorter execution time than a cache hit because of pipeline scheduling effects [19] . This makes cache clearing a safer alternative.
C. Handling clock jitter and inter-processor interference
When Horner(H d,k,x (v)) executes in large memories it can have large latencies; e.g., several minutes. These may experience small time-measurement variations in some systems due to uncorrected random clock jitter at high frequencies [68] , and multi-processor interference in memory accesses. These timing anomalies are typically addressed in embedded real-time systems [19] . For such systems, we use a random sequential protocol. This protocol leverages smaller memory segments and the verifiable choice of clock-frequency setting such that random clock jitter becomes unmeasurable by an adversary. It also ensures that different processors access different memory segments to eliminate inter-processor interference. The protocol also provides an alternate type of bounds scaling.
Random Sequential Evaluation. Let F = {f 1 , f 2 , . . . , f n } be a family of n functions and
. . , N , be identifiers of their random invocations. f K1 , f K2 , . . . , f K N are evaluated on inputs x 1 , x 2 , . . . , x N , and ⊥ denotes the event that an invalid result is returned by a function evaluation. The protocol for the random sequential evaluation of F , namely
The evaluation terminates correctly if f Ki (x i ) ≠ ⊥ for all i, and incorrectly otherwise.
Condition 1) implies that the evaluation invokes all randomly selected functions with high probability at least once [20] , [63] . Condition 2) defines the sequential evaluation rule. Condition 3) implies that the j-th function evaluation is independent from the previous i < j evaluations.
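To give a feel for the coverage property behind Condition 1), the sketch below computes the exact probability that N random invocations select each of n functions at least once. It assumes, for illustration only, that the identifiers are drawn uniformly and independently; the function name is hypothetical, and the formula is the standard inclusion-exclusion count rather than anything specific to this protocol.

```python
from fractions import Fraction
from math import comb


def coverage_probability(n, N):
    """Probability that N uniform, independent draws from n items hit every item."""
    total = Fraction(0)
    for j in range(n + 1):
        # Inclusion-exclusion over the number j of items that are never drawn.
        total += (-1) ** j * comb(n, j) * Fraction(n - j, n) ** N
    return total
```

The probability is zero when N < n and approaches one as N grows beyond roughly n log n, which is the usual coupon-collector behavior.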
Verifier Initialization. Let the verifier request the boot loader to initialize M to n memory segments, each comprising processor-state initialization, I/O, Init, and Horner(H d,k,x (·)) programs. Then the verifier's boot loader transfers control to the first instruction of the processor-state initialization program.
Verifier Protocol. Let F be family H, f Ki be
, and H di,ki,xi (·) $ ← − H; i.e., the random selection of a memory segment 12 . If the random sequential evaluation protocol terminates incorrectly or the termination is untimely, or both, the verifier rejects. Otherwise, the verifier accepts. This is the verifier's protocol for the n-segment memory model.
Specifically, the verifier writes the values denoting the choice of H di,ki,xi (·) $ ← − H separately to each of the n memory segments. Furthermore, the verifier's Output code is modified so that it returns to the Input busy-waiting code after outputting an evaluation result, which transfers to the first instruction of the Input code of the next randomly chosen segment. The address of the next segment's Input code is provided by the verifier along with the next nonce H di,ki,xi (·)
In a multiprocessor system where t processors share RAM memory M , the Init programs would start the concurrent execution of all t processors in different memory segments along with those of the device controllers.
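The accept/reject rule of the verifier's protocol for segmented memory can be sketched as a simple loop. Everything named here is hypothetical: `evaluate` stands in for the device's response to a challenge, `expected_result` for the verifier's own computation of the result, and Python's `secrets` module stands in for the external source of true random numbers that the design calls for. A result of `None` models the invalid result ⊥.

```python
import secrets


def run_protocol(n_segments, n_rounds, evaluate, expected_result, time_bound):
    """Challenge randomly chosen segments in sequence; return True iff the verifier accepts."""
    for _ in range(n_rounds):
        segment = secrets.randbelow(n_segments)   # random choice of memory segment
        nonce = secrets.randbits(31)              # stand-in for the random nonce
        result, elapsed = evaluate(segment, nonce)
        if result is None:                        # invalid result: incorrect termination
            return False
        if result != expected_result(segment, nonce):
            return False
        if elapsed > time_bound:                  # untimely termination
            return False
    return True
```

The loop rejects on the first incorrect or untimely response and accepts only if every round completes correctly within the time bound.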
Theorem 9 (Malware-free Segmented Memory). Let a verifier initialize memory M of a (e.g., multiprocessor) device to n segments and perform the verifier's protocol for the segmented memory. If the verifier accepts the result, the device state is malware-free, except with probability at most
VII. PERFORMANCE
In this section, we present preliminary performance measurements for the Horner-rule evaluation of randomized polynomials. The only goal here is to illustrate implementation practicality on a commodity hardware platform. For this reason, we compare these measurements to those of Pioneer, the best-known attestation checksum [63] , on the same hardware configuration [42] . Presenting a study of randomized-polynomial performance is beyond the scope of this paper.
Our measurements also illustrate the importance of provably clearing (or disabling, when possible) caches for deterministic time measurements. We noticed no timing anomalies due to uncorrected clock jitter in our single-processor configuration for a fairly large memory. This suggests that the random sequential evaluation for large memories (Section VI) may be useful primarily to prevent inter-processor interference.
Hardware. Our measurements were done on a Gumstix Overo FireSTORM-P Computer-On-Module (COM), which is the ARM-based development platform for embedded hardware used by Li et al. [42] . This gives us an opportunity to compare the performance of Horner's rule for randomized polynomials with that of the Pioneer checksum. This platform features a 1GHz Texas Instruments DM3730 ARM Cortex-A8 32-bit processor and 512MB of DDR SDRAM [70] . The processor has a 32KB L1 instruction cache and a 32KB L1 data cache, both with 16 bytes per cache line. In addition, it also features a 256KB L2 unified cache [6] .
Recall that the parameter |M | must reflect the total amount of primary storage in the device. Besides the 512MB of SDRAM, our particular Gumstix also features 64KB of SRAM and a large address space for device control registers with 5,548 registers. Summing these up as bits, we set |M | to 4,295,669,120.
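The value of |M | can be reproduced by adding the three storage areas in bits. The sketch below assumes binary prefixes (1MB = 2^20 bytes, 1KB = 2^10 bytes) and 32-bit device control registers; the register width is an assumption that is not stated explicitly above, chosen because it matches the 32-bit processor and yields the reported total.

```python
SDRAM_BYTES = 512 * 2**20       # 512MB of DDR SDRAM
SRAM_BYTES = 64 * 2**10         # 64KB of SRAM
CONTROL_REGISTERS = 5548        # device control registers
REGISTER_BITS = 32              # assumed register width


def total_memory_bits():
    """Total primary storage |M| in bits: SDRAM + SRAM + control registers."""
    return (SDRAM_BYTES + SRAM_BYTES) * 8 + CONTROL_REGISTERS * REGISTER_BITS
```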
Software. Our measurements are implemented inside a popular secondary boot loader known as U-Boot, which in a typical application would be responsible for loading the Linux kernel on the COM. For our purpose, however, we extend U-Boot with measurement capabilities; i.e., U-Boot 2015.04-rc4 is cross-compiled with Linaro gcc 4.7.3.
We implemented Horner's rule for several polynomials in Z p , where p = 2^{32} − 5 is the largest prime that can fit inside a 32-bit register. Since the DM3730 ARM Cortex-A8 CPU does not support the udiv (unsigned integer division) instruction, gcc uses the __aeabi_uidivmod function to emulate division, which is slower than the hardware udiv instruction followed by the mls (integer multiply-and-subtract) instruction to compute the modulus. Nevertheless, an adversary cannot change the emulation since the code image is committed by the second pre-image freedom of randomized polynomials.
The first Horner-rule measurement is for ordinary polynomials; i.e., with constant, rather than k-wise independent, coefficients. This establishes the baseline, which helps calibrate the expected performance loss for increasing the values of k. The running time of Horner's rule for a single polynomial of degree 128M covering the entire SDRAM is 11,739 ms.
For the measurements of Horner-rule evaluation of randomized polynomials, the k random numbers are stored contiguously in memory. For values of k that match one cache line, namely k = 4, evaluating a polynomial of degree d = 128M (same as the baseline) takes 67,769 ms due to extra memory accesses and added cache contention. However, most modern processors have more than k = 4 and fewer than k = 64 registers. Hence, larger values of k would have to be used to ensure that the adversary cannot be left with spare processor registers after loading the k random numbers.
Randomized Polynomials vs. Pioneer Checksum. The timing for k = 64 and d = 10M is 54,578 ms. For the baseline d = 128M the running time is close to 700 seconds, which is about 6% faster than the fastest Pioneer checksum (745.0 seconds), 8.7% faster than the average (765.4 seconds), and 9% faster than the slowest (768.1 seconds) reported by Li et al. [42] on the same hardware configuration. While these measurements illustrate practical usefulness, additional measurements are necessary for a complete performance study; e.g., additional hardware platforms and configurations.
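The comparison above can be retraced with a short calculation. The sketch assumes that running time scales linearly with the degree d for a fixed k, in line with the (6k − 4)6d time bound, so that the d = 10M measurement can be extrapolated to d = 128M; the function names are hypothetical and the Pioneer figures are those quoted above.

```python
def extrapolate_seconds(measured_ms, measured_degree, target_degree):
    """Scale a measured time linearly from one degree to another; result in seconds."""
    return measured_ms / 1000.0 * target_degree / measured_degree


def percent_faster(ours_s, theirs_s):
    """Relative time saving of ours_s with respect to theirs_s, in percent."""
    return 100.0 * (1.0 - ours_s / theirs_s)


# k = 64: 54,578 ms at d = 10M, extrapolated to d = 128M (about 698.6 seconds).
HORNER_128M_S = extrapolate_seconds(54578, 10, 128)
```

Under this assumption the extrapolated time is about 698.6 seconds, which is roughly 6.2%, 8.7%, and 9.0% below the fastest, average, and slowest Pioneer times, respectively.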
Why Disable or Clear Caches? Instruction and data caches on the DM3730 ARM Cortex-A8 can be disabled and enabled individually, using single instructions. We used this feature to illustrate the inferior cache utilization compared to an adversary's cache pre-fetching strategy. With only the instruction (data) cache turned off, we observed a 5.15x (23.76x) slowdown in Horner-rule evaluation. With both caches turned off, the slowdown increases to 53.13x. This shows that the adversary can gain a real advantage by cache pre-fetching.
VIII. RELATED WORK

A. Past Attestation Protocols
Past attestation protocols, whether software-based [7] , [33] , [39] , [63] , [64] , [66] , cryptographic-based [8] , [18] , [21] , [31] , [38] , [53] , or hybrid [43] , [78] , have different security goals than those of RoT requirements defined here: some are weaker and some are stronger. For example, whether these protocols are used for single or multiple devices, they typically aim to verify a weaker property, namely the integrity of software -not system -state. However, they also satisfy a stronger property: in all cryptographic and hybrid attestation protocols verification can be remote and can be repeated after boot, rather than local and limited to pre-boot time.
Given their different goals, it is unsurprising that past attestation protocols fail to satisfy some RoT establishment requirements defined in Section II even for bounded adversaries and secret-key protection in trusted hardware modules. For example, these protocols need not be concerned with the content of system registers (e.g., general processor and I/O registers), since they cannot contain executable code. Furthermore, they need not satisfy the concurrent-transaction order and duration requirements (see Section II-C) of the verifier's protocol since they need not establish any system state properties, such as secure initial state in multi-device systems. Finally, none of these protocols aims to satisfy security properties provably and unconditionally. Beyond these common differences, past protocols exhibit some specific differences.
Software-based attestation. Some applications in which software-based attestation can be beneficially used do not require control-flow integrity [58] , and naturally this requirement is not always satisfied [11] , [42] . A more subtle challenge arises if one uses traditional checksum designs with a fixed time bound in a multi-device system since scalable time bounds are important. As shown in Section II-C, these checksums cannot scale time bounds by repeated checksum invocation with different nonces and retain optimality. Software-based attestation models [7] , [33] also face this challenge.
Despite their differences from RoT establishment, software-based attestation designs met their goals [63] , [64] , and offered deep insights on how to detect malware on peripheral controllers [43] , embedded devices [11] , [42] , mobile phones [33] , and special processors; e.g., TPMs [39] .
Cryptographic attestation. Cryptographic protocols for remote attestation typically require a trusted hardware module in each device, which can be as simple as a ROM module [38] , to protect a secret key for computing digital signatures or MACs. If used in RoT establishment, the signature or MAC computations must verifiably establish control-flow integrity. Otherwise, control-flow vulnerabilities similar to those of software-based attestation would arise. Furthermore, the trusted hardware module must protect both the secret key and the signature/MAC generation code.
More importantly, cryptographic attestation relocates the root of trust to the third parties who install the cryptographic keys in each device controller and those who distribute them to verifiers. However, the trustworthiness of these parties can be uncertain; e.g., a peripheral-controller supplier operating in jurisdictions that can compel the disclosure of secrets could not guarantee the secrecy of the protected cryptographic key. Similarly, the integrity of the distribution channel for the signature-verification certificate established between the device supplier/integrator and verifier can be compromised, which enables known attacks; e.g., see the Cuckoo attack [55] . Thus, these protocols can offer only conditional security.
Nevertheless, if the risk added when third parties manage one's system secrets is acceptable and protocol atomicity requirements are met, then cryptographic protocols for remote attestation could be used in RoT establishment.
B. Polynomial Evaluation
If the only operations allowed for polynomial evaluation are addition and multiplication, the Horner-rule bound of 2d operations for degree-d polynomials was shown to be uniquely optimal in one-time evaluations [9] , [61] . However, this bound does not hold in finite fields, where the minimum number of modular additions and multiplications is Ω(√(d + 1)) [35] . Furthermore, these bounds do not hold in any WRAM models or any real computer where many more operations are implemented by the ISA.
For WRAM models with variable word widths, polynomial-evaluation lower bounds are typically obtained in the cell probe model. Here the polynomial is assumed to be already initialized in memory. The evaluation consists of reading (probing) a number of cells in memory, and after all read operations are finished, it must output the result. The cell probed by each read operation may be any function of the previously probed cells and read operations, and thus all computations on the already read data take no time.
Using the cell-probe model, Gál and Miltersen [22] showed that the size r of any additional data structure needed for the evaluation of a degree-d polynomial beyond the information theoretical minimum of d + 1 words must satisfy r · t = Ω(d), where t is the number of probes, d ≤ p/(1 + ε), p is a prime, and ε > 0. For linear space data structures (i.e., w-bit words and memory size |M | = O(d · log p/w)), Larsen's lower bound of Ω(log p) is the highest [40] , but it is not close to the lowest known upper bound [36] . Neither bound holds in cWRAM or in a real computer.
IX. CONCLUSIONS
RoT establishment is a necessary primitive for a variety of basic system security problems, including starting a system in a secure initial state [24], [25] and performing trusted recovery [51]. These problems have not been demonstrably resolved since their identification decades ago. They only became harder in the age of persistent malware attacks. RoT establishment is also necessary for verifiable boot, a stronger notion than secure and trusted boot [23].
In this paper we showed that, with a proper theoretical foundation, RoT establishment can be both provable and unconditional. We know of no other software security problem that has had such a solution to date. Finally, the security of time measurements on untrusted systems has been a long-standing unsolved engineering problem [39], [42], [43], [63]. Here, we also showed that this problem can be readily solved given the provable atomicity of the verifier's protocol.
X. Appendix A - The Concrete Word RAM Model
Storage. cWRAM storage includes a fixed sequence M of w-bit memory words indexed by an integer, such that constant w > log|M|. The allocation of each instruction in a memory word follows typical convention: the op code in the low-order bytes and the operands in the higher-order bytes. Furthermore, cWRAM storage also includes k w-bit general-purpose processor registers, R_0, R_1, ..., R_{k−1}. A memory area is reserved for the memory-mapped I/O registers of different devices and the interrupt vector table, which specifies the memory location of the interrupt handlers. The I/O registers include data registers, device-status registers, and device-control registers.
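A minimal sketch of this storage layout, under stated assumptions, is given below. The class name `CWRAMStorage`, its field names, and the `store` helper are hypothetical; the logarithm in w > log|M| is assumed to be base 2; and the memory-mapped I/O area and the interrupt vector table are not modeled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CWRAMStorage:
    w: int              # word width in bits
    memory_size: int    # |M|, the number of memory words
    k: int              # number of general-purpose registers
    memory: list = field(init=False)
    registers: list = field(init=False)

    def __post_init__(self):
        # Assumption: "w > log|M|" is read with a base-2 logarithm,
        # so that any word index fits in one w-bit word.
        if not self.w > math.log2(self.memory_size):
            raise ValueError("word width must exceed log2 of the memory size")
        self.memory = [0] * self.memory_size
        self.registers = [0] * self.k   # R_0 ... R_{k-1}

    def store(self, address, value):
        # Values are truncated to w bits, as in a fixed-width memory word.
        self.memory[address] = value & ((1 << self.w) - 1)


storage = CWRAMStorage(w=16, memory_size=1024, k=8)
storage.store(3, 0x12345)   # only the low 16 bits are kept
```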
Special Registers. In addition to the program counter (PC), the processor state includes internal registers that contain the asynchronous-event status bits, which specify whether these events can be posted or are disabled; e.g., by the event clear or enable instructions. It also includes a set of flags and processor configuration settings (e.g., clock frequency) and specifies whether virtual memory/TLBs and caches are enabled. Instructions to enable and disable caches/virtual memory are also included. In systems that do not automatically disable cache use when virtual memory is disabled, an internal register containing cache configuration status is provided.
Addressing. Each instruction operand is located either in a separate memory word or in the immediate-addressing fields of instructions. Immediate addressing is applicable only when operands fit into some fraction of a word, which depends on the size of the instruction set and addressing mode fields. Indirect, PC-relative, and bit addressing are also supported.
Instruction Set. The cWRAM instruction set includes all the types of practical RAM instructions [49] with up to two operands.
-Unconditional branches: go to g. Branch target g designates either a positive/negative offset from the current program counter, PC, in which case the branch-target address is PC + g, or a register R_k, which contains the branch-target address.
-Conditional branches: for each predicate pred: F_2^w × F_2^w → {0, 1}, where pred ∈ {≤, ≥, =, ≠}, there is an instruction pred(R_i, R_j) g, which means if pred(R_i, R_j) = 1 (true), go to PC + g.
If one of the input registers, say R_j, contains a bit mask, there is an instruction pred(R_i, mask) g, which means if (R_i ∧ mask) = 0, go to PC + g. If R_j = 0, there is an instruction pred(R_i) g, which means if pred(R_i, 0) = 1, go to PC + g.
Note that the predicate set, pred, can be extended with other two-operand predicates so that all known conditional-branch instructions can be represented in cWRAM.
-Halt: there is an instruction that stops program execution and outputs either the result, when the program accepts the input, or an error, when the program does not.
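The control-flow instructions listed above can be sketched as a small interpreter. This is an illustration under stated assumptions, not the model's definition: the tuple encoding of instructions, the names `run` and `PREDICATES`, the fall-through to PC + 1 when a predicate is false, and the string payload returned by halt are all choices made for this example. The bit-mask and compare-with-zero branch variants are omitted.

```python
import operator

# Illustrative predicate table; the model allows further two-operand predicates.
PREDICATES = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def run(program, registers, max_steps=1000):
    """Execute branch and halt instructions; return the halt payload."""
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        if op == "halt":
            return args[0]                 # assumed: payload is the output
        if op == "goto":                   # PC-relative unconditional branch
            pc += args[0]
        elif op == "goto_reg":             # target address held in a register
            pc = registers[args[0]]
        elif op == "br":                   # pred(R_i, R_j) g
            pred, i, j, g = args
            taken = PREDICATES[pred](registers[i], registers[j])
            pc = pc + g if taken else pc + 1   # assumed fall-through
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit exceeded")


PROGRAM = [
    ("br", "<=", 0, 1, 2),   # if R_0 <= R_1, go to PC + 2
    ("halt", "error"),
    ("goto", 1),             # go to PC + 1
    ("halt", "result"),
]

accepted = run(PROGRAM, [3, 5])
rejected = run(PROGRAM, [9, 5])
```

With R_0 = 3 and R_1 = 5 the branch is taken and execution reaches the accepting halt; with R_0 = 9 the branch falls through to the error halt.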
