Abstract-We present a HW/SW platform for on-the-fly detection of failures and weaknesses in entropy sources. By splitting the operations between hardware and software, we achieve sufficient flexibility to control the level of significance of the tests. This approach also enables sharing resources between different tests thereby reducing the area and power. Statistical tests were selected from the NIST test suite. We propose several versions of hardware co-processors for monitoring random bit sequences, ranging from 52 slices (5 tests) to 552 slices (9 tests) on Spartan-6 FPGA. We are the first to provide implementations of the Serial test and the Approximate entropy test for on-the-fly monitoring.
I. INTRODUCTION
Random numbers are used in cryptography for generating session keys, nonces, and random challenges in various authentication protocols. These numbers are generated using specialized primitives called true random number generators (TRNGs) which produce unpredictable, uniformly distributed output values.
BSI's standard AIS-31 [1] for TRNG evaluation requires on-the-fly tests of all TRNGs implemented in hardware. The purpose of these tests is to detect failures or statistical weaknesses of the entropy source. The draft NIST special publication [2] also requires on-the-fly tests (health tests) for random number generators.
A. Previous Work
Several batteries of tests for statistical evaluation of TRNGs are available, such as FIPS [3], [4], NIST [5] and DIEHARD [6]. Some of these tests have been implemented in hardware. Hardware implementations of 4 FIPS tests are proposed in [7], [8]. Implementations of 2 simple tests from the NIST suite and 4 from the DIEHARD suite are presented in [9] and [10]. In [11], an FPGA implementation based on dynamic reconfiguration was provided. However, the architecture of this design was not suitable for on-the-fly testing. Partially reconfigurable ASIC implementations of 6 NIST tests are presented in [12]. FPGA implementations of 8 different tests from the NIST test suite are presented in [13]. Each test was implemented individually and none of the hardware resources were shared.
As pointed out in [14], one disadvantage of embedded on-the-fly tests is their possible effect on the TRNG behavior. Embedded tests increase the chip activity, resulting in more digital noise, which may affect the TRNG in such a way that it passes the statistical tests. However, if the TRNG is used when the tests are not active, less (deterministic) noise is present on the chip. One way to resolve this situation is to keep the tests active all the time while the TRNG is in operation, thereby ensuring that the TRNG always operates under the same conditions under which it was tested. Another disadvantage is vulnerability to fault attacks. In all previous implementations, the embedded tests generate an alarm signal when a threat is detected. If this signal is connected to ground (for instance due to a probing attack), the failure will not be detected. In order to prevent this type of fault, a different approach is required.
B. Our Contribution
In this paper we present hardware blocks for performing on-the-fly evaluation of randomness. We provide three different versions that operate on bit sequences of different lengths. The random sequence is read bit by bit. The tests were selected from the NIST battery of tests. This test suite is not designed for on-the-fly testing and, in general, it is not suitable for this purpose due to its high latency and high computational requirements. However, some of these tests can be optimized for compact hardware implementation and low latency. We have selected those tests from the suite that can be optimized in this way in order to make them suitable for on-the-fly monitoring.
Our first contribution is splitting the calculations between hardware and software, which enables efficient sharing of resources between different tests and minimizes the required hardware area. Each test is carried out using two types of operations. The first type consists of operations on the incoming bits, such as counting ones and zeros, finding the longest run of the same value, counting the appearances of a given pattern or keeping track of a random walk. These operations are implemented in hardware using counters, comparators and registers, and they are performed while the TRNG is active.
The second type of operations is used for verifying the randomness hypothesis. These operations, which include addition, multiplication, squaring and comparison with precomputed constants, are performed in software. The presented hardware blocks analyze the generated sequence and provide results that do not depend on the level of significance $\alpha$. The level of significance figures only in the operations performed by the software, which can be easily programmed.
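The first type of operations can be illustrated with a small software model of the per-bit updates. This is only a sketch under stated assumptions: the class name, field names and the 4-bit pattern width are hypothetical, the longest run is tracked over the whole sequence rather than per block, and the wrap-around of the sequence used by some NIST tests is not modeled.

```python
class BitSerialMonitor:
    """Illustrative model of the per-bit updates (hypothetical names)."""

    def __init__(self, pattern_bits=4):
        self.pattern_bits = pattern_bits
        self.n_bits = 0           # global bit counter
        self.walk = 0             # up/down counter (random walk)
        self.walk_max = 0         # maximal partial sum
        self.walk_min = 0         # minimal (negative) partial sum
        self.prev_bit = None
        self.run_len = 0          # length of the current run
        self.longest_run = 0      # longest run of the same value
        self.shift = 0            # shift register holding the last bits
        self.pattern_counts = [0] * (1 << pattern_bits)

    def update(self, bit):
        """Process one incoming bit; every step is a counter or comparator."""
        self.n_bits += 1

        # Random walk: count up on 1, down on 0, and record the extremes.
        self.walk += 1 if bit else -1
        self.walk_max = max(self.walk_max, self.walk)
        self.walk_min = min(self.walk_min, self.walk)

        # Longest run of the same bit value.
        self.run_len = self.run_len + 1 if bit == self.prev_bit else 1
        self.longest_run = max(self.longest_run, self.run_len)
        self.prev_bit = bit

        # Overlapping pattern counting with a shift register.
        mask = (1 << self.pattern_bits) - 1
        self.shift = ((self.shift << 1) | bit) & mask
        if self.n_bits >= self.pattern_bits:
            self.pattern_counts[self.shift] += 1


monitor = BitSerialMonitor()
for b in [1, 1, 0, 1, 1, 1, 0, 0]:
    monitor.update(b)
```

After the loop, the object holds only raw counts; no decision about randomness has been made yet, which is the point of leaving the hypothesis check to software.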
Our second contribution is the implementation of the serial test and the approximate entropy test from the NIST test suite. To the best of our knowledge, these are the first hardware implementations of these tests suitable for on-the-fly testing.
Our third contribution is the unified implementation of hardware blocks, which allows for different tests to share resources such as bit counters. In addition, different tests use the same counter values (for example, number of ones in a sequence) so there is no need for these counters to be duplicated. 
C. Organization
This paper is organized as follows. In Section II, we provide the general background and notation. Section III deals with the implementation aspects, including the numerical simplifications, the splitting of operations between hardware and software, and the implementation of the hardware blocks. The results of the FPGA and ASIC implementations, as well as a comparison with relevant work, are presented in Section IV. Finally, conclusions and proposals for future work are provided in Section V.
II. BACKGROUND

A. Statistical Tests
Even though randomness is a property of a variable rather than of a sequence of numbers, and therefore cannot be measured like other physical quantities, it is possible to estimate the randomness by checking different statistical properties. The idea behind statistical tests is to start with the hypothesis that the RNG is ideal. We denote this hypothesis as $H_0$. A statistical test is used to measure a property of the generated sequence, for example the longest run of zeros. Based on the test result, the hypothesis $H_0$ is either accepted or rejected. If the sequence is indeed random, it is still possible to reject $H_0$ with some small probability. This is known as a type 1 error. The probability of this error is a design parameter called the level of significance and denoted as $\alpha$. The NIST recommendation for the value of $\alpha$ is between 0.001 and 0.01. The other type of error that can occur is accepting $H_0$ when the sequence is not random. This is known as a type 2 error, and the purpose of the test is to minimize the probability of this error. The generated sequence should be tested for many different statistical weaknesses in order to accept it as random. As summed up in Table I, the NIST suite consists of 15 statistical tests which estimate different properties of a random variable. These tests were originally designed to evaluate the statistical properties of pseudo-random number generators (PRNGs), but they are also used for evaluation of TRNGs. Hardware implementations of these tests can be used for on-the-fly monitoring.
B. On-the-fly Tests
Hardware implementations of TRNGs are susceptible to different active attacks. It is possible to reduce the randomness by changing the operating conditions, such as temperature or voltage. Paper [15] demonstrates that manipulating the power supply can cause ring oscillators inside a TRNG to lock to a certain frequency, thereby reducing the generated entropy. A similar attack can be done using an electromagnetic probe as shown in [16]. The trivial way of completely disabling the source of randomness is to cut the signal wire used for transmitting random bits. In addition to active attacks, a designer needs to worry about failures due to aging. For this reason, different tests are required for on-the-fly monitoring of RNGs: quick tests for fast detection of the total failure of the entropy source, as well as slow tests for the detection of long term statistical weaknesses.
III. IMPLEMENTATION
A. System Design
In embedded systems, random number generators are never used as a stand-alone module, but rather in conjunction with the other components such as embedded processors, microcontrollers or DSPs. For this reason, we can assume that the chip containing an RNG and the HW testing block, also contains a component that can perform basic arithmetic operations. This component can serve as the software platform.
As depicted in Fig. 1 , the embedded system consists of a TRNG, the HW Testing Block, at least one component that performs arithmetic (microcontroller, DSP or a GPU), and possibly other components such as embedded RAM and crypto co-processors. We split the testing implementation into two parts: hardware and software. To reduce the area and power consumption of the HW testing block, it is implemented in a compact manner using only the basic components such as counters, comparators and registers, while all power-hungry arithmetic operations are moved to the software part. Squaring, multiplication, logarithm and comparison with the precomputed constants are performed in software.
The proposed approach allows flexibility with respect to the level of significance $\alpha$. Since the implementation of the hardware block does not depend on $\alpha$, only the software part needs to be updated if this value changes. As pointed out in [14], on-the-fly tests should be active while the TRNG is working in order to ensure that the TRNG always operates under the same conditions as those under which it was tested. The proposed approach allows us to run the hardware block all the time and check the test results only when needed.
B. HW/SW Calculations
All calculations needed for the statistical tests are divided between a HW testing block and software executed on the SW platform (a microcontroller or any processor with instructions for basic arithmetic). The boundary between hardware and software is chosen to minimize the hardware block, i.e. to keep only the necessary parts in hardware. Each test consists of operations that have to be executed while the bit stream is generated (the HW part) and basic arithmetic operations (the SW part). It is also important to minimize the amount of data that needs to be transferred from the HW block to the processor in order to simplify the interface between the two modules.
We have selected 9 tests from the NIST test suite which are suitable for this type of implementation, as indicated in Table I. The remaining 6 tests require either too much data storage in the HW module, which would result in a large area, overly complex operations in the software part, which would result in high latency, or too much data to be transferred between the two modules, which would result in an overly complicated interface.
Table II presents how the required operations were divided between HW and SW. The middle column (Hardware), lists all the values that are computed by the HW module and transferred to the co-processor. We used the following notation:
• -the total number of ones.
• -number of ones within a block of data.
• -length of a single block of data.
• -number of data blocks.
• -number of runs within a block of data.
• -number of non-overlapping appearances of a given template within a block of data.
• , -the number of data blocks in each category depending on the number of overlapping appearances of a given template.
• -length of a template.
• , , -partial sums obtained by the up/down counter. Maximal, minimal (negative) and the final value are recorded.
All other values are test-specific precomputed constants.
Software routines operate on the values obtained from the hardware. As can be seen from the last column of Table II, the operations required in software are basic multiplication, addition and comparison. The only difficulty is implementing the $x \cdot \log(x)$ function needed for the approximate entropy test, which is described in more detail in subsection III-D.
C. HW Implementations
We have implemented several versions of the HW testing block. Fig. 2 shows the implementation of the largest version, which contains all 9 tests and operates on a sequence of $2^{20}$ bits. Clock and enable signals are omitted for clarity. The global bit counter is also not shown. This counter is used to count the total number of bits in order to detect the end of the sequence. A memory-mapped interface is implemented using a large multiplexer, where the 7-bit address is used as a select signal. Since this interface contributes significantly to the overall area, we can save resources by reducing the number of transmitted values.
After receiving each random bit from the generator, all update calculations finish within one clock cycle.
This type of unified implementation enables us to share more resources between different tests to obtain higher area reduction. We have used 4 tricks to reduce the area of this module:
• Omitting a redundant counter: Tests 1 and 3 use the total number of 1's in a sequence to compute the test result. However, it is possible to obtain this result without this counter. For the implementation of test 13, an up/down counter is used to keep track of the random walk. The total number of ones can be calculated from the final value of this counter. For this reason, the counter of ones can be omitted.
• Block detection: For the implementations of tests 2, 4, 7 and 8 it is necessary to divide the sequence into sub-blocks and look for certain properties in each block (the total number of ones, longest run of the same bit value, and occurrences of different patterns). We have selected test parameters such that block lengths are equal to powers of 2, which enables us to detect the beginning and the end of each block by simply observing specific bits of the global bit counter.
• Unified implementation: The approximate entropy test (test 12) uses the number of all 4-bit and 3-bit patterns in a sequence. These values are already provided by the serial test (test 11) implementation, therefore there is no need for the separate implementation of test 12.
• Shared shift register: The non-overlapping and the overlapping template match tests compare the generated numbers with the pre-defined 9-bit patterns. The same shift register can be used for both tests.
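The first two tricks in the list above follow from simple counter arithmetic, sketched below. The function names are hypothetical and the code models only the arithmetic, not the actual hardware. Since the final random-walk value equals the number of ones minus the number of zeros, and the sequence length equals their sum, the number of ones is recovered without a dedicated counter. For power-of-two block lengths, the end of a block is reached when the low bits of the global bit counter are all ones.

```python
def ones_from_walk(n_bits, walk_final):
    """Recover the number of ones from the final up/down counter value.

    walk_final = ones - zeros and n_bits = ones + zeros,
    hence ones = (n_bits + walk_final) / 2.
    """
    return (n_bits + walk_final) // 2


def is_block_end(bit_index, log2_block_len):
    """True if the bit with zero-based index bit_index closes a block.

    Only the low log2_block_len bits of the global counter are inspected,
    which in hardware amounts to an AND over a few counter bits.
    """
    mask = (1 << log2_block_len) - 1
    return (bit_index & mask) == mask


# Example: 8 bits with final walk value 2 contain 5 ones and 3 zeros.
print(ones_from_walk(8, 2))
# Example: with blocks of 8 bits, indices 7, 15, 23 and 31 close a block.
print([i for i in range(32) if is_block_end(i, 3)])
```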
D. SW Implementations
Typical software implementations of statistical tests operate by computing the P-value and comparing it with the required level of significance. The P-value is the probability that an ideal random number generator produces a sequence which is worse than the measured sequence with respect to the metric used by the test (for example bias, or the longest run of the same bit value). This is a computationally intensive task because calculating P-values requires complicated functions such as erfc and the gamma function. We use a simpler approach: the P-value function is inverted in advance to obtain the critical value of the test statistic, and the resulting constants are stored, thereby skipping the most computationally intensive step. This approach is also used in [9], [12], [13].
For some tests, the software only needs to compare a value obtained from the hardware with the critical value. For test 3, the critical values for the number of runs are stored in the program memory as constants, and they depend on the number of ones. The SW procedure first checks the interval to which the number of ones belongs and, based on the result, compares the number of runs with the appropriate constant. A similar approach was used for the FPGA implementation in [13].
Other tests require simple calculations on the obtained values before the comparison. The required operations are comparison, addition (subtraction), multiplication and squaring. A typical processor has dedicated instructions for these operations. The main difficulty is related to the approximate entropy test, which requires the implementation of the function $x \cdot \log(x)$. In order to avoid a computationally intensive logarithm calculation, we implemented this function using a piece-wise linear approximation with 32 segments. As can be seen in Fig. 3, the approximation (dashed line) is almost indistinguishable from the function (full line), resulting in less than 3% error.
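The idea of replacing the P-value calculation with a precomputed constant can be sketched for the NIST frequency test, whose P-value is erfc(|S| / sqrt(2n)), with S the final value of the random walk and n the sequence length. The function names below are hypothetical, and the bisection search is just one convenient way to invert the P-value offline; it is not necessarily how the constants in this work were obtained.

```python
import math


def monobit_p_value(walk_final, n_bits):
    """P-value of the NIST frequency test for a final random-walk value."""
    return math.erfc(abs(walk_final) / math.sqrt(2.0 * n_bits))


def monobit_critical_value(n_bits, alpha):
    """Largest |walk_final| whose P-value is still >= alpha.

    Computed once, offline. The P-value decreases as |walk_final| grows,
    so an integer bisection search is sufficient.
    """
    lo, hi = 0, n_bits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if monobit_p_value(mid, n_bits) >= alpha:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Offline: one constant per (sequence length, level of significance) pair.
CRITICAL = monobit_critical_value(2**20, 0.01)


def monobit_pass(walk_final):
    """Run-time check: a single comparison, no erfc evaluation."""
    return abs(walk_final) <= CRITICAL
```

Changing the level of significance only changes the stored constant, so the hardware that produces the random-walk value is unaffected.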
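A piece-wise linear approximation of this kind can be sketched as a table lookup followed by one multiplication and one addition. The sketch below assumes 32 uniform segments on the interval [0, 1], the natural logarithm, and floating-point arithmetic; the breakpoint placement, number format and table contents actually used in this work are not specified here, so these choices are illustrative only.

```python
import math

SEGMENTS = 32  # number of linear segments (uniform spacing is an assumption)


def xlogx(x):
    """Reference function x * ln(x), extended with the limit value 0 at x = 0."""
    return 0.0 if x == 0.0 else x * math.log(x)


# Precomputed function values at the 33 breakpoints 0, 1/32, ..., 1.
TABLE = [xlogx(i / SEGMENTS) for i in range(SEGMENTS + 1)]


def xlogx_pwl(x):
    """Approximate x * ln(x) on [0, 1] by interpolating between breakpoints."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    pos = x * SEGMENTS
    i = min(int(pos), SEGMENTS - 1)   # segment index, clamped for x = 1
    frac = pos - i                    # position inside the segment
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


# Largest absolute deviation from the reference on a fine grid.
max_err = max(abs(xlogx_pwl(k / 10000) - xlogx(k / 10000)) for k in range(10001))
```

With uniform breakpoints the largest deviation occurs in the first segment, where the function is steepest; a non-uniform breakpoint placement could reduce it further.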
IV. RESULTS
We made 8 different designs covering 9 tests from the NIST test suite. The hardware designs are written in Verilog HDL, and Mentor Graphics Modelsim SE PLUS 6.6d is used for functional simulation. All proposed hardware designs are synthesized using Xilinx ISE14. As with most practical implementations, no single design is optimal for every system, and different applications demand different design trade-offs. As shown in Table III, we propose 8 different implementations which support three different input lengths: 128, 65536 and 1048576 bits. Thanks to its compact hardware footprint, the 128-bit version can be used in lightweight designs for up to seven different tests. On the other hand, the 1048576-bit version supports long term evaluation and up to nine different tests. The 65536-bit version provides a balanced trade-off between the hardware area and the input length of the random sequence. All our implementations on FPGA have a maximum working frequency larger than 100 MHz. In other words, they can handle an input bit rate of 100 Mbit/s, which is enough for most of the TRNGs on FPGA.
Our designs are also suitable for ASIC, either as an individual unit or as a building block for processors. In Table III we provide the area results in GE (gate equivalents).
Our software designs can be implemented on different hardware platforms, such as microcontrollers, GPUs and DSPs. These embedded systems might utilize different dedicated peripherals, such as a HW multiplier and squarer. The number of required clock cycles is greatly dependent on the SW platform. In Table III, we present the instruction count of the software implementations for a 16-bit architecture. We can see that the largest version requires more than 900 ADD/SUB and almost 200 MUL/SQR instructions. The reason is that operations on data wider than 16 bits have to be decomposed into several 16-bit operations. We can expect that considerably lower latency could be achieved on 32-bit or 64-bit platforms.
Since this is the first unified FPGA implementation of the embedded tests, it is difficult to compare it with previously published work. In [13], individual implementations of different tests are presented. These tests operate on sequences of different lengths, as shown in Table IV. By comparing the total number of occupied slices of the individual tests from [13] with our unified implementation for a sequence length of 65536 bits, we can see that our implementation uses around 20% fewer slices. However, the comparison is not entirely fair for two reasons: first, we use a longer bit sequence, and second, some of the functionality is moved to software. In order to compare the latency, we use openMSP430 [17] as the hardware platform to evaluate our design. As expected, the latency of the software routine is higher than the latency of the slowest test from [13], but still much lower than the time needed to generate the sequence.
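The effect of the word size on the instruction count can be illustrated with a schoolbook decomposition of a 32-bit multiplication into 16-bit products. This is a generic sketch, not the routine used to obtain the counts in Table III; the function name is hypothetical, and the exact number of instructions on a real 16-bit processor depends on its multiplier and carry handling.

```python
MASK16 = 0xFFFF


def mul32_via_16(a, b):
    """Multiply two unsigned 32-bit values using only 16x16-bit products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    # One wide multiplication becomes four narrow multiplications...
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi

    # ...plus shifted additions to recombine the partial products.
    return p_ll + ((p_lh + p_hl) << 16) + (p_hh << 32)


print(mul32_via_16(65536, 3))
```

On a 16-bit machine each of the recombining additions is itself a multi-word addition with carries, which is consistent with the ADD/SUB count being several times larger than the MUL/SQR count.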
V. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented a unified implementation of different NIST tests based on splitting the operations between hardware and software. By keeping only the necessary operations in hardware, we have achieved our goal of a low area cost for the hardware part and sufficient flexibility for the software part.
