Index Terms-Pseudorandom number generators, discrete dynamical systems, statistical tests, hardware security, applied cryptography, system on chip, FPGA.
I. INTRODUCTION

Despite its long history, random number generation remains a hot topic, driven by the emergence of so-called Random as a Service or Entropy as a Service needs [1]. It has also become a key element of lightweight security cores in IoT devices. Finally, cloud services suffer when they have to spawn multiple virtual machine (VM) instances from a golden image: such instances appear to have a very limited ability to harvest randomness [2]. Given the wide use of these generators in the applications described above, their integration into Systems on Chip (SoC) is highly desirable, particularly for IoT and smart cards. Therefore, the practical purpose of current research is to provide compact, high-throughput, secure, and reconfigurable pseudorandom generators for hardware applications.
Let us recall that a random number generator is defined by the state space $S$ of the generator, the transition function $f$, the output extraction function $g$ applied to a given state, and the seed $x_0$ [3]. The random output sequence is $y_1, y_2, \ldots$, where each $y_t$ is generated in the two main steps described below. The first step applies the transition function according to the recurrence $x_{t+1} = f(x_t)$, where $x_t$ and $x_{t+1}$ both belong to $S$. The function $f$ can be an algorithm that deterministically produces random-like numbers in a discrete and finite state space; such generators are called pseudorandom number generators (PRNGs). Alternatively, $f$ can rely on a physical source of entropy, thus making $S$ a continuous space; this approach is called a true random number generator (TRNG). The second step applies the output function to the new internal state, leading to the output $y_t = g(x_t)$. There is a large variety of such recursive generators, which can be linear, nonlinear, chaotic, and so on.
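As an illustration of the two-step model $x_{t+1} = f(x_t)$, $y_t = g(x_t)$, the following Python sketch wires a transition function and an output function into a generator. The concrete choices are assumptions made only for this example: a 32-bit state space, Marsaglia's xorshift32 (shift triple 13, 17, 5) as $f$, and a hypothetical `high16` truncation as $g$; none of them is one of the hardware generators studied in this article.

```python
from typing import Callable, Iterator

MASK32 = 0xFFFFFFFF


def xorshift32_step(x: int) -> int:
    """Transition function f on a 32-bit state (Marsaglia's xorshift32, shifts 13/17/5)."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x


def high16(x: int) -> int:
    """Output function g (hypothetical): keep the 16 most significant bits of the state."""
    return x >> 16


def generator(seed: int, f: Callable[[int], int], g: Callable[[int], int]) -> Iterator[int]:
    """Yield y_1, y_2, ... where x_{t+1} = f(x_t) and each output is g of the new state."""
    x = seed & MASK32
    while True:
        x = f(x)      # step 1: transition inside the state space S
        yield g(x)    # step 2: output extraction from the new state


if __name__ == "__main__":
    gen = generator(2463534242, xorshift32_step, high16)
    print([next(gen) for _ in range(4)])
```

Separating $f$ and $g$ in this way mirrors the hardware split between a transition unit and an output (tempering) unit.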
Random number generation has mostly been studied in mathematics from the software point of view, whereas hardware and semiconductor solutions have been investigated in depth for true random generation. On the one hand, linear PRNGs are a special case of linear recurrences modulo 2 (that is, $S$ is a vector space over $\mathbb{F}_2$). Many research works regularly propose solutions to increase their performance and improve their statistical profile, and their linearity and security are investigated accordingly. Unfortunately, only a few of these linear PRNGs have been analyzed in detail at the hardware level, on FPGA or ASIC. On the other hand, chaotic pseudorandom number generators (CPRNGs) are nonlinear generators of the form $x_0 \in \mathbb{R}$ and $x_{t+1} = f(x_t)$, where $f$ is a chaotic map. They are an attractive application of the mathematical theory of chaos. Reasons for this interest include their sensitivity to initial conditions, their unpredictability, and their ability to synchronize with each other [4]. Truly chaotic generators illustrate these characteristics well: their period is infinite, their hardware footprint is compact, and they often pass statistical tests reasonably well [5]-[7].
One natural question arises: how can we inject disorder into a deterministic digital system while respecting, on such finite state machines, the mathematical definitions of chaos given by Devaney [8] and Li-Yorke [9]? A usual answer in digital embedded systems is to consider pseudo-chaotic generators instead of truly chaotic ones [6], [10]-[12]. Despite the quality of TRNG outputs based on a chaotic phenomenon, most of these techniques are either slow (i.e., in the range of some Kbps to Mbps, to extract noise or jitter from a given component [13]) or costly (e.g., extracting or measuring noise with an oscilloscope or a laser [5], [14]). Additionally, embedding these TRNGs in a purely digital platform is extremely challenging, the main concern being the calibration of the bias coming from analog inputs. Digital TRNGs thus lead to poorly controlled output uniformity and performance compared to theory. Conversely, chaotic PRNGs appear as a convenient solution in SoC platforms such as Zynq-based FPGAs [15].
Additionally, these PRNGs have various drawbacks; in particular, they fail the linear complexity tests on their outputs. This work notably illustrates that a 32-bit internal state is sufficient to pass the linear complexity tests only if post-processing operations (permutations, for instance) are applied to scramble the output. Another solution that comes to mind is to enlarge the internal state space while keeping the same output length (32 bits). However, this second answer conflicts with the objective of hardware implementation, which is to keep resource usage as low as possible.
This article is an extended version of a paper accepted at Secrypt 2016, the 13th International Conference on Security and Cryptography [16]. In that paper, we reported the initial design and evaluation of the Chaotic Iterations based PRNG (CIPRNG) as a possible post-processing for hardware PRNGs, demonstrating its benefits compared to other linear PRNGs. This proposal adds chaos (as mathematically defined by Devaney [8] and Li-Yorke [9]) to linear PRNGs as a post-processing step in which, at each iteration, only a subset of the components of the iteration vector is updated. In this article, we undertake a deeper evaluation of these linear PRNGs, encompassing statistical tests, throughput and latency, and Berlekamp-Massey algorithm [17] analysis. To the best of our knowledge, no previous paper has investigated hardware implementations of such linear PRNGs in depth. Our investigations also consider pseudo-chaotic generators implemented on FPGA, which use various maps such as the so-called Logistic Map [18], Timing Reseeding [19], or Differential Chaotic [20] generators. We also provide a detailed description of the SoC platform used for implementation and randomness tests. In addition to the details of CIPRNG-XOR (presented in [16]), this article provides two new Chaotic Iterations based post-processes, namely Multi-Cycle (CIPRNG-MC) and Multi-Cycle Multi-Dimension (CIPRNG-MCMD). The underlying theory is emphasized, since our proposal has been fully proven in the rigorous framework of chaos theory. Finally, we extend the hardware study beyond FPGA with an ASIC implementation in UMC 65 nm Low Leakage technology, which is used to compare the area, throughput, and power consumption of the investigated generators without inferring any dedicated blocks (DSP, RAM).
This extension of a conference article is organized as follows. Section II discusses hardware design (FPGAs and ASIC) and analyzes a set of selected linear and chaotic pseudorandom number generators. Performance is studied in Section III: frequency, area, weaknesses, and computational complexity are investigated to select which linear PRNGs can be used for post-processing. Then, Section IV presents the mathematical topological foundations of chaotic iterations [21], while their application to PRNGs is detailed in Section V. Section VI compares the implementations of linear PRNGs and chaotic iterations on FPGAs using the Zynq platform, while the ASIC implementation is discussed in Section VII. This article ends with statistical tests (Sec. VIII) and a conclusion section, in which our study is summarized and intended future work is outlined.
II. BACKGROUND OF LINEAR AND CHAOTIC PRNGS
A. $\mathbb{F}_2$-Linear PRNGs
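Let $\mathbb{F}_2$ be the finite field of cardinality 2. Let us first recall that a common way to define a pseudorandom number generator is to consider two functions $f$ and $g$, and a recurrence defined by
$$x_t = f(x_{t-1}), \qquad y_t = g(x_t), \qquad f : \mathbb{F}_2^N \to \mathbb{F}_2^N, \quad g : \mathbb{F}_2^N \to \mathbb{F}_2^M,$$
where usually $N > M$, $g$ is one-way, $x_0$ is a seed provided by the user, and $y_t$ is returned to the user. Linear PRNGs are a special case of linear recurrences modulo 2. Therefore, a linear PRNG of $w$ bits can be defined by the following equations:
$$x_t = A\, x_{t-1}, \tag{1}$$
$$y_t = B\, x_t, \tag{2}$$
$$r_t = \sum_{l=1}^{w} y_{t,l-1}\, 2^{-l}. \tag{3}$$
Equation (1) defines the function $f$, where $x_t = (x_{t,0}, \ldots, x_{t,k-1}) \in \mathbb{F}_2^k$ is the $k$-bit state vector at step $t$ and $A$ is a $k \times k$ transition matrix with elements in $\mathbb{F}_2$. Equations (2) and (3) define the function $g$, where $y_t = (y_{t,0}, \ldots, y_{t,w-1}) \in \mathbb{F}_2^w$ is the $w$-bit output vector at step $t$, and $B$ is a $w \times k$ output transformation matrix with elements in $\mathbb{F}_2$. The latter produces the output bits corresponding to the internal RNG state, which are then rewritten as $r_t \in [0, 1)$, the output at step $t$. Let us provide some examples of such linear PRNGs.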
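The following Python sketch instantiates Eqs. (1) to (3) on a toy example. Matrices over $\mathbb{F}_2$ are stored as lists of row bitmasks, and each output bit of a matrix-vector product is the parity of one row AND-ed with the state. The 4-bit matrix `A` (the LFSR recurrence $s_{n+4} = s_{n+1} \oplus s_n$, characteristic polynomial $x^4 + x + 1$) and the 2-row output matrix `B` are hypothetical choices made to keep the example small; real generators such as LFSR113 use much larger, structured matrices that are never stored explicitly.

```python
from typing import List, Tuple


def parity(v: int) -> int:
    """Parity (sum mod 2) of the bits of v."""
    return bin(v).count("1") & 1


def matvec_f2(rows: List[int], x: int) -> int:
    """Matrix-vector product over F_2. Row i is an int bitmask; bit i of the
    result is the inner product <row_i, x> mod 2."""
    y = 0
    for i, row in enumerate(rows):
        y |= parity(row & x) << i
    return y


def f2_linear_prng(A: List[int], B: List[int], x0: int, steps: int) -> List[Tuple[int, float]]:
    """Iterate x_t = A x_{t-1}, y_t = B x_t and map y_t to r_t in [0, 1)."""
    w = len(B)
    x, out = x0, []
    for _ in range(steps):
        x = matvec_f2(A, x)                                           # Eq. (1)
        y = matvec_f2(B, x)                                           # Eq. (2)
        r = sum(((y >> l) & 1) * 2.0 ** -(l + 1) for l in range(w))  # Eq. (3): y_{t,l} weighs 2^-(l+1)
        out.append((y, r))
    return out


# Hypothetical example: state bits (s_n, s_{n+1}, s_{n+2}, s_{n+3}) -> (s_{n+1}, ..., s_n XOR s_{n+1}),
# i.e. the LFSR with characteristic polynomial x^4 + x + 1; B keeps the two low state bits (w = 2).
A = [0b0010, 0b0100, 0b1000, 0b0011]
B = [0b0001, 0b0010]
```

Because $x^4 + x + 1$ is primitive, every nonzero seed visits the 15 nonzero states before repeating, and `matvec_f2(A, a ^ b) == matvec_f2(A, a) ^ matvec_f2(A, b)` holds for all states: this linearity is precisely what the linear complexity tests of Section III detect.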
1) Linear Feedback Shift Register (LFSR):
Well-known examples of such generators are LFSR113 [22], LFSR258 [22], and Taus88 [23]. The Look-up Table Shift Register (LUT-SR [24]) is another LFSR, in which the authors propose to use LUTs as $k$-bit shift registers through Xilinx SRL32.
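To give a concrete idea of the LFSR principle, here is a minimal 16-bit Galois LFSR in Python. It is a textbook example (toggle mask `0xB400`, feedback polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$), not one of LFSR113, LFSR258, Taus88, or LUT-SR; in hardware, each step reduces to a one-bit shift and a few XOR gates on the tapped positions.

```python
def galois_lfsr16_step(state: int) -> int:
    """One step of a 16-bit Galois LFSR with toggle mask 0xB400
    (feedback polynomial x^16 + x^14 + x^13 + x^11 + 1)."""
    lsb = state & 1          # bit shifted out of the register
    state >>= 1
    if lsb:
        state ^= 0xB400      # apply the taps when a 1 leaves the register
    return state


def lfsr_period(seed: int) -> int:
    """Number of steps needed to come back to a nonzero seed."""
    x, n = galois_lfsr16_step(seed), 1
    while x != seed:
        x, n = galois_lfsr16_step(x), n + 1
    return n


if __name__ == "__main__":
    # A maximal-length 16-bit LFSR visits all 2^16 - 1 nonzero states.
    print(lfsr_period(0xACE1))
```

The all-zero state is a fixed point and must be avoided as a seed, which is why LFSR-based designs always exclude it.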
2) Linear Congruential Generators (LCGs): PCG32 [25] is an instance of an improved LCG: it applies a permutation function as post-processing (dropping bits and applying fixed and random rotations) to improve the randomness of the outputs. We can also mention the MRG32k3a generator [26] (further denoted as MRG32), which is a combined Multiple Recursive Generator (MRG) whose period is about $2^{191}$. KISS124 [27] is another 64-bit combined PRNG (period $2^{124}$) that calls three PRNGs: a 64-bit MWC (Multiply-With-Carry), XOR64, and finally an LCG.
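The permutation-based output stage of PCG32 can be sketched with a Python port of the minimal PCG32 (XSH-RR) reference routine: a 64-bit LCG transition, then an output function that xorshifts the high bits and applies a data-dependent rotation selected by the top 5 bits of the old state. The multiplier and shift amounts follow that reference; the default stream value 54 is only an arbitrary selector for this example, and this sketch is not the hardware design evaluated in this article.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
MULTIPLIER = 6364136223846793005


class PCG32:
    """Python port of the minimal PCG32 (XSH-RR): a 64-bit LCG transition
    followed by an output permutation (xorshift + random rotation)."""

    def __init__(self, seed: int, stream: int = 54) -> None:
        self.inc = ((stream << 1) | 1) & MASK64   # the LCG increment must be odd
        self.state = 0
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self) -> int:
        old = self.state
        self.state = (old * MULTIPLIER + self.inc) & MASK64     # LCG step (transition f)
        xorshifted = (((old >> 18) ^ old) >> 27) & MASK32        # fold the strong high bits
        rot = old >> 59                                          # top 5 bits select the rotation
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32


if __name__ == "__main__":
    rng = PCG32(seed=42)
    print([hex(rng.next_u32()) for _ in range(3)])
```

The output is computed from the old state, so the permutation can be evaluated in parallel with the next multiplication, a property that matters for pipelined hardware.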
3) Twisted Generalized Feedback Shift Register (TGFSR):
They are based on a matrix linear recurrence of order $n$ over a sequence of $w$-bit words:
$$x_{l+n} = x_{l+m} \oplus x_l A, \qquad l = 0, 1, \ldots,$$
where $0 < m < n$ and $A$ is a $w \times w$ matrix over $\mathbb{F}_2$. Mersenne Twister [28], Well512 [29], and TT800 [30] generators are special cases of TGFSR, which use BRAM memory to read and write the three words involved in each step.
4) Xorshift Generators: XOR64 [31] and XOR128, with periods of $2^{64} - 1$ and $2^{128} - 1$ respectively, are examples of these generators. We can also cite the XOR64* generators [32], which scramble the result of a Xorshift using a 64-bit multiplication; the largest member of this family reaches a period of $2^{1024} - 1$ with a 1024-bit state. Finally, XOR128+ [33] is a generator with a 128-bit state made of two 64-bit words, based on two XOR64 and on the sum of the newly and previously generated outputs, with a period of $2^{128} - 1$.
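The difference between a purely $\mathbb{F}_2$-linear xorshift and its multiplicatively scrambled variant can be sketched in Python as follows, assuming the xorshift64* parameters of Vigna's xorshift* family (shift triple 12, 25, 27 and multiplier `0x2545F4914F6CDD1D`). The transition is linear over $\mathbb{F}_2$, whereas the 64-bit product that forms the output is not.

```python
MASK64 = (1 << 64) - 1
STAR_MULTIPLIER = 0x2545F4914F6CDD1D


def xorshift64_step(x: int) -> int:
    """F_2-linear transition on a nonzero 64-bit state (shift triple 12, 25, 27)."""
    x ^= x >> 12
    x ^= (x << 25) & MASK64
    x ^= x >> 27
    return x


def xorshift64star(x: int) -> tuple:
    """Return (new_state, output): the 64-bit multiplication scrambles the
    linear state into an output that is no longer F_2-linear."""
    x = xorshift64_step(x)
    return x, (x * STAR_MULTIPLIER) & MASK64
```

Carries propagated by the multiplication are what break linearity over $\mathbb{F}_2$; in hardware, this costs a 64-bit multiplier (typically DSP blocks on FPGA).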
B. Chaotic PRNG
This section presents the state of the art of pseudo-chaotic generators that have already been implemented on FPGA.
1) Differential Chaotic PRNG: The authors of [34] propose a digitized implementation of a nonlinear chaotic oscillator system in Rössler form [20]. They solve the Lorenz hyperchaotic system, along with other differential systems such as the Chen [35] and Elwakil [36] ones, using an Euler numerical approximation. An implementation and optimization of this Lorenz system are given in [37], which again uses an Euler approximation, with less area but the same throughput range. Finally, the authors of [38] implemented the so-called Oscillator Frequency Dependent Negative Resistors (OFDNR) [39], also using the Euler approximation.
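To illustrate the Euler discretization used by these differential chaotic PRNGs, the following Python sketch integrates the classical Lorenz system (not the hyperchaotic variant of [34]) in floating point. The parameters $\sigma = 10$, $\rho = 28$, $\beta = 8/3$, the step $h = 0.01$, and the initial point $(0, 1, 1.05)$ are assumptions for the example; a hardware version would use a fixed-point datapath and extract bits from the state.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def lorenz_euler_step(p: Point, h: float = 0.01, sigma: float = 10.0,
                      rho: float = 28.0, beta: float = 8.0 / 3.0) -> Point:
    """One explicit Euler step x <- x + h * F(x) of the Lorenz system."""
    x, y, z = p
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + h * dx, y + h * dy, z + h * dz)


def trajectory(p0: Point, steps: int, h: float = 0.01) -> List[Point]:
    """Return p0 followed by `steps` Euler iterates."""
    points = [p0]
    for _ in range(steps):
        points.append(lorenz_euler_step(points[-1], h))
    return points


if __name__ == "__main__":
    a = trajectory((0.0, 1.0, 1.05), 5000)
    b = trajectory((0.0, 1.0, 1.05 + 1e-9), 5000)   # tiny perturbation of the seed
    print(max(abs(pa[0] - pb[0]) for pa, pb in zip(a, b)))
```

The two runs diverge to the size of the attractor despite a $10^{-9}$ difference in the seed, which is the sensitivity to initial conditions these generators rely on.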
2) Chaotic Mapping PRNG: Two chaotic maps are generally considered: the logistic map [18] and the Hénon map [40]. In [41], the authors use the MATLAB DSP System Toolbox to implement the logistic map with various word lengths, from 16 to 64 bits, where resource usage depends on the precision (from 24 to 53 bits). The authors of [38] then compare these logistic map results [41] with Hénon map ones. Additionally, the same authors propose two optimized versions of the chaotic logistic map in [42], in which they pipeline and synchronize the multiplication operations, adding delays to each stage to ensure a parallel execution of sequences. Finally, in [43], four different chaotic maps are implemented on FPGA, namely the Bernoulli, Chebyshev [44], Tent, and Cubic maps. The implementation is done with and without the FPGA's DSP blocks for the multiplication operations.
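The gap between truly chaotic and pseudo-chaotic generators can be seen with a fixed-point logistic map $x_{t+1} = r\,x_t(1 - x_t)$. In the Python sketch below, the $n$-bit state format, the value $r = 3.99$, and the truncating shift are assumptions standing in for a hardware datapath, not the designs of [41]-[43]. Since there are at most $2^n$ states, any such map must eventually enter a cycle, which the helper detects.

```python
def logistic_q(x: int, r_q: int, n: int) -> int:
    """One step of x <- r x (1 - x) in fixed point: x stands for x / 2^n in [0, 1)
    and r for r_q / 2^n. The final right shift truncates, like a finite datapath."""
    one = 1 << n
    return (r_q * x * (one - x)) >> (2 * n)


def transient_and_cycle(x0: int, r_q: int, n: int):
    """Iterate until a state repeats; return (transient length, cycle length)."""
    first_seen = {}
    x, t = x0, 0
    while x not in first_seen:
        first_seen[x] = t
        x = logistic_q(x, r_q, n)
        t += 1
    return first_seen[x], t - first_seen[x]


if __name__ == "__main__":
    n = 16
    r_q = int(3.99 * (1 << n))   # r close to 4, in the chaotic regime of the real-valued map
    print(transient_and_cycle(12345, r_q, n))
```

Increasing $n$ lengthens the transient and the cycle but never removes them, which is why finite-precision implementations are pseudo-chaotic rather than chaotic.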
3) Chaotic Based Timing Reseeding (CTR): This concept [19] was first implemented on FPGA in [45]. Instead of reinitializing the chaotic PRNG with a new seed, the seed is selected by masking the current state $x_{t+1}$ at a specific time. In [45], the arithmetic operators, such as multiplication, are optimized with a Carry Lookahead Adder, while the authors of [46] mix the output of the PRNG with an auxiliary generator $y_{t+1}$ to improve the statistical test results.
III. QUANTIFYING HARDWARE PERFORMANCE OF PRNGS
A. Methodology
The hardware PRNGs presented above are evaluated in terms of randomness, which can be done using statistical tests. The objective of such tests is to determine whether the output of a given RNG can be distinguished from a truly random sequence obtained, for instance, by rolling a die. Such tests are usually grouped into batteries, such as NIST [47] and TestU01 [48].
More precisely, the US National Institute of Standards and Technology has its own battery, called NIST SP 800-22 [47]. It consists of 15 different statistical tests. The binary sequence to evaluate must have a fixed length $N$ such that $10^3 < N < 10^7$. Then, for each statistical test, a set of $s$ sequences is produced by the RNG under consideration, and p-values are obtained. They all need to be larger than 0.0001 for the associated sequences to be reasonably considered as uniformly distributed and secure according to NIST.
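As an example of how a single NIST p-value is obtained, the following Python sketch implements the first test of the suite, the Frequency (Monobit) test, which checks the balance between zeros and ones: $S_n = \sum_i (2\varepsilon_i - 1)$, $s_{obs} = |S_n|/\sqrt{n}$, and $p = \mathrm{erfc}(s_{obs}/\sqrt{2})$. The other 14 tests and the second-level analysis of the p-value distribution are not reproduced here.

```python
import math
from typing import Sequence


def monobit_p_value(bits: Sequence[int]) -> float:
    """NIST SP 800-22 Frequency (Monobit) test on a sequence of 0/1 bits."""
    n = len(bits)
    s_n = sum(2 * b - 1 for b in bits)        # map 0 -> -1, 1 -> +1 and accumulate
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


if __name__ == "__main__":
    print(monobit_p_value([1, 0, 1, 1, 0, 1, 0, 1, 0, 1]))
```

With the 10-bit sequence 1011010101 (the worked example of the NIST documentation), the function returns about 0.527, well above the 0.01 decision threshold used in the NIST examples, whereas a constant sequence of 100 ones is rejected.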
TestU01, for its part, is currently the most complete and stringent battery of tests for RNGs [48]; it groups more than 516 tests into 7 sub-batteries. In this section, we focus on three major sub-batteries dedicated to PRNGs, which together compute 319 test statistics: SmallCrush, Crush, and BigCrush. BigCrush is the most difficult sub-battery in TestU01. It uses approximately $2^{38}$ pseudorandom numbers and applies 106 statistical tests that compute 160 p-values, each of which must belong to $[0.001, 0.999]$ for the considered test to be passed.
All the aforementioned linear PRNGs have been implemented on FPGA using the Zybo board and Xilinx Vivado tools. The design methodology relies on two implementation flows, namely the traditional Register-Transfer Level (RTL) flow and High-Level Synthesis (HLS) [49]. Our experiments show that almost all PRNGs pass the NIST battery, but only the PCG32, MRG32, and XOR64* generators pass BigCrush, the most stringent part of TestU01, which is consistent with the literature. The test results also show that a particular and common test, namely the linear complexity test, is very frequently failed. This behavior is explained in the following subsections.
The following subsections focus on four criteria, namely the linear complexity, the jump complexity [50], the arithmetic operators, and the throughput. Note that the linear and jump complexities are only studied for linear PRNGs, as (1) chaotic PRNGs are not linear, and (2) all these chaotic generators successfully pass the NIST battery, which embeds these two complexity tests. Concerning chaotic PRNGs, we further remark that the literature currently studies them only from the hardware optimization point of view: the novelty of the chaotic PRNGs in Table II lies solely in this optimization, and no deeper theoretical study is performed on them. Additionally, they would need to be combined with physical sources to pass the TestU01 battery, which, stricto sensu, turns them into TRNGs. Such TRNGs then become too slow to be evaluated with such a stringent battery [11].
B. Linear Complexity
For a given $k$-length finite binary sequence in $\mathbb{F}_2^k$ produced by an RNG, its linear complexity $L_k$ is defined as the length of the shortest LFSR that can generate this sequence (i.e., the degree of its characteristic polynomial). Intuitively, a small $L_k$ means that the sequence is easily reproduced by a short LFSR, whereas a random sequence is expected to have $L_k$ close to $k/2$. Fig. 1 presents the linear complexity profiles of some PRNGs obtained with the Berlekamp-Massey algorithm. PCG32 and XOR64*, which pass the whole TestU01, exhibit the expected linear complexity profile. Conversely, other PRNGs like XOR64, WELL512, TT800, and LUT-SR fail to exhibit such a profile.
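The Berlekamp-Massey algorithm mentioned above can be sketched in Python as follows. It processes the bits one at a time, maintains the shortest connection polynomial $C(x)$ generating the prefix seen so far, and records $L_i$ after each bit, which yields a linear complexity profile such as those of Fig. 1. This is a plain textbook version over $\mathbb{F}_2$, not the TestU01 implementation.

```python
from typing import List


def linear_complexity_profile(bits: List[int]) -> List[int]:
    """Berlekamp-Massey over F_2: return [L_1, ..., L_n], where L_i is the
    linear complexity of the prefix bits[0:i]."""
    n = len(bits)
    c = [0] * (n + 1)   # current connection polynomial C(x)
    b = [0] * (n + 1)   # copy of C(x) before the last length change
    c[0] = b[0] = 1
    L, m = 0, -1        # current complexity, index of the last length change
    profile = []
    for i in range(n):
        # Discrepancy: does C(x) predict bits[i] from the previous L bits?
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]      # C(x) <- C(x) + x^(i-m) B(x)
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
        profile.append(L)
    return profile
```

A sequence produced by a degree-4 LFSR saturates at $L = 4$ after 8 bits, whereas a random sequence keeps climbing along the $k/2$ line.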
At this point, we can wonder whether there is any relation between linear complexity and other parameters, such as the space (resources) used on the FPGA. To answer this question, we use another parameter computed by TestU01, related to the jumps of the linear complexity profile (jump complexity).
C. Jump Complexity
The TestU01 battery additionally counts the number of jumps that occur in the linear complexity profile of each local subsequence. This number of jumps represents how many bits must be added to the sequence to increase its linear complexity. It has been proven [50] that an ideal PRNG must exhibit jumps symmetric about the $k/2$ line, as in a perfect linear complexity profile, with maximum jump heights of $k/4$ and a linear complexity close to $(k+1)/2$ for each $k$-length sequence.
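Given a linear complexity profile $L_1, \ldots, L_n$ (for instance, the output of a Berlekamp-Massey implementation), the jumps discussed in this subsection can be extracted with a few lines of Python. The convention $L_0 = 0$ and the notion of a stable step ($L_k = L_{k-1}$) are assumptions made for this sketch; the criteria used in Fig. 3 to call a jump perfect are not reproduced.

```python
from typing import List, Tuple


def jumps(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (k, height) for every increase of a linear complexity profile
    L_1, ..., L_n, where k is the 1-based prefix length and height = L_k - L_{k-1}."""
    result = []
    previous = 0  # L_0 = 0 by convention (empty prefix)
    for k, level in enumerate(profile, start=1):
        if level != previous:
            result.append((k, level - previous))
            previous = level
    return result


def stable_steps(profile: List[int]) -> int:
    """Number of prefix lengths k at which the profile does not move (L_k == L_{k-1})."""
    levels = [0] + list(profile)
    return sum(1 for a, b in zip(levels, levels[1:]) if a == b)
```

Every position is either a jump or a stable step, so long runs of stable steps translate directly into few jumps, i.e., into the weak profiles discussed below.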
Let us first illustrate some of these properties using Fig. 2. We compute the linear complexity profiles of the first 32 bits ($k = 32$) of generators LFSR258, XOR64*, and PCG32 using the Berlekamp-Massey algorithm, where the complexity level
…Each of these PRNGs performs jumps symmetric about the $k/2$ line, as illustrated in Fig. 2. Let us however explain some differences between these jumps. We first notice that $L_k(x_0, x_1, x_2, x_3)$ is stable for LFSR258 and XOR64*. When we add $x_4$ to compute $L_k(x_0, x_1, x_2, x_3, x_4)$, LFSR258 jumps from 1 to 4, whereas XOR64* remains stable. PCG32, for its part, is stable for fewer bits and jumps by 2 levels in the same interval, where the first jump happens at $x_7$, and with more than 8 levels for XOR64*.
Let us consider a stream of random bits $x = x_0, x_1, \ldots, x_n$, in which a perfect jump is the difference between two successive linear complexity levels $L_k$ applied to $x$ and that satisfy 0
Regarding FPGAs, these jumps determine how many resources are required to obtain a perfect complexity profile. For illustration purposes, some of these PRNG jumps are shown in Fig. 3, starting from the linear complexity profiles $L_k$ illustrated in Fig. 2. More precisely, we computed the jump complexity over 200 linear complexity degrees $L_k(x)$ ($k = 200$ bits, i.e., about six 32-bit words), on the one hand for XOR64* and PCG32, which pass TestU01, and on the other hand for XOR32, TT800, and LUT-SR, which fail this battery.
Let us take XOR64* and LUT-SR as representatives of each category in Fig. 3. The aforementioned 200 linear complexity levels show that XOR64* needs a minimum jump of 52 bits to follow the symmetric $k/2$ line (maximum jump heights of $k/4$). However, only 38 jumps are perfect (height < 2), and $L_k(x)$ can be repeated between jumps. In addition, we consider stable situations where no jump occurs (runs of repeated $L(x) = L(x-1)$), an unstable jump being repeated only once. We thus conclude that the useful bits are the minimal unique bits, which do not present any form of stability in the complexity profile $L_k$.
We can see that LUT-SR has 4 fewer perfect jumps in total than XOR64*. This deficit propagates over a long period of time, which leads to a smaller contribution of useful bits for passing linear tests. This is even more obvious for XOR32, which confirms the need for another process to address this issue. Indeed, PRNGs that fail TestU01 have the lowest numbers of useful bits and perfect jumps compared to successful ones. Note that XOR64* uses a multiplication as a kind of output scrambling. PCG32 uses multiplication in a similar way, so why does it have fewer useful bits at the end while passing the linearity test? To answer this question, we can focus on Fig. 1, which illustrates the existence of stability in linear complexity starting from shorter periods of time.
Some periods can be long, as in the case of PCG32, for instance. When the PRNGs are running, the state spaces used are constant for any operation. This property is obvious in 32-bit LCG generators like PCG32. The latter performs 64-bit multiplications on a 64-bit state, but a 36-bit state is required to pass TestU01 with a 32-bit output. This means a loss of information that can create a new jump in complexity. This is why PCG32 applies a permutation function to scramble the weak least significant bits (LSBs) after the multiplication.
Let us now consider the XOR64* generators, which also use 64-bit multiplications. Their linear complexity is close to the perfect one. The key difference lies in how the multiplication is used. In the LCG family, it is the main transition function, and a permutation is applied afterwards to perform a uniform scrambling, whereas in XOR64*, the multiplication is deployed as an output scrambler on top of a linear generator.
On the other hand, we can notice the uniform distribution of the Mersenne Twister, with a unique maximum perfect jump. However, it has the longest stable runs, which eventually remain stable once and for all. This indicates the limitations of the tempering unit (similar to XOR32 or an LFSR) compared with the transition unit.
At this point, it is worth mentioning that most of the chaotic PRNGs reviewed in this paper do not answer the question raised in the introduction, namely: how can we inject disorder into a deterministic digital system while respecting, on such finite state machines, the mathematical definitions of chaos given by Devaney [8] and Li-Yorke [9]? Moreover, passing the NIST battery, which is the most usual way to evaluate a PRNG, is not a sufficient criterion: some generators of poor quality successfully pass these tests.
To solve this issue, we will see in Section VI-B the usefulness of chaotic iterations as a post-processing replacing this tempering unit (see CIPRNG-MCMD).
D. Experimental Results
The aforementioned PRNGs have been implemented on our Zynq platform (Fig. 4). Both categories (linear and chaotic PRNGs) are analyzed in terms of hardware resources and throughput/latency. The analysis is based on Xilinx Vivado v16.3 tools with the default configuration and without any optimization. The FPGA target was the Zybo Zynq-7000 ARM/FPGA SoC Trainer Board from Digilent (125 MHz).
1) Hardware Resources:
The size and performance of a PRNG depend on both the word length (addressing a LUT makes the table grow exponentially) and its binary representation, in terms of dynamic range (DR) and precision ($DR_{fxpt} = r^n - 1$, where $r = 2$ is the radix of the binary format and $n$ is the number of digits of the fixed-point precision). The PRNGs considered in this section have a fixed DR and an internal state of 32 or 64 bits.
Reviewing Table I, LUT-SR, Taus88, and XOR64 require the lowest amount of area. Conversely, combined PRNGs like KISS124 and MRG32, as well as the LCG and TGFSR families, have a large area consumption due to their implementation of arithmetic multiplication with complex logic such as DSP blocks. Let us take for instance KISS127, with $DR = 2^{64}$, which is implemented either with DA (KISS-DA) or with DSP blocks (KISS-DSP). Table I clearly shows that disabling DSP blocks induces a huge area increase and a drop in frequency, while the latency stays the same (even though DSP blocks can be a convenient alternative and an additional resource for ASIC applications). To sum up, chaotic PRNGs use approximately the same hardware resources as linear PRNGs (logistic map implementations being the smallest among them).
2) Throughput and Latency: Let us recall two properties based on the frequency, namely latency and throughput. Latency is the number of clock cycles required to compute a new output from a given input. Throughput, for its part, is expressed as the number of clock cycles needed to produce a new output or to consume a new input. Note that this throughput delay can be equal to the latency, which leads us to use the throughput/latency value to estimate the real throughput of the PRNG. On the other hand, the HLS flow automatically schedules the algorithms cycle by cycle as a finite state machine. Therefore, the synthesis tool adds one cycle to process the input and a second one to generate the final output.
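The relation between clock frequency, output width, and the number of cycles per output can be made explicit with the following Python helpers. The function and parameter names are hypothetical; `interval_cycles` plays the role of the HLS initiation interval (cycles between two consecutive outputs) and `latency_cycles` the delay before the first output.

```python
def throughput_bps(f_clk_hz: float, output_bits: int, interval_cycles: int = 1) -> float:
    """Sustained throughput when one output_bits-wide word is produced every
    interval_cycles clock cycles (interval_cycles = 1 means fully pipelined)."""
    return f_clk_hz * output_bits / interval_cycles


def first_output_delay_s(f_clk_hz: float, latency_cycles: int) -> float:
    """Time between feeding an input (e.g., a seed) and obtaining the first output."""
    return latency_cycles / f_clk_hz


if __name__ == "__main__":
    # Hypothetical 32-bit generator clocked at 125 MHz, one word per cycle
    print(throughput_bps(125e6, 32) / 1e9, "Gbit/s")
```

For instance, a 32-bit generator producing one word per cycle at the 125 MHz clock of the Zybo board would deliver 4 Gbit/s, and needing two cycles per word halves this figure, while a larger latency only delays the first output.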
The latency and throughput results of the RTL and HLS flows can be summarized as follows. On the one hand, for 32-bit linear generators (resp. 64-bit ones), Taus88 and LUT-SR with LFSR113 (resp. XOR64 and LFSR258) have the largest throughput, while for chaotic PRNGs, Bernoulli [43] and the logistic map using DSP [42], together with the CPRNG based on timing reseeding [45], have the largest throughput. On the other hand, two implementations of the Mersenne Twister generator have been designed, with and without the seeding logic, respectively denoted as MT_WS and MT_NS. We have observed that, when the seeding is included as a function, the frequency drops below 200 MHz compared to the case where it is not present. Therefore, to increase performance, most PRNGs do not include the seeding internally (it is performed in software instead).
In a nutshell, if we take the area/throughput ratio as the main criterion, we must balance between high performance, as for XOR64 and LFSR113 among linear PRNGs (resp., Bernoulli [43] and the logistic map [42] among chaotic PRNGs), and the ability to pass statistical tests (PCG32 and XOR64*), which is not surprising. Another result is that combining PRNGs decreases performance at the hardware level. Such combinations do not take into account the Chaotic Iterations post-processing, which appears promising [21]. Indeed, chaotic PRNGs outperform linear ones in terms of throughput, but they are not able to pass the TestU01 statistical battery (see Table II). This shortcoming is the motivation for this contribution: we propose a hardware chaos-based post-processing module to improve the randomness of linear PRNGs. By doing so, and unlike the other chaotic PRNGs, these post-processed generators behave chaotically while passing the TestU01 battery. The effects of such a post-processing on hardware performance are detailed in the following sections.
IV. CHAOTIC ITERATIONS: THE THEORY
The generators (or, more exactly, the post-treatment over existing generators) we propose in this document are theoretically formalized by the so-called chaotic iterations (CIs), and their performances are directly related to the topological properties of these CIs. The latter are investigated in this section, while the relations with our generators are detailed in the next one.
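Let $f$ be a map from $\mathbb{B}^N$ to itself, and let us introduce the following functions:
$$F_f : \mathcal{P}(\llbracket 1; N \rrbracket) \times \mathbb{B}^N \to \mathbb{B}^N, \quad (P, x) \mapsto \big(f(x)_j\, \delta(j, P) + x_j\, (1 - \delta(j, P))\big)_{j \in \llbracket 1; N \rrbracket},$$
$$\sigma : \mathcal{P}(\llbracket 1; N \rrbracket)^{\mathbb{N}} \to \mathcal{P}(\llbracket 1; N \rrbracket)^{\mathbb{N}}, \quad (S^t)_{t \in \mathbb{N}} \mapsto (S^{t+1})_{t \in \mathbb{N}},$$
$$i : \mathcal{P}(\llbracket 1; N \rrbracket)^{\mathbb{N}} \to \mathcal{P}(\llbracket 1; N \rrbracket), \quad (S^t)_{t \in \mathbb{N}} \mapsto S^0,$$
where $\mathbb{B} = \{0, 1\}$, $\mathcal{P}(X)$ is the set of subsets of $X$, $X^{\mathbb{N}}$ is the set of sequences whose elements belong to $X$, and $\delta(j, P) = 1$ if $j \in P$, else $\delta(j, P) = 0$. For $N \in \mathbb{N}^*$, let $\mathcal{X} = \mathcal{P}(\llbracket 1; N \rrbracket)^{\mathbb{N}} \times \mathbb{B}^N$, with the distance between two points $X = (S, E)$ and $Y = (\check{S}, \check{E})$ defined as follows:
$$d(X, Y) = d_e(E, \check{E}) + d_s(S, \check{S}),$$
where
$$d_e(E, \check{E}) = \sum_{k=1}^{N} \big|E_k - \check{E}_k\big| \quad \text{and} \quad d_s(S, \check{S}) = \frac{9}{N} \sum_{k=0}^{\infty} \frac{|S^k \,\Delta\, \check{S}^k|}{10^{k+1}},$$
where $|X|$ is the cardinality of a set $X$ and $A \,\Delta\, B$ is the symmetric difference, defined for sets $A, B$ as $A \,\Delta\, B = (A \setminus B) \cup (B \setminus A)$. Chaotic iterations are defined by the following discrete dynamical system [51]:
$$X^0 \in \mathcal{X}, \qquad X^{t+1} = G_f(X^t), \quad \text{with} \quad G_f(S, E) = \big(\sigma(S), F_f(i(S), E)\big).$$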
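The maps $F_f$ and $G_f$ translate almost literally into code. The Python sketch below works on a finite prefix of the strategy $S$ (an assumption, since $S$ is infinite in the theory), uses 1-based component indices as in $\llbracket 1; N \rrbracket$, truncates the sequence part of the distance to the common prefix length, and uses the vectorial negation as an example iteration function. All function names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

Vector = Tuple[int, ...]              # Boolean vector E in B^N
Strategy = List[FrozenSet[int]]       # finite prefix (S^0, S^1, ...) of subsets of [1, N]
Point = Tuple[Strategy, Vector]


def F(f: Callable[[Vector], Vector], P: FrozenSet[int], x: Vector) -> Vector:
    """F_f(P, x): components j in P take the value f(x)_j, the others keep x_j."""
    fx = f(x)
    return tuple(fx[j - 1] if j in P else x[j - 1] for j in range(1, len(x) + 1))


def G(f: Callable[[Vector], Vector], point: Point) -> Point:
    """G_f(S, E): shift the strategy and update E with its first term S^0."""
    S, E = point
    return S[1:], F(f, S[0], E)


def negation(x: Vector) -> Vector:
    """Vectorial negation, used here as the iteration function f."""
    return tuple(1 - b for b in x)


def distance(X: Point, Y: Point, N: int) -> float:
    """d = d_e + d_s, with d_s truncated to the common length of the two prefixes."""
    (S, E), (S2, E2) = X, Y
    d_e = sum(a != b for a, b in zip(E, E2))                      # Hamming distance
    d_s = 9 / N * sum(len(a ^ b) / 10 ** (k + 1) for k, (a, b) in enumerate(zip(S, S2)))
    return d_e + d_s


def run(f: Callable[[Vector], Vector], point: Point) -> List[Vector]:
    """Apply G_f until the finite strategy prefix is exhausted; return the successive E's."""
    outputs = []
    while point[0]:
        point = G(f, point)
        outputs.append(point[1])
    return outputs
```

Two points with the same Boolean vector and strategies that differ only far away are very close for $d$, which is what makes small perturbations of the strategy relevant to the topological arguments that follow.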
The asynchronous iteration graph associated with $f$ is the directed graph $\Gamma(f)$ defined as follows: the set of vertices is $\mathbb{B}^N$; for all $x \in \mathbb{B}^N$ and $i \subseteq \llbracket 1; N \rrbracket$, the graph $\Gamma(f)$ contains an arc from $x$ to $F_f(i, x)$ labeled by the subset $i$. We have previously established that [52], [53]:
Proposition 1: If $\Gamma(f)$ is strongly connected, then $G_f$ is strongly transitive: for any couple $(x, y) \in \mathcal{X}^2$ and for every neighborhood $V$ of $x$, there exist $z \in V$ and $n \in \mathbb{N}$ such that $G_f^n(z) = y$.
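Thus, $G_f$ is chaotic according to Devaney [8], i.e., it is
1) Transitive: for each couple of nonempty open sets $A, B \subset \mathcal{X}$, there exists $k \in \mathbb{N}$ such that $G_f^{k}(A) \cap B \neq \emptyset$.
2) Regular: periodic points are dense in $\mathcal{X}$.
3) Sensitive to the initial conditions: there exists $\delta > 0$ such that, for any $x \in \mathcal{X}$ and any neighborhood $V$ of $x$, there are $y \in V$ and $n \in \mathbb{N}$ with $d\big(G_f^n(x), G_f^n(y)\big) > \delta$.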
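For small $N$, the hypothesis of Proposition 1 can be checked by brute force. The following Python sketch builds $\Gamma(f)$ by enumerating all $2^N$ vertices and all $2^N$ labels $i$, then tests strong connectivity with a forward and a backward graph search from one vertex. It is only practical for small $N$, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations, product
from typing import Callable, Dict, FrozenSet, Set, Tuple

Vector = Tuple[int, ...]


def F(f: Callable[[Vector], Vector], P: FrozenSet[int], x: Vector) -> Vector:
    """F_f(P, x) with 1-based component indices in P."""
    fx = f(x)
    return tuple(fx[j] if (j + 1) in P else x[j] for j in range(len(x)))


def iteration_graph(f: Callable[[Vector], Vector], N: int) -> Dict[Vector, Set[Vector]]:
    """Gamma(f): vertices B^N, one arc x -> F_f(i, x) for every subset i of [1, N]."""
    subsets = [frozenset(c) for r in range(N + 1) for c in combinations(range(1, N + 1), r)]
    return {x: {F(f, P, x) for P in subsets} for x in product((0, 1), repeat=N)}


def reachable(adj: Dict[Vector, Set[Vector]], start: Vector) -> Set[Vector]:
    """Breadth-first search: all vertices reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def strongly_connected(adj: Dict[Vector, Set[Vector]]) -> bool:
    """True iff one vertex reaches every vertex and is reached from every vertex."""
    start = next(iter(adj))
    reverse: Dict[Vector, Set[Vector]] = {v: set() for v in adj}
    for v, successors in adj.items():
        for w in successors:
            reverse[w].add(v)
    return len(reachable(adj, start)) == len(adj) == len(reachable(reverse, start))


def negation(x: Vector) -> Vector:
    return tuple(1 - b for b in x)
```

With the vectorial negation, every $y$ is reached from $x$ in one step by choosing $i = \{j : x_j \neq y_j\}$, so the graph is strongly connected; with the identity map, every arc is a self-loop and the test fails.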
We start to further investigate the disordered behavior of chaotic iterations, on which our generator is based, with the following result: […] the point $y^{(0)} = ((\{1\}, \{1\}, \{1\}, \ldots), (0, \ldots, 0))$ of $\mathcal{X}$. As the iteration graph is strongly connected, $G_f$ is strongly transitive. Hence, there exist a point $x^{(0)}$ in the neighborhood $B(x, r)$ of $x$ and an integer $n^{(0)}$ such that $G_f^{n^{(0)}}(x^{(0)}) = y^{(0)}$. This point $x^{(0)}$ is necessarily of the following form:
• Being inside $B(x, r)$ with $r < 1$, its second coordinate (the Boolean vector) must be the same as that of $x$, due to the Hamming distance in $d$. In other words, $x^{(0)}_2 = x_2$.
• Let $n_0 = -\log_{10}(r)$. Given the definition of $d$ and as $x^{(0)} \in B(x, r)$, the $n_0$ first terms of the sequence $x_1$ are necessarily equal to the $n_0$ first terms of the sequence $x^{(0)}_1$.
• As $G_f^{n^{(0)}}(x^{(0)}) = y^{(0)}$, after $n^{(0)}$ shifts of the sequence of $x^{(0)}$, we necessarily obtain the sequence $(\{1\}, \{1\}, \{1\}, \ldots)$ of $y^{(0)}$.
• Finally, the terms of the sequence of $x^{(0)}$ between positions $n_0 + 1$ and $n^{(0)}$ are the ones required for $f$ to transform the Boolean vector of $x^{(0)}$ into that of $y^{(0)}$ (this is the path to follow in $\Gamma(f)$ to reach the Boolean vector of $y^{(0)}$).
We now proceed similarly for points having $(0, \ldots, 0, 1)$ as Boolean vector. Consider now the point $y^{(1)} = ((\{1\}, \{1\}, \{1\}, \ldots), (0, \ldots, 0, 1)) \in \mathcal{X}$. For the same reasons as previously, there exist a point $x^{(1)}$ of $B(x, r)$ and an integer $n^{(1)}$ such that $G_f^{n^{(1)}}(x^{(1)}) = y^{(1)}$. Let us now consider a point $Y^{(1)}$ of the form:
• its Boolean vector $Y^{(1)}_2$ is equal to $y$ […];
• its sequence $Y^{(1)}_1$ is of any kind. Then the point $X^{(1)}$ defined by:
is such that $X^{(1)} \in B(x, r)$ and $G_f^{n^{(1)}}(X^{(1)})$ […] $\{1\}, \{1\}, \ldots), (1, \ldots, 1, 1))$, which leads to the definition of $n^{(2^N-1)}$, of $x^{(2^N-1)}$, of $Y^{(2^N-1)}$, and finally of $X^{(2^N-1)}$.
At this stage, we can claim that, for all $y \in \mathcal{X}$, it is possible to find $x' \in B(x, r)$ and an integer $N' \in \{n^{(0)}, \ldots, n^{(2^N-1)}\}$ such that $G_f^{N'}(x') = y$. The last issue to solve is that the iteration number $N'$ depends on the Boolean vector $y_2$, which should not be the case.
Let us consider
, in such a way that:
which amounts to staying on a treadmill once the target $Y^{(k)}$ is reached, until $N_0$ iterations have been performed. Thanks to this, for all $y \in \mathcal{X}$, it is possible to find $x' \in B(x, r)$ such that $G_f^{N_0}(x') = y$, which is the expected result. Therefore, however small the starting open ball, iterating $G_f$ eventually reaches the whole space $\mathcal{X}$. Using this result, we can deduce the following proposition related to chaos.
Proposition 3: General chaotic iterations $G_f$ are topologically mixing: for every couple of nonempty open sets $U$ and $V$, there is $n' \in \mathbb{N}$ such that, for all $n \geq n'$, $G_f^n(U) \cap V \neq \emptyset$.
The point $X^{(1)}$ defined by:
• $X^{(1)}_{1,k} = X^{(0)}_{1,k}$ for all $k \leq n_0$: the two sequences start with the same terms;
• $X^{(1)}_{1,n_0+1} = \emptyset$: we insert an empty set at position $n_0 + 1$ in the sequence of $X^{(0)}$
Similarly, by inserting $l$ empty sets between positions $n_0 + 1$ and $n_0 + l$ in the sequence of $X^{(0)}$, we are able to define a point $X^{(l)}$ such that $G_f$ […]
This inequality being valid for all $l > 0$, we can deduce the topological mixing of $G_f$.
Proposition 4: When considering the vectorial negation for $f$, the general chaotic iterations satisfy Knudsen's definition of chaos [54]: they are sensitive to the initial conditions and they have a dense orbit.
Proof: The sensitivity of $G_f$ to the initial conditions has already been established in [52]. We are then left to construct a point $x \in \mathcal{X}$ such that the set $\{G_f^n(x) \mid n \in \mathbb{N}\}$ is dense in $\mathcal{X}$: the iterates $G_f^n(x)$ must come arbitrarily close to any point $y \in \mathcal{X}$.
Let us denote by $s_0, s_1, \ldots, s_{2^N-1}$ the list of all subsets of $\llbracket 1, N \rrbracket$: $s_0 = \emptyset$, $s_1 = \{N\}$, $s_2 = \{N-1\}$, $s_3 = \{N-1, N\}$, ..., $s_{2^N-1} = \{1, 2, \ldots, N\}$. Let us now consider a point $y \in \mathcal{X}$. Its Boolean vector $y_2$ can be associated with a given $s_k$, namely the subset of $\llbracket 1, N \rrbracket$ that contains the coordinates of the 1's in $y_2$. The first term $y_{1,0}$ of the sequence $y_1$, for its part, is a given $s_{k'}$, while the second term $y_{1,1}$ is also a given $s_{k''}$, with $k, k', k'' \in \llbracket 0, 2^N - 1 \rrbracket$.
Let us now remark that, when iterating G f on the point ((s k , s k , s k , s k , ...), (0, 0, . . . , 0) ), with f the vectorial negation:
• We start on the Boolean vector (0, 0, . . . , 0) , s k , s k , s k , ...), (0, 0, . . . , 0) ) will first move the system at a distance 10 −1 to y, and then come back to (0, 0, . . . , 0) after shifting 4 times the sequence.
Let us now consider the point:
By iterating $G_f$ on it, we will at some point be within $10^{-1}$ of any point of $\mathcal{X}$, while recovering the null Boolean vector every 4 iterates. Continuing the process with patterns of length 6, 8, 10, etc., defines a unique point $x$ whose iterates come arbitrarily close to any point of $\mathcal{X}$, hence a dense orbit.
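The enumeration $s_0, \dots, s_{2^N-1}$ used in the proof above matches the binary encoding of the index $k$, with element $N$ carried by the least significant bit. The following minimal Python sketch assumes exactly that convention (inferred from $s_1 = \{N\}$, $s_2 = \{N-1\}$, $s_3 = \{N-1, N\}$); under it, the index of the subset associated with a Boolean vector is that vector read in base 2 with its first coordinate as the most significant bit. The helper names `subset_of_index` and `index_of_subset` are hypothetical.

```python
def subset_of_index(k: int, n: int) -> set[int]:
    """Return s_k, the subset of {1, ..., n} encoded by the n-bit integer k.

    Assumed convention (inferred from s_1 = {N}, s_2 = {N-1}, s_3 = {N-1, N}):
    bit j of k, counted from the least significant bit, stands for element n - j.
    """
    if not 0 <= k < 2 ** n:
        raise ValueError("k must lie in [0, 2^n - 1]")
    return {n - j for j in range(n) if (k >> j) & 1}


def index_of_subset(s: set[int], n: int) -> int:
    """Inverse mapping: encode a subset of {1, ..., n} as the integer k."""
    return sum(1 << (n - i) for i in s)


# Enumerate s_0, ..., s_{2^N - 1} for N = 3.
N = 3
for k in range(2 ** N):
    print(k, sorted(subset_of_index(k, N)))
```

For $N = 3$ this lists $\emptyset, \{3\}, \{2\}, \{2, 3\}, \{1\}, \dots, \{1, 2, 3\}$, in the order used above.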
V. CHAOTIC ITERATIONS AS PRNGS POST-PROCESSING
A. CIPRNG Multi-Cycle
As described in the previous section, general chaotic iterations receive an integer sequence as input (together with the first internal state, a binary vector) and produce a sequence of binary vectors. In other words, chaotic iterations translate one sequence into another, which is a way to obtain a new pseudorandom number generator from an existing one. Both the kind of input generator and the iteration function $f$ are parameters of this post-treatment, while the first vector $x^0$ and the first term $S^0$ act as seeds. Since these seeds are the initial condition of the discrete dynamical system of Eq. (8), choosing $f$ such that this dynamical system behaves chaotically seems relevant in a pseudorandom generation context. In other words, we expect the chaos brought by the iteration function to lead to a more disordered output $(x^t)_{t \in \mathbb{N}}$ than the input $(S^t)_{t \in \mathbb{N}}$. Even if there is, stricto sensu, no theoretical relation between randomness and chaos (similarly, there is none between security and chaos), numerous simulations have illustrated [53] that, due to chaos, the output sequence is in general more random than the input one, according to the number of statistical tests it passes.
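To make the post-treatment concrete, here is a minimal Python sketch of general chaotic iterations. It assumes the usual formulation in which, at iterate $t$, the components listed in $S^t$ take the values of $f(x^t)$ while the others are kept (Eq. (8) itself is not reproduced here). On a toy strategy, it also checks that, with $f$ the vectorial negation, each iterate reduces to a bitwise XOR with $S^t$ read as an $N$-bit mask. All names are illustrative.

```python
from typing import Callable, Iterable

Bits = tuple[int, ...]  # Boolean vector (x_1, ..., x_N), stored 0-indexed


def chaotic_iterations(f: Callable[[Bits], Bits], x0: Bits,
                       strategy: Iterable[set[int]]) -> list[Bits]:
    """States x^0, x^1, ... of general chaotic iterations.

    At step t, each component i (1-based) listed in S^t takes the value
    f(x^t)_i; all other components keep their previous value.
    """
    states = [x0]
    x = x0
    for s in strategy:
        fx = f(x)
        x = tuple(fx[i] if (i + 1) in s else x[i] for i in range(len(x)))
        states.append(x)
    return states


def vectorial_negation(x: Bits) -> Bits:
    return tuple(1 - b for b in x)


def to_int(x: Bits) -> int:
    """Read (x_1, ..., x_N) in base 2, x_1 being the most significant bit."""
    v = 0
    for b in x:
        v = (v << 1) | b
    return v


def mask(s: set[int], n: int) -> int:
    """Integer whose 1 bits mark the components listed in s (same bit order)."""
    return sum(1 << (n - i) for i in s)


N = 4
x0 = (0, 1, 1, 0)
S = [{1, 2, 3, 4}, {1, 3, 4}, {2, 3, 4}]
states = chaotic_iterations(vectorial_negation, x0, S)

# With f = vectorial negation, each step reduces to x^{t+1} = x^t XOR S^t.
v = to_int(x0)
for s, x_next in zip(S, states[1:]):
    v ^= mask(s, N)
    assert v == to_int(x_next)
print([to_int(x) for x in states])  # prints [6, 9, 2, 5]
```

Replacing `vectorial_negation` by another Boolean map changes the dynamics, but the XOR shortcut only holds for the negation.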
Such a chaotic-iteration-based post-treatment of existing PRNGs can be designed as follows. As we need to generate a sequence $(S^t)_{t \in \mathbb{N}}$ of subsets of $\llbracket 1, N \rrbracket$, we can consider two input generators, both producing numbers in $\llbracket 1, N \rrbracket$. The first generator provides, at each iterate $t$, the size of the subset $S^t$, while the second generator produces its content. This way of post-processing the input generators is what we call CIPRNG Multi-Cycle.
The basic design procedure of the latter is summarized in Algorithm 1. The internal state is $x$ and the output state is $r$. The internal values $a$ and $b$ are computed by the two input PRNGs, and the value $g_1(a)$ is an integer defined as in Eq. (9). In addition, a sequence $d_s = (d_1, d_2, \dots, d_N) \in \{0, 1\}^N$, called an irregular decimation, is applied to the second generator $b$; it ensures that the same bit is not switched twice in succession within a given iteration. The $i$-th value of $b$ is thus used, through the strategy, to update the state during the current iteration (of size $m_t$) if and only if $d_{b_i} = 1$; otherwise it is discarded. For instance, consider the input $x = \{x_1, x_2, x_3, x_4\}$, the subset sizes $m_t = \{4, 3, 4, 1\}$, and $b = \{2, 3, 1, 1, 4, 4, 3, 1, 2, 3, 2, 2, 4, 4, 1, 2\}$. The first value of $m_t$ means that 4 components of $b$ must be collected for the first iteration, then 3 and 4 components for the next ones. The decimation is then applied to $b$ so that the same component is not modified twice in a row within a given iteration, which leads to the strategy $S = \{\{2, 3, 1, 4\}, \{4, 3, 1\}, \{2, 3, 2, 4\}, \dots\}$ extracted from $b$. As can be seen, the duplicated entry 2, 2, 4, 4 has been decimated to 2, 4, whereas the first pair 4, 4 has not, since according to $m_0$ this duplication falls between two iterates. This constraint explains the general form of $m_t$ given in Eq. (9).
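The irregular decimation in the example above can be sketched in Python as follows. This is an illustration under stated assumptions, not Algorithm 1: it reads the decimation rule as discarding a value of $b$ that repeats the value just retained within the same iteration (the reading that reproduces the strategy $S$ given above), it takes the subset sizes $m_t$ as given instead of computing them through $g_1(a)$ and Eq. (9), and the name `decimate` is hypothetical.

```python
def decimate(b: list[int], m: list[int]) -> list[list[int]]:
    """Build the strategy S from the stream b and the subset sizes m.

    Assumed rule: within one iteration, a value equal to the value just
    retained is skipped; a repeat that straddles two iterations is kept.
    """
    strategy = []
    pos = 0
    for size in m:
        group: list[int] = []
        while len(group) < size:
            if pos >= len(b):
                raise ValueError("stream b exhausted before the strategy was complete")
            v = b[pos]
            pos += 1
            if group and group[-1] == v:
                continue  # same bit twice in a row within one iteration: decimated
            group.append(v)
        strategy.append(group)
    return strategy


b = [2, 3, 1, 1, 4, 4, 3, 1, 2, 3, 2, 2, 4, 4, 1, 2]
print(decimate(b, [4, 3, 4, 1]))
# prints [[2, 3, 1, 4], [4, 3, 1], [2, 3, 2, 4], [4]]
```

Building subsets of sizes 4, 3, 4, and 1 consumes 5, 3, 5, and 1 values of $b$, which illustrates why the second generator usually has to be iterated more often than $|S^t|$.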
Algorithm 1
…
m ← g(a)
b ← PRNG2() mod N
…

Such a CIPRNG-MC, which is a sub-category of our CIPRNG post-treatment, can be summarized as follows [52]. $x^0$ is the initial Boolean vector of size $N$, and $(S^t)_{t \in \mathbb{N}}$ is the sequence resulting from the irregular decimation of $(m, b)$, as described previously. We suppose that this sequence produces numbers belonging to $\llbracket 0, 2^N - 1 \rrbracket$. Operating $G_f$ with the vectorial negation on such sequences can be directly rewritten as follows [52]:

$$x^{t+1} = x^{t} \otimes S^{t}, \quad t \in \mathbb{N},$$
where $S^t$ is expressed in the base-2 numeral system as a binary vector of size $N$, and $\otimes$ is the bitwise XOR operation on binary vectors. In other words, CIPRNG-MC is equivalent to chaotic iterations with the vectorial negation, driven by the strategy $S$ obtained by decimating the two input generators $m$ and $b$. Note that, most of the time, the second generator must be iterated more times than the cardinality of $S^t$, since the same number can be drawn twice. This weakness of the decimation process motivated our second proposal, namely CIPRNG-XOR.
B. CIPRNG-XOR
Unlike CIPRNG Multi-Cycle, CIPRNG-XOR needs only one input generator, which it post-processes using the vectorial negation. We have established in [55] that $G_f$ satisfies various properties of chaos with such an iteration function, one of them being chaos according to Devaney, which is studied in this article. Another interesting property proven in the aforementioned article is that, if the input generator is cryptographically secure, then the resulting CIPRNG-XOR generator, obtained after post-processing, still has this property. Once again, CIPRNG-XOR is a sub-category of our CIPRNG post-treatment. If we again consider that $x^0$ is the initial Boolean vector of size $N$ and that $(S^t)_{t \in \mathbb{N}}$ is the sequence generated by the input generator (producing numbers in $\llbracket 0, 2^N - 1 \rrbracket$), then operating $G_f$ with the vectorial negation on such sequences can obviously be rewritten as Eq. (10) [52]. We thus again find a direct equivalence between chaotic iterations using the vectorial negation and CIPRNG-XOR. The main requirement is to prevent the machine from working in isolation, by taking at each iterate a new input from the outside world (an entropy source such as physical white noise or some digits of the CPU temperature can be considered, for instance). By doing so, the finite state machine does not necessarily enter a loop: the same state can be visited twice, but with two completely different future evolutions, depending on the inputs the machine receives.
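A minimal Python sketch of the single-input rewriting $x^{t+1} = x^t \otimes S^t$ with $N = 64$ follows. Marsaglia's xorshift64 generator (shift triple 13, 7, 17) serves only as a stand-in input generator: it is not claimed to match the XOR64 variant evaluated in Section VI, and it is not cryptographically secure, so the security-preservation property recalled above does not apply to this toy configuration. The names `xorshift64` and `ciprng_xor` are illustrative.

```python
from typing import Iterator

MASK64 = (1 << 64) - 1


def xorshift64(seed: int) -> Iterator[int]:
    """Marsaglia's xorshift64 (shifts 13, 7, 17), a stand-in input PRNG.

    A zero seed is rejected when the first value is requested.
    """
    state = seed & MASK64
    if state == 0:
        raise ValueError("xorshift64 needs a nonzero 64-bit seed")
    while True:
        state ^= (state << 13) & MASK64
        state ^= state >> 7
        state ^= (state << 17) & MASK64
        yield state


def ciprng_xor(x0: int, inputs: Iterator[int], count: int) -> list[int]:
    """Chaotic iterations with the vectorial negation: x^{t+1} = x^t XOR S^t.

    Each S^t is a fresh 64-bit word drawn from the input generator and read
    as the mask of the bits to negate; the new state is also the output.
    """
    x = x0 & MASK64
    outputs = []
    for _ in range(count):
        x ^= next(inputs)
        outputs.append(x)
    return outputs


print([hex(v) for v in ciprng_xor(0x0123456789ABCDEF, xorshift64(42), 4)])
```

Because a new input word is consumed at every iterate, the state sequence is not confined to the internal cycle of the post-processor itself, which is the point made above about not working in isolation.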
Algorithm 2 details this approach, in which 3 PRNGs are embedded to compute the strategy. In the version we implemented, two 64-bit input PRNGs, denoted by $x_i$ and $y_i$, define the chaotic strategy $S$. We also added a third, 32-bit input generator $z_i$ for more complexity: at each iteration, $z_i$ randomly selects a subset of the inputs, as described in Equation 10, using only its $\log(\log(n))$ least significant bits (here, 3 bits).

Algorithm 2 CIPRNG-XOR
…
u_i ← PRNG1
y_i ← PRNG2
z_i ← PRNG3
if (z_i & 1) = 0 then
  …
if (z_i & 2) = 0 then
  …
if (z_i & 4) = 0 then
  …
r ← x ⊗ (y_i … 32)
return r
VI. FPGA IMPLEMENTATION BASED ON ZYNQ PLATFORM
A. General Presentation
The Xilinx Zynq-7000 Extensible Processing Platform (EPP) [15] is a silicon system on chip (SoC) proposed by Xilinx. This SoC combines ARM processors with a large set of peripherals (DDR, PCI, etc.); this sub-system is called the Processing System (PS). The FPGA fabric constitutes the Programmable Logic (PL), which is connected to the PS through AXI bus interfaces. Fig. 4 shows the detailed hardware architecture of the system we used to integrate and test CIPRNGs. The AXI-PRNG interconnect can host many PRNGs/CIPRNGs at the same time and activates the one currently under test. This interconnect is reconfigurable from the firmware, which uses two GPIO IPs for this task: GPIO-0 selects one PRNG at a time, and GPIO-1 sets the data burst size of the PRNG. Note that all PRNGs implemented in HLS or RTL, including the AXI-PRNG interconnect, use the AXI-Stream interface, whereas the CPU uses a memory-mapped interface. In addition to the CPU, the AXI-DMA engines, which handle the data transfers between slave and master IPs, use the Stream to Memory-Mapped (S2MM) receive channel connected to a slave port, and the Memory-Mapped to Stream (MM2S) transmit channel connected to the master. Final outputs are displayed on an external terminal via the UART protocol.
We used the Zybo board (XC7Z010-1CLG400C) as a prototyping kit for the experiments, with the clock configured at 125 MHz. The total PL resources used on the Zybo board are 2,982 LUTs (19%), 4,071 FFs (11%), 7 DSPs, and 3 memory blocks.
B. Global Comparison
As stated previously, the objective is to determine the performance of CIPRNG implementations in terms of area (space) and throughput (speed). The Xilinx tool reports all FPGA resources used as logic gates, LUTs, and flip-flops (registers), in addition to DSP and memory blocks. Although Xilinx computes the area by counting slices (1 Slice = 4× LUT + 2× FF + interconnection), the same 6-input LUT is used across all its technologies (Virtex5, Virtex6, Virtex7, and Zynq). Hence, for our area comparison, we only count LUTs and FFs, as [(LUT + FF) × 8], since DSPs and RAM blocks are hard macros that mostly affect timing performance. The raw throughput and the throughput-to-latency ratio are computed as in Equation (5). The latter is a complementary indicator that gives more accurate speed information about the generators.
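The area score is a one-line formula. The sketch below only restates the (LUT + FF) × 8 metric given above (the function name `area_score` is ours) and applies it, purely as an arithmetic illustration, to the PL utilization reported for the complete Zybo design; Equation (5) for throughput is not reproduced here.

```python
def area_score(lut: int, ff: int) -> int:
    """Area score used for the FPGA comparison: (LUT + FF) x 8.

    DSP and BRAM hard blocks are deliberately left out of the score.
    """
    return (lut + ff) * 8


# Worked example with the whole Zybo design reported in Section VI-A
# (all PRNGs plus the AXI infrastructure, not a single generator).
print(area_score(2982, 4071))  # prints 56424
```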
Regarding CI-based post-processing, we tested more than 275 versions of CIPRNG-XOR on our Mésocentre supercomputer facilities (170 passed TestU01) and 169 versions of CIPRNG-MC/MCMD (93 passed TestU01). Only those that pass the recommended TestU01 battery are reported hereafter. For a fair comparison, we disabled the use of DSP blocks for linear PRNGs. Additionally, with ASIC implementations in mind, we excluded every CIPRNG combination that uses BRAM or DSP macros (MT, TT800), so as to remain technology-independent.
Results concerning CIPRNG-XOR and CIPRNG-MC (respectively CIPRNG-MCMD) are summarized in Table III and Table IV (resp. in Table V). In the former table, we specify which combination has been studied. In the examples contained in these tables, A stands for XOR64, B for XOR128+, and C for LFSR258, while values 1, 2, 3, and 4 correspond to the Taus88, LFSR113, XOR128, and XOR32 generators respectively.

1) CIPRNG Multi-Cycle: As recalled previously, this version of the chaotic iterations post-treatment is based on two input PRNGs. For the FPGA implementation, 7 CIPRNG combinations have been selected for their hardware performance. According to the results presented in Table IV, the throughput of CIPRNG Multi-Cycle is larger than that of almost all linear PRNGs that pass TestU01 (PCG, MRG32, and XOR64*). Additionally, the consumed area remains small overall, even though 2 PRNGs are embedded, and no DSP or BRAM blocks are inferred. Regarding statistical evaluation, all the selected combinations passed TestU01, unlike all the other chaotic PRNGs based on the Hénon [38], Lorenz and Chen [34], and Tent [43] maps.
2) CIPRNG-XOR: In this last version, 7 other combinations of CIPRNG-XOR generators have been selected for their hardware performance compared with linear PRNGs (see Table III). The results show that, to generate 32 bits, CIPRNG-XOR achieves a throughput 2.5 times larger than almost all linear PRNGs that can pass TestU01. Furthermore, considering the Throughput/Latency ratio, CIPRNG-XOR is respectively 12, 30, and 7 times faster than XOR64*, PCG32, and the combined PRNGs (MRG32 and KISS124). Additionally, when DSP blocks are disabled, CIPRNG-XOR is respectively 25, 44, and 35 times faster than XOR64* (0.34 Gbps), PCG32 (0.2 Gbps), and the combined MRG32 (0.25 Gbps). The same holds for area: although CIPRNG-XOR embeds 3 PRNGs, it is 5 times more efficient than any other linear PRNG. Compared to all the other aforementioned chaotic PRNGs, all configurations of CIPRNG-XOR are more efficient in throughput, area, and ability to pass statistical tests. Therefore, for FPGA applications, all these combinations compare favorably with linear PRNGs, both in hardware performance and in statistical tests. Finally, compared to CIPRNG-MC, CIPRNG-XOR is less compact in area but considerably more efficient in throughput.
3) CIPRNG Multi-Cycle Multi-Dimension: This is a final, extended version of CIPRNG-MC, in which we apply our post-processing (Algorithm 1) to the TGFSR family (Mersenne Twister and TT800) with the tempering function disabled. This family offers a highly uniform, multidimensional distribution. As can be seen in Table V, post-processing improves these generators so that they pass the TestU01 battery, while providing improved performance compared with almost all chaotic PRNGs, such as those of [34], [38], [43]. Thanks to these qualities, these new CIPRNGs can be useful in parallel processing and computation applications, such as Monte Carlo simulation.
VII. ASIC IMPLEMENTATION
A. General Presentation
Unlike the FPGA flow, the ASIC flow implements the design in a specific process technology at the transistor level. In our case, the process technology node is UMC 65 nm LL, and the Cadence tools (v14) are the main software used for the implementation.
Table VI summarizes the ASIC implementation, which uses two global flows: first, the synthesis flow with Cadence RTL Compiler, and second, the physical place and route (P&R) flow with Cadence Encounter Digital Implementation. Both flows use Switching Activity Interchange Format (SAIF) information generated by simulation (1 million samples) for timing and dynamic power estimation. In addition, a signoff verification flow is used to close timing and power requirements. The operating conditions in each flow are as follows: synthesis uses a single mode with the Worst Case library (WC: 108°C, 1.08 V), while Multi-Mode Multi-Corner analysis is applied in the P&R flow, including both the worst-case and best-case libraries (BC: −40°C, 1.32 V).
B. ASIC Comparison
The results of the various ASIC implementations of CIPRNG can be summarized as follows.
1) Area Analysis: For ASIC implementations, two measures can be used to evaluate area consumption: either the Gate Equivalent (GE = Area / (AND gate area for 65 nm = 1.44 μm²)) or the number of transistors (TE = GE × 4, where AND has 4 transistors). The latter provides a technology-independent estimate of the circuit area. As expected, CIPRNG-XOR needs twice the area of CIPRNG-MC, since it uses three generators. For CIPRNG-MC, [1, 3], [2, 1], and [4, 1]

3) Power Analysis: Concerning power, we estimated both static and dynamic power, which correspond respectively to the leakage power and to the switching and internal power of the design. Leakage power is evaluated for each cell (logic) in its various states, while dynamic power depends on the initial state of the cells, the toggling inputs, the transition rate, and the output capacitive load. The dynamic power analyses in Table VI show a low power consumption for both CIPRNG-MC and CIPRNG-XOR. Table VI also shows that, when the clock is propagated (switching), the switching power of the CIPRNGs is lower than the internal power consumed by their cells. This is consistent with the area of both CIPRNG families. Nevertheless, CIPRNG-XOR consumes twice the power of CIPRNG-MC, which is compensated by its frequency and throughput. We can finally select the combinations
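The two area measures defined in the area analysis above are simple ratios. The Python sketch below only restates them with the constants given in the text (1.44 μm² reference gate area, 4 transistors per gate); the cell area used in the example is hypothetical and is not a result from Table VI.

```python
REF_GATE_AREA_UM2 = 1.44  # reference gate area in the 65 nm library (value from the text)
TRANSISTORS_PER_GE = 4    # transistors per reference gate (value from the text)


def gate_equivalents(area_um2: float) -> float:
    """GE = layout area / reference gate area."""
    return area_um2 / REF_GATE_AREA_UM2


def transistor_estimate(area_um2: float) -> float:
    """TE = GE x 4."""
    return TRANSISTORS_PER_GE * gate_equivalents(area_um2)


# Hypothetical cell area, not a measured CIPRNG result.
area = 1440.0
print(f"{gate_equivalents(area):.0f} GE, {transistor_estimate(area):.0f} transistors")
# prints 1000 GE, 4000 transistors
```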
VIII. STATISTICAL TESTS
During the experiments, the test batteries were run on a ZBook with an Intel Core i7-4800MQ CPU @ 2.70 GHz × 8, running Ubuntu 16.04 (64-bit) and GCC 5.4.0. For the NIST suite, 100 sequences of $10^6$ bits were generated and tested. The results confirm that all chaotic-iteration post-processings of linear PRNGs pass the NIST suite, for which the minimum pass rate of each statistical test is approximately 96 for a sample of 100 binary sequences. In the TestU01 case, all CIPRNG configurations of both proposals (MC and XOR) successfully pass this battery, whereas the other chaotic PRNGs mentioned in this article fail it.
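The threshold of about 96 out of 100 sequences quoted above is consistent with the proportion criterion of NIST SP 800-22, whose acceptable range is $\hat{p} \pm 3\sqrt{\hat{p}(1-\hat{p})/m}$, with $\hat{p} = 1 - \alpha$ and $m$ the number of sequences. Assuming the default significance level $\alpha = 0.01$ (the text does not state $\alpha$), the lower bound can be computed with the following sketch; the function name `min_proportion` is ours.

```python
import math


def min_proportion(m: int, alpha: float = 0.01) -> float:
    """Lower bound of the acceptable proportion of passing sequences.

    p_hat - 3 * sqrt(p_hat * (1 - p_hat) / m), with p_hat = 1 - alpha.
    """
    p_hat = 1.0 - alpha
    return p_hat - 3.0 * math.sqrt(p_hat * (1.0 - p_hat) / m)


bound = min_proportion(100)
print(f"{bound:.4f}, i.e. about {100 * bound:.1f} of 100 sequences")
# prints 0.9602, i.e. about 96.0 of 100 sequences
```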
IX. CONCLUSION
In this paper, which is an extension of [16], we have presented a new family of post-processing PRNGs based on chaotic iterations for FPGA and ASIC. This work has studied the performance of various linear PRNG implementations on FPGA with regard to linear complexity, seed size, arithmetic operations, and throughput/latency. To investigate these parameters, a SoC based on the Zynq EPP platform (hardware and firmware) has been developed to accelerate the implementation and testing of various PRNGs on FPGA. The results have been used to guide the design of a hardware post-processing treatment based on chaotic iterations, which aims to improve the statistical profile of flawed generators. The main conclusion is that chaotic-iteration post-processing provides an alternative to combined PRNGs at no additional cost, being 2.5 times faster and 5 times more efficient than almost all the linear PRNGs that can pass TestU01.
