Abstract-This paper details the design and implementation performance of an efficient generator of chaotic discrete integer-valued sequences. The generator exhibits orbits that are very long compared to those reported in the literature. It is implemented in C and parallelized using the Parameterized and Interfaced Synchronous Dataflow Model of Computation (PiSDF MoC). The proposed structure is shown to be scalable, parallel and time efficient. The resulting implementation combines a very long minimal chaotic sequence length, $o_{\min} > 7 \times 2^{128}$ 32-bit samples, with a very high throughput of 173 Mbps on 4 cores of a General Purpose Processor.
I. INTRODUCTION
Many information hiding and security systems, such as cryptosystems, key generation, steganography and digital watermarking systems, need pseudo-random number generators (PRNGs) to achieve their objectives. An unpredictable physical process, such as plasmonic or resonant systems and quantum physics [1], can be used to produce non-deterministic random numbers (or bits). Process-based algorithms, in contrast, generate numbers (bits) deterministically [2] [3] [4]. In the last two decades, a class of PRNGs called chaotic PRNGs has attracted much attention. This class presents desirable properties, such as high sensitivity to the initial conditions and to the parameters of the system, ergodicity, and ease of implementation, that make its members very good candidates for use in information hiding and security systems.
We propose in this paper an efficient implementation of a generator of discrete pseudo-chaotic sequences using a finite precision of N = 32 bits. The proposed structure integrates three techniques (recursion, disturbance and cascading) to avoid dynamical degradation. It is also very efficient in terms of robustness and data rate. The generator results from a French patent [15] and its PCT extension.
Dataflow modeling and global multicore scheduling are used to generate a parallel executable for the chaotic random generator. Dataflow modeling consists in breaking the computation into independent processes that communicate through First-In First-Out (FIFO) data queues. Global multicore scheduling consists of choosing a core for executing each process and ordering their execution on each core. In this paper, the chaotic generator is described by a Parameterized and Interfaced Synchronous Dataflow graph (PiSDF) [5] that serves as an input for the PREESM rapid prototyping tool [6] .
PREESM is an open-source Eclipse-based tool that can simulate and generate multicore code for a given topology, i.e. organization of cores, and parallelism of architecture within minutes.
The applications of such an efficient generator are numerous in the field of information hiding and security. The paper is organized as follows. Related work is presented in Section 2 and the proposed chaotic generator is described in Section 3. Section 4 describes the proposed optimization techniques. Experimental results are given in Section 5, before concluding in Section 6.
II. RELATED WORK
Chaos can be generated by non-linear dynamical systems. In a discrete-time non-linear dynamical system, the generation of chaos is governed by a set of difference equations, namely a chaotic map. In the literature, the number of well-known one-, two- and three-dimensional chaotic maps is surprisingly limited, in spite of the fact that they are widely used in many applications related to data security. In [7], we studied some of them, namely the Logistic map, the piecewise linear chaotic map (PWLCM) and the Frey map, and we designed others, specifically x cos(x), x exp[cos(x)], and the 2-D Tmap. We demonstrated that, when used alone, these chaotic maps do not exhibit very good statistical properties and, in particular, that their cycle length is limited. These observations generally hold for the other well-known chaotic maps, such as the 1-D Skew tent map [8], the 2-D Standard map, 2-D Cat map and 2-D Baker map [9], used in many cryptosystems to achieve the confusion/diffusion effects, or the Lozi maps [10] and the 3-D Lorenz map [11], used as chaotic generators. Lian et al. [12] studied in detail the performance of the 2-D Standard, Cat and Baker maps. They demonstrated that the Cat map has the smallest key space but the highest key sensitivity, and that it is suitable for cryptosystems using a different key in every iteration (dynamic keys). In this manner the key space is enlarged and key sensitivity is increased. The key spaces of the Standard map and the Baker map are both larger than that of the Cat map. However, to keep high key sensitivity, the number of iterations should be larger than 4 for the Standard map and larger than 12 for the Baker map. These maps are more suitable for cryptosystems in which the same key is used in different iterations. Moreover, to keep a good confusion property, the average distance change in the whole image must be greater than 40%, and then the number of iterations of the Cat map must be no smaller than 6.
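To make the cycle-length limitation concrete, the following C sketch measures the transient and the cycle length of a single map iterated in finite precision. The 16-bit discretization of the Logistic map used here, the function names and the seeds are illustrative assumptions; they are not the maps or parameters studied in [7].

```c
#include <stdint.h>
#include <string.h>

#define STATE_BITS 16u                    /* illustrative precision (assumption) */
#define STATE_COUNT (1u << STATE_BITS)    /* number of representable states */

/* Hypothetical 16-bit discretization of the Logistic map 4x(1-x):
 * f(x) = floor(4 * x * (2^16 - x) / 2^16), clamped to the state range. */
static uint32_t logistic16(uint32_t x)
{
    uint64_t y = ((uint64_t)x * (STATE_COUNT - x)) >> (STATE_BITS - 2u);
    return (y >= STATE_COUNT) ? STATE_COUNT - 1u : (uint32_t)y;
}

/* Iterates from seed until a state repeats. Returns the cycle length and
 * stores the number of transient steps before the cycle in *transient. */
static uint32_t orbit_cycle_length(uint32_t seed, uint32_t *transient)
{
    static uint32_t first_seen[STATE_COUNT];   /* step index + 1, 0 = unseen */
    memset(first_seen, 0, sizeof first_seen);

    uint32_t x = seed % STATE_COUNT;
    uint32_t step = 0;
    while (first_seen[x] == 0) {
        first_seen[x] = ++step;     /* remember when this state was visited */
        x = logistic16(x);
    }
    *transient = first_seen[x] - 1u;
    return step - *transient;
}
```

Whatever the seed, the transient plus the cycle cannot exceed $2^{16}$ states in this sketch, because the map acts on a finite set. This is why a single map in finite precision needs additional mechanisms to reach long orbits.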
To enhance the statistical properties of the generated sequences, the main ideas are based on the introduction of a technique of disturbance of the pseudo-chaotic orbit and also on the mixing of different components.
In [10], René Lozi introduced new models of very weakly coupled logistic and symmetric tent maps, based on a matrix of disturbance and using single or double precision numbers. He demonstrated that the 3-coupled tent maps with a small value of perturbation can be used as a generator of pseudo-chaotic numbers with a uniform distribution over the interval [-1, +1]. The generated sequences have orbits of very long period, greater than $10^{9}$ and less than $10^{12}$.
In [13], Thomas E. Tkacik proposed a 32-bit hardware random number generator based on a Linear Feedback Shift Register (LFSR) and a Cellular Automata Shift Register (CASR). The LFSR uses a primitive polynomial of degree 43 and therefore has a cycle length of $2^{43} - 1$. The CASR is 37 bits long, with a CA150 at cell site 28 and CA90s at all other cell sites, so it has a maximal length of $2^{37} - 1$. The output of the generator is formed by 32 bits, selected and permuted from the LFSR and the CASR, and then XORed together. The cycle length of the generator is close to $2^{80}$. The bit rate of such a system depends on the maximum oscillator frequencies that can be used, but the authors do not give any information about this important question.
Another design from the literature is a pseudorandom generator including a first non-linear feedback shift register (NLFSR 1) with R memory cells, combined with a second one with S memory cells by a multiplication operator. The result is then XORed with a third NLFSR with T memory cells to obtain a final signal representing a pseudorandom number. The period length of the output sequences is equal to $(2^{R} - 1)(2^{S} - 1)(2^{T} - 1)$, but there is no information related to the bit rate performance.
Most chaos-based generators in the literature rely on floating-point operations. This leads to a problem when the resolution or the rounding method differs between the sender and the receiver side: in a cryptographic system, the chaotic sequences generated by the emitter and by the receiver will then be different. To avoid this problem, the generator presented in this paper works on a fixed finite precision of N bits. However, with a finite precision N, the chaotic dynamics are degraded and short cycles can appear.
III. DESCRIPTION OF THE PROPOSED CHAOTIC GENERATOR
A. Architecture of the proposed chaotic generator
The architecture of the proposed chaotic generator is given in Fig. 1. It consists of 28 basic chaotic generators, 4 analog 8-to-1 multiplexers, 3 XORs, a linear feedback shift register (LFSR), an operation of elimination of a percentage of samples, and a quantization block on $N_q < N$ bits.
The proposed structure is modular, scalable, generic, and it produces a very long orbit with a large secret key.
B. Structure of the basic chaotic generator
Every basic chaotic generator of Fig. 1 integrates a perturbation technique of the chaotic orbit, based on a linear feedback shift register (LFSR), which increases its cycle length by a factor $(2^{k} - 1)$, where k is the degree of the primitive polynomial of the LFSR. Fig. 2 shows the recursive structure of each basic chaotic generator (P-G) in the case of third-order recursive filters. The discrete Skew-tent map (NLF1) and the discrete piecewise linear chaotic map (PWLCM, NLF2) are used as non-linear functions. The output (integer pseudo-chaotic values) of each basic chaotic generator X(n) is given by the following equations.
Where P1 and P2 are the control parameters of the Skew tent map and of the PWLCM map, ranging from 1 to $2^{N} - 1$ and from 1 to $2^{N-1} - 1$ respectively. Q1 and Q2 are the perturbing signals produced by the LFSRs, ranging respectively from 1 to $2^{k_s} - 1$ and from 1 to $2^{k_p} - 1$, where $k_s$ and $k_p$ are the degrees of the primitive polynomials of the LFSRs used to disturb the Skew tent map and the PWLCM map. The coefficients $c_{11}$, $c_{12}$, $c_{13}$
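The perturbing signals take values from 1 to $2^{k} - 1$ because a maximal-length LFSR of degree k, driven by a primitive polynomial, visits every nonzero k-bit state before repeating. The C sketch below checks this period for small degrees; the Fibonacci-style register layout, the function names and the choice of $x^3 + x + 1$ as example polynomial are assumptions made for illustration, not the authors' implementation.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a k-bit Fibonacci LFSR: the feedback bit is the XOR of the
 * state bits selected by tap_mask and is shifted in at the top. */
static uint32_t lfsr_step(uint32_t state, uint32_t tap_mask, unsigned k)
{
    uint32_t selected = state & tap_mask;
    uint32_t feedback = 0;
    while (selected != 0) {          /* parity of the tapped bits */
        feedback ^= selected & 1u;
        selected >>= 1;
    }
    return (state >> 1) | (feedback << (k - 1u));
}

/* Number of steps before a nonzero seed state reappears. */
static uint32_t lfsr_period(uint32_t seed, uint32_t tap_mask, unsigned k)
{
    assert(k >= 1u && k <= 31u);
    assert(seed != 0 && seed < (1u << k));
    assert((tap_mask & 1u) != 0);    /* bit 0 tapped: the step is invertible */
    uint32_t state = seed;
    uint32_t steps = 0;
    do {
        state = lfsr_step(state, tap_mask, k);
        ++steps;
    } while (state != seed);
    return steps;
}

/* Taps for x^3 + x + 1: s[n+3] = s[n+1] XOR s[n], i.e. state bits 1 and 0. */
enum { TAPS_X3_X_1 = 0x3 };
```

Calling `lfsr_period(1u, TAPS_X3_X_1, 3u)` counts the states visited from seed 1; for a primitive polynomial of degree 3 this is $2^3 - 1 = 7$.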
C. Sequence of execution of the chaotic generator
The proposed generator of Fig. 1 runs as follows:
1) Initialization of the twenty-eight perturbed generators and the LFSR.
2) In every state j = 1, 2, ..., 7 of the LFSR (implemented by the following primitive polynomial: $g(x) = x^{3} + x + 1$ and controlled by the clock Ck), the length of the chaotic sequence at output y7 is given by the least common multiple (lcm) of several secondary outputs. For example: (4)
[...] is the length of the chaotic sequence at output y5, where:
[...] is the length of the chaotic sequence at output y1 and
[...] is the length of the chaotic sequence at output y2, and so on.
The minimal length of the chaotic sequence at output y is:
$o_{\min} > 7 \times 2^{128}$ (8)
It is extremely long.
IV. PROPOSED METHODS FOR OPTIMIZING IMPLEMENTATION
To obtain an efficient implementation of the chaotic generator, parallelization at different levels is carried out. A dataflow description is used to parallelize the different actors, and the actor execution time is minimized by optimizing the actor C code.
A. Dataflow description
A dataflow description is used to model the application as a directed graph. Dataflow descriptions provide advanced semantics for expressing parallelism in an algorithm regardless of the targeted hardware architecture. The directed edges of the description model the flow of the data, and the actors, corresponding to the graph nodes, model the transformations applied to the data. The hierarchical PiSDF description of the algorithm is transformed by PREESM into a single-rate graph that exposes the maximum parallelism of the graph. This graph is displayed in Fig. 3. It contains actors for the PWLCM and SkewTent nonlinear functions, XOR operations, as well as a key source actor "ReadKeys" and a sequence sink actor. The "Fork" actor is a generated actor that automatically distributes data tokens to make the right data available for each actor. In the case of large generated sequences, the processing time is dominated by the PWLCM and SkewTent nonlinear functions. In the studied setup, 4 PWLCM and 4 SkewTent functions can be executed in parallel, bringing useful parallelism to the execution. In our experiments, the large amount of data to send between processes limits this parallelism to 2.24 on 4 cores (see Section 5.C). The parallel code generation process is described in [18]. One C thread is generated for each core and synchronized with the other threads. Communications are implemented via shared memory and synchronization.
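The dataflow principle described above (actors exchanging tokens through FIFO queues, with a fork actor distributing tokens) can be sketched in plain C as follows. This is a single-threaded illustration with a hand-written static schedule; the `fifo_t` type, the actor functions and the placeholder computation are hypothetical and are not the code generated by PREESM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 16u

/* Bounded FIFO of 32-bit tokens connecting two actors. */
typedef struct {
    uint32_t data[FIFO_CAPACITY];
    size_t head;   /* index of the next token to read */
    size_t count;  /* number of tokens currently stored */
} fifo_t;

static void fifo_push(fifo_t *f, uint32_t token)
{
    assert(f->count < FIFO_CAPACITY);
    f->data[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
}

static uint32_t fifo_pop(fifo_t *f)
{
    assert(f->count > 0);
    uint32_t token = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_CAPACITY;
    f->count--;
    return token;
}

/* Fork actor: consumes 2 tokens and sends one to each branch. */
static void actor_fork(fifo_t *in, fifo_t *out_a, fifo_t *out_b)
{
    fifo_push(out_a, fifo_pop(in));
    fifo_push(out_b, fifo_pop(in));
}

/* Placeholder worker actor (stands in for a nonlinear function). */
static void actor_worker(fifo_t *in, fifo_t *out)
{
    uint32_t x = fifo_pop(in);
    fifo_push(out, x * 2654435761u + 1u);   /* arbitrary mixing, not a chaotic map */
}

/* XOR actor: combines one token from each branch. */
static void actor_xor(fifo_t *in_a, fifo_t *in_b, fifo_t *out)
{
    uint32_t a = fifo_pop(in_a);
    uint32_t b = fifo_pop(in_b);
    fifo_push(out, a ^ b);
}

/* Static schedule for one graph iteration: fork, two workers, xor.
 * The two worker firings are independent and could run on different cores. */
static uint32_t run_iteration(uint32_t key_a, uint32_t key_b)
{
    fifo_t keys = {0}, to_a = {0}, to_b = {0}, from_a = {0}, from_b = {0}, out = {0};
    fifo_push(&keys, key_a);
    fifo_push(&keys, key_b);
    actor_fork(&keys, &to_a, &to_b);
    actor_worker(&to_a, &from_a);
    actor_worker(&to_b, &from_b);
    actor_xor(&from_a, &from_b, &out);
    return fifo_pop(&out);
}
```

Because each actor only touches its own input and output queues, the two worker firings have no data dependency on each other. This is the kind of parallelism a multicore scheduler can exploit, at the price of moving tokens between cores.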
B. Actor C code optimization
To minimize each actor's execution time, the actor C code is optimized. After a code profiling analysis, the optimizations focus on repetitive structures (loops) and mathematical functions, which consume a significant part of the total execution time. Loop parameters are set to constant values at compile time to benefit from compiler optimizations dedicated to loops. Loop unswitching is carried out to remove conditional structures from loop kernels, which eliminates the overhead due to tests and conditional branching. Instead of using functions from the mathematical library, functions such as modulo are rewritten and simplified for our specific use case. Combined with compiler optimizations, these code optimizations reduce the actor execution time by one order of magnitude.
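Two of the source-level optimizations named above can be illustrated with a short C sketch. The example assumes, for illustration only, that the modulo is taken with respect to $2^{32}$ (so that it reduces to a bit mask) and that the loop contains a loop-invariant test; the function names and the processing itself are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Modulo 2^32 of a 64-bit value rewritten as a mask (valid only because the
 * modulus is a power of two; this modulus is an assumption of the sketch). */
static inline uint32_t mod_pow2_32(uint64_t x)
{
    return (uint32_t)(x & 0xFFFFFFFFu);
}

/* Baseline: the loop-invariant flag is tested at every iteration and the
 * reduction uses the generic % operator. */
static void accumulate_naive(uint32_t *out, const uint32_t *in, size_t n, int invert)
{
    for (size_t i = 0; i < n; ++i) {
        uint64_t v = (uint64_t)in[i] * 3u + 1u;
        if (invert)
            out[i] = ~(uint32_t)(v % (UINT64_C(1) << 32));
        else
            out[i] = (uint32_t)(v % (UINT64_C(1) << 32));
    }
}

/* Unswitched: the test is hoisted out, leaving two branch-free loop kernels. */
static void accumulate_unswitched(uint32_t *out, const uint32_t *in, size_t n, int invert)
{
    if (invert) {
        for (size_t i = 0; i < n; ++i)
            out[i] = ~mod_pow2_32((uint64_t)in[i] * 3u + 1u);
    } else {
        for (size_t i = 0; i < n; ++i)
            out[i] = mod_pow2_32((uint64_t)in[i] * 3u + 1u);
    }
}
```

Both functions compute the same values. Modern compilers may apply these transformations themselves at high optimization levels; the sketch only makes them explicit.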
V. EXPERIMENTAL RESULTS
A. NIST test and mapping
The NIST test consists of a battery of 188 tests (15 different tests overall) used to conclude on the randomness or non-randomness of binary sequences [19]. We generated 100 sequences of one million bits, each with a different secret key, and ran the NIST test on them. The proposed generator passes all the tests and is therefore robust against statistical attacks.
B. Bit rate Performance
TABLE: BIT RATE PERFORMANCE
C. Details of the Multicore Execution Performances
Fig. 4 shows the bit rate of the random generator for core numbers between 1 and 4 and different sequence lengths ranging from 1 to 32768 32-bit samples. A sequence is a list of samples generated by a single iteration of the algorithm. One may remark that a fair speedup of up to 2.24 is obtained between sequential execution and execution on 4 cores. The performance obtained on 3 cores is nearly equivalent to that on 2 cores due to unbalanced core loads. A minimal sequence length is required in order to obtain enough parallelism for the multicore execution. In our use case, a generated sequence of 4K samples (128 Kbits) provides near-maximum parallelism. For short sequences, communication dominates computation in terms of time.
A further figure displays the memory necessary to compute the random sequence generator. For generated sequences over 256 samples, the memory needs grow linearly at a rate of about 128 Bytes/sample. The dataflow parallelization method used provides a quasi-constant memory requirement when the number of cores increases.
VI. CONCLUSION
We presented in this paper a generator of discrete chaotic sequences and its efficient multicore implementation. The resulting implementation combines a very long minimal chaotic sequence length, $o_{\min} > 7 \times 2^{128}$ samples, with a very high throughput of 173 Mbps on 4 cores. A possible direction for future work is the extension of the algorithm to obtain higher parallelism for equivalent cycle length properties.
