In this paper, a high-throughput Cryptographically Secure Pseudo-Random Number Generator (CSPRNG) using the bitslicing technique is proposed. In such technique, instead of the conventional row-major data representation, column-major data representation is employed which allows the bitsliced implementation to take full advantage of all the available datapath of the hardware platform. We use LFSR-based (Linear Feedback Shift Register) PRNG for our implementation since its register oriented architecture perfectly suits the GPU's many-core structure and allows for usage of bitslicing technique which can further improve its performance. In our GPU implementation, each GPU thread is capable of generating a remarkable number of 32 pseudo-random bits in each LFSR clock cycle. In order to obtain cryptographically suitable properties, we propose an SIMD vectorized fully parallel bitsliced implementation of the LFSRbased, cryptographically secure MICKEY 2.0 stream cipher algorithm for CSPRNG. To the best of our knowledge, our method not only achieves better performance, but also significantly outperforms optical solutions in terms of performance per cost while maintaining an acceptable measure of randomness. It should be mentioned that our implementation successfully passes the NIST test for statistical randomness and bitwise correlation criteria. Our proposed methodology significantly outperforms the current best implementations in the literature for computer-based PRNG. Moreover, our evaluations show 6.6x improvement over the Nvidia's proprietary high-performance PRNG, cu-RAND library, achieving 5.2 Tb/s of throughput on the affordable Nvidia GTX 980 Ti.
I. Introduction
The emergence of cost-effective, high-performance parallel platforms such as Graphical Processing Units (GPU) and their programmability have allowed researchers across various fields of science and engineering to utilize the specialized processing capability of GPUs to accelerate their computationally demanding applications. GPUs' processing power has been fully leveraged for implementation of machine learning algorithms [1, 2] , medical image processing [3] , and many other applications.
Recently, the high-performance execution on GPU has attracted the attention of many researchers to adapt cryptography problems for execution on massively-parallel GPU platforms [4, 5] . One problem which is the particular concern of this paper, is the high-throughput generation of sequences of pseudo-random numbers. The high-performance random number generation with an acceptable criterion of randomness is a vital necessity in many computer science disciplines, including stochastic computing [6, 7] , stochastic simulation, i.e. Monte Carlo simulation [8] , and cryptography [9] .
For cryptography purposes, the underlying pseudorandom number generator process, apart from statistical randomness, must accompany other security assurances that vary based on the intended application. The Cryptographically Secure Pseudo-random Number Generator (CSPRNG) processes work based on increasing the entropy of the output sequences which makes the output sequence to be indistinguishable from uniformly random bit sequences. Moreover, the unpredictability of next-bit must be further guaranteed. Here, we intend to apply the bitslicing technique in the software implementation of CSPRNG processes.
In the bitslicing technique, by altering the representation of the input data and computations, we strive to first, increase the utilization of the computation units, and second, reduce the required operations from costly operations to hardware-friendly basic bit-level operations, such as XOR, AND, and OR operations. With the incorporation of the bitslicing technique in our implementations, we can achieve highly-parallel, vectorized execution in the SIMD manner. [10] has proposed the successful utilization of the bitslicing technique in the implementation of the Data Encryption Standard (DES) on a 64-bit processor, where the processor is viewed as an SIMD processing unit. In [10] , by introducing the bitslicing technique in the implementation of the DES, the 64-bit CPU can be perceived as 64 1bit CPUs that process 64 chunks of data, simultaneously. In [5] , the authors have proposed the high-throughput implementation of the bitsliced DES exhaustive key search cryptanalysis technique on programmable GPU platforms.
A characteristic of the software implementation of LFSR-based PRNGs is the intrinsic need for repetitive, costly bit-level shift and mask operations. By proposing the bitslicing technique and changing the data and com-putation representation, we have successfully transformed the aforesaid costly bit-level shift and mask operations to more efficient register swapping techniques. Moreover, with the bitsliced implementation of CSPRNGs, we have reached full utilization of the computational units' processing capability. In the bitslicing representation technique, instead of processing data from one data chunk, each computational unit processes 32 data chunks, simultaneously. This fact, alongside the reduction in the computational complexity of the required operations, greatly improves the execution throughput of our PRNG implementations.
In this paper, as our main contribution, we have proposed a high-performance bitsliced software implementation of the stream cipher Mickey 2.0 pseudo-random number generator. Moreover, we have evaluated the cryptographical security of the generated pseudo-random sequences from our implementation with NIST test, where our implementation has fully passed the requirements and have performed roughly ideally in the remaining constraints. To the best of our knowledge, our implementation outperforms all the available implementations of PRNG in terms of performance. In GPU implementation, our implementation outperforms the Nvidia's proprietary cuRAND, random number generator, by 6.6 X.
To showcase the true capability of the bitslicing technique in software implementation of PRNGs, we have used multiple Nvidia GPUs and performance monitoring and evaluation tools. Although, we have proposed and evaluated our technique on the GPU platform, the bitsliced implementation of pseudo-random number generators can be fruitfully applied to other hardware and software implementations and is not bounded to any kind of platform. The rest of this paper is as follows:
In section II, we will introduce a detailed background on the bitslicing technique, Mickey 2.0, and the NIST test. Section III dicusses the related efforts to PRNG and RNG implementations. Section IV gives our proposed methodology and elaborates on the incorporation of the bitslicing technique in our implementation. Section V, gives the evaluation results achieved from the performance and correctness of our proposed methodology on multiple GPUs. Section VI concludes the paper and discusses future works.
II. Background
In this section, we will give a complete and comprehensive background on the bitslicing technique and the literature on pseudo-random number generation. Moreover, the underlying mechanisms of linear-feedback shift register, LFSR and stream cipher Mickey 2.0 will be discussed in details.
A. Bitslicing technique
Originally, bitslicing, although not under the name of bitslicing, has been used to build n-bit processors from 1bit processors. n 1-bit processors are cascaded together with sufficient control units, to create processors with higher word-length. In this technique, each processor operators on single bits from the data, now, the ensemble of n 1-bit CPUs can replicate an n-bit processor.
Apart from this usage, bitslicing has found its way into implementation of hardware and software highly parallel solutions in different disciplines. Eli Biham has proposed the use of bitslicing in the implementation of the DES encryption/decryption on a 64-bit CPU where the 64-bit CPU is viewed as 64 1-bit CPUs [10] . Each of the 64 1bit CPUs, performs parallel operations on a single-bit from different chunks of data. This way the 64-bit CPU, instead of operating on 64 bits from a single data chunk, operates on 64 bits from 64 different data chunks in parallel [10] . Hence, we can achieve full SIMD execution on data.
In the bitslicing technique used for software and hardware implementation of different applications, we alter and change the representation of the data and computation from the conventional representation to the column-major representation. By conventional representation, we mean the data representation method used in conventional programming language and hardware representation of the data. In the conventional row-major representation, each chunk of data is stored and referred to in single or multiple registers in the row-major format. Meaning that, for the storage of 128 bits from data, four 32-bit registers are used where R 0 stores bits with indices from 0 to 31, R 1 stores bits from 32 to 63, and so on until bits with indices of 96 to 127 are stored in R 4 . Notice that in order to store 128 bits of data 4×32-bit registers are needed. Now, for example, say that we need to store and process four chunks of 32-bit data. In the conventional representation, four 32-bit registers can be used to store all the input data. However, in the bitslicing technique, we use 32 4-bit registers to store and represent the data. In this representation, each register stores bits with identical indices from all of the four data chunks, meaning that register R 0 instead of storing 32 bits from the first data chunk, stores the Least Significant Bits from all of the four data chunks. Lets say that, b m n is the n th bit from the m th data chunk. In the conventional representation of these four 32-bit data chunks,
where i indicates the data index. However, with the bitslicing representation, register
This indicates the fact that each register stores bits with identical indices from all of the data chunks.
As already mentioned, in the conventional data representation, 4 32-bit registers are required to store our four data chunks. Now, in the bitsliced representation, we need 32 4-bit registers. We need 32 registers due to the fact that each of the original data chunks has 32 bits, and in the bitsliced representation, each register stores bits with the same significance from all of the data, hence 32 registers are needed to cover bits from LSB to MSB. The need for 4bit registers is due to the fact that each bit of information in each register is from one data chunk and by having four data chunks, we require only 4 bits. Figure 1 serves as an example model of the alteration in representation from the conventional order to the column- major bitsliced representation. It is worth noting that we have proposed the example only for illustration and explanation purposes and the size of the data chunks are not representative of the data chunks used in our CSPRNG implementation.
B. Random Number Generation
Truly random number generator processes are set to be non-deterministic, a condition under which the generated random sequence can not be determined in advance. Truly random number sequences can be generated from sampling of truly random sequences such as physical truly random phenomenon, including Thermal Noise [11] , Electrical Noise [12] , and Laser or Optical mechanisms [13] . However, such random number generators that use physical phenomenon are costly. Also, the unavailability of the required apparatus limits the scope of the usage for general applications. Although, such random sequences can be stored for later use which also limits the availability and security in certain applications.
The aforementioned issues of cost and availability lead to the use of digital computers in the generation of random numbers. Pseudo-random number generators are not truly random processes which roots from the deterministic essence of digital computers but are specifically designed to meet certain criteria of randomness in their generated sequences. One of the first PRNG methods that uses a random seed and relies on the randomness of the seed for the generation of reproducible random sequences is the Middle Square Method (MSM) [14] . With truly random seeds, PRNGs can generate random sequences until the seed is repeated and the sequence repeats in the output. The size of the initial seed indicates the size of the generated random sequence before the repeat in the generated sequence.
One great advantage of PRNG processes is that with the use of the same seed, the generated random sequence can be reproduced. However, it would also be computationally feasible to find the random input seed that the PRNG process is using to generate the pseudo-random sequence by exhaustively searching the seed space with a part of the sequence, to find and predict the next-bit of the sequence.
To counter this, the size of the seed must be increased in accordance to the increase in the performance of emerging computer systems.
C. Linear Feedback Shift Registers
Based on the mathematical foundation of cyclic codes over finite field of GF (2), the Linear Feedback shift registers have been employed both in software and hardware for a wide range of applications including transmission error checks [15] , high-performance counters and of course pseudo-random number generators. In LFSR, the feedback tabs which determine the next state of the system if combined linearly could directly impact the input of the system when the register is shifted at each clockcycle. Figure 2 demonstrates a basic representation of a single n-bit LFSR. The arrangements of the tabs could be represented by a polynomial referred to as the feedback polynomial.
Fig. 2: A basic n-bit Linear Feedback Shift Register
For a simple n-bit LFSR, the coefficients, the operations defined over GF (2) , and the reciprocal characteristic polynomial could be represented as is in Equation 1:
Also, it is worth noting that in many applications, in order to maximize the LFSR period length (i.e. 2 n − 1), a primitive polynomial is chosen as the tapping coefficients for the LFSR.
D. MICKEY 2.0 Stream Cipher
As already mentioned, stream ciphers are known as good candidates for CSPRG. Among all of various and different proposed stream cipher algorithms, some are specifically designed for efficient hardware implementation while guaranteeing acceptable level of security. The ECRYPT Stream Cipher Project (eSTREAM) Profile 2 stream ciphers are particularly suitable for hardware applications with restricted resources such as limited storage, gate count, or power consumption [16] . MICKEY 2.0 or Mutual Irregular Clocking KEYstream generator is the second generation stream cipher of the MICKEY family by Babbage and Dodd [17] . Armed with the fact that the MICKEY 2.0 algorithm is inherently light-weight in hardware implementation, its feedback shift register based architecture can be easily implemented with our proposed bitslicing technique.
The state machine of the algorithm consists of two 100bit shift registers, one linear and one non-linear, both clocked irregularly under the control of each other. The MICKEY 2.0 specific clocking mechanisms contribute to the cipher's cryptographic strength while allowing guarantees on period and pseudorandomness. It is stated that each key can be used with up to 2 40 different IVs of the same length, and that 2 40 keystreams can be generated from each key/IV pair. Figure 3 shows the Galois-based structure of the MICKEY 2.0 algorithm. In Figure 3 , R 
Where feedback Sets F B 1 and F B 0 are filled with constant values specified in the Mickey 2.0 specification. Also, control bits for R and S registers, denoted by CB r and CB s are derived from state values at each cycle and is governed by Equation 3:
These bits are responsible for the irregular clocking in the system which exploits some additional level of nonlinearity, preventing the statistical analysis of the cipher. The feedback procedure for each register is governed by Equation 4 , where the Comp 0 and Comp 1 sets are also chosen as constant sets in MICKEY specification.
Moreover, as illustrated in Figure 3 , input-bit R and input-bit S, changing at each cycle, are directly derived from the input bit, as shown in Equation 5 :
The initialization of the stream cipher, includes n number of IV-Load clock cycles, where the input-bit is derived directly from the n-bit Initial Vector, followed by 80 clocks of Key loading operation where the key bitstream is passed to the input bits. Before the keystream generation, the final stage of Pre-Clocking with 100 clocks is executed where the input is set to 0. Note that all stages except for key-generation are executed while the Mix value is set to 1. As shown in Figure 3 , the key stream, which in our case is considered as the pseudo-random bits, is derived by XORing the first bits of the S 0 and R 0 registers.
The MICKEY 2.0 designers have also specified a scaledup version of the cipher called MICKEY-128 2.0, which employs a 128-bit key and an initialization vector of up to 128 bits. It is stated in the specification that the irregular clocking mechanism makes the parallel implementation somehow not so straightforward. However, as will be thoroughly investigated later, our proposed bitsliced algorithm utilizes a fully parallel implementation of MIKCEY. Furthermore, it has been noted by Gierlichs et al. [18] , that straightforward implementations of the MICKEY ciphers are likely to be susceptible to timing or power analysis attacks. However, the system could be immunized by software techniques like masking, making these attacks significantly ineffective. Otherwise there have been no known cryptanalytic advances against MICKEY 2.0 or MICKEY-128 2.0 after its publication in eSTREAM. [19] III. Related Efforts Random number generation has been a topic of interest for researchers and developers for decades. Numerous theoretical and practical studies have investigated the complexity of generating high quality random numbers and evaluating them [20, 21] . Making use of parallel platforms for acceleration of RNGs has also been a matter of consideration in the literature [22] . Staring around the 2000's, the emergence of general-purpose computing on graphics processing units has opened new horizons for high-performance generation of random numbers. In 2006, M Sussman et al. published one of the first works utilizing the power of GPGPU on the subject of RNG [23] . In the years to follow, many researchers and developers reported successful implementations of PRNGs with increase in performance on parallel platforms, achieving remarkable speedups over the CPU platforms of their time and outperforming similar efforts [24, 25] . In the subject of highperformance parallel PRNG, the performance of Nvidia's proprietary PRNG cuRAND library [26] has always been a forceful competition, still in some cases researchers have reported their work excelling the performance of the cuRAND of their time [27] .
Regardless of all the performance advancements in the HPC community, there have always been skepticism and critical opinions regarding arithmetic methods for generating random numbers. Highly favored from physics academic community [28] quote by Von Neumann stating "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin" [14] is famously cited to designate the difference between natures of generated random numbers by physical and computational methods. Although, not truly random numbers by definition, computationally generated random numbers have gained serious attention in recent years due to their accessibility, affordability, and their ability of employment in high-performance platforms thanks to the development in parallel hardware devices. Theoretical and practical methods regarding generation and evaluation of computationally generated random numbers have advanced so far that allows these arithmetically generated random numbers to be securely utilized in sensitive applications like cryptography.
It is worth noting that the quality of random numbers is dependent on its target user and application domain [20] . Trivial computer games and sensitive cryptographic applications have different requirements and criteria for measurement of how "good" a random number is. Therefore to ensure that the randomness properties of a sequence is satisfactory for a certain application, the measurement of the quality of the generated sequence is of high significance. Consequently, efforts have been made to develop procedures and tools capable of evaluating the desirable statistical properties of a given random sequence [29] . NIST SP 800-22 [19] is a statistical test suite from the national institute of standards and technology (NIST) designed to probe RNGs for both statistical and cryptographical properties, ensuring the qualification of the passing RNG for its target usage in cryptographic applications.
Taking advantage of the GPU platform in generation of random numbers in comparison to the physical and the optical methods [13, 30, 31] , other hardware platforms [32] such as FPGA [33] and CPU [34] enables the users to strike a balance between the quality of randomness, flexibility, obtainability, affordability, performance, and outstanding performance per cost metrics. Unfortunately, current methods of random number generation on GPU do not take full advantage of this hardware platform. Latest efforts on CSPRNG on high-end GPUs perform poorly in utilizing the massive parallelization capabilities of GPU in order to reach high generation performances [38] . Table I represents some of the related efforts and their obtained performances in terms of throughput. The performance achieved by the state of the art RNGs such as Nvidia's proprietary cuRAND library still does not fully exhibit the full potential of modern GPUs. We believe this performance can be further enhanced while maintaining a reasonable cryptographically properties via applying the bitslicing approach to the implementation of GPU-based CSPRNGs.
IV. Proposed Method
In this section, we propose the implementation of the bitslicing software technique for high-throughput cryptographically safe pseudo-random generation which is based on Mickey 2.0 algorithm. The approach of column-major bitsliced data representation scheme is firstly introduced and thoroughly discussed and their advantages compared to the naïve implementation are explained. Afterward, by incorporating the bitslicing technique into the implementation of our work, a high-throughput algorithm for LFSR architecture is presented and then, the novel implementation of parallel MICKEY algorithm is presented for Random Bit Generation (RBG). Finally, further GPU optimization techniques used in our implementations are discussed.
A. Bitsliced SIMD Vectorization and Data Representation
Bitslicing technique was employed by Biham [10] for implementation of cryptographic algorithms. At the time, the technique was able to speedup the previous implementations of the Data Encryption Standard (DES) by 5x to accelerate the exhaustive search procedure. As mentioned, by the emergence of high-performance, affordable general purpose GPUs, the bitslicing technique has been successfully employed by many works such as [39] , [40] , [41] , and [5] as a software solution for high-throughput demanding cryptographic applications on GPUs. This bitslicingbased implementation, leveraged by the column-major data representation, reaches an unprecedented throughput of Terabits per second (Tbps) both in encryption and decryption.
Here, before getting into the details of the bitsliced implementation of the proposed PRNG via MICKEY 2.0, we discuss the data representation scheme employed in our work which could lead to a remarkable performance for random bit generation. The proposed representation scheme not only perfectly suits the parallel architecture of the GPU, but also, fully utilizes the available datapath of the computational units in the deployment hardware.
Our proposed data representation scheme, uses columnmajor data representation, instead of the conventional row-major representation. By the row-major representation, we refer to the representation used in the way that the data is stored in common programming practices. In our implementation, we store state bits and other supplementary and temporary registers in the columnmajor representation. By doing so, we strive to achieve full SIMD execution of a number of 32 (in the case of single precision calculations) bits from different data chunks at each execution clock cycle. In the case of our LFSR implementation, a batch of 32 bits data stored in a single register, represents state bits from 32 uncorrelated different parallel LFSRs having the same bit significance. Hence, As shown in Figure 1 , the first step is to alter the representation to the column-major data representation.
For a simple LFSR implementation operating in the conventional row-major representation, one or more registers are used to store the state bits of the LFSR algorithm. Hence, in order to store the n-bit LFSR states with a primitive polynomial (feedback polynomial) of p(x) = n−1 i=0 a i x i , a number of n m registers are needed to store LFSR state bits. For instance, for a simple 20-bit LFSR, assuming single precision operations, a single register of 32-bit width is employed to handle the computation of the LFSR state machine.
As thoroughly investigated in the previous section, the shift operation is inherent to the LFSR architecture, and in the conventional naïve implementation, costly bit-level shift and mask operations are mandatory at every single rotation of the LFSR state machine. This would considerably limit the overall performance of the RNG circuit, since these bit-access operations should be executed at each rotation. Moreover, in some scenarios, the register utilization in terms of datapath width of the platform cannot be maximized due to the unused number of bits in the conventional row-major data representation. However, our proposed column-major bitsliced data representation, not only compensates for the aforementioned shortcomings inherent to the common practice naïve implementation, but also maximizes the utilization of the processing units in the GPU.
B. Bitsliced LFSR Implementation
Along with the fact that the costly shift and rotation operations can be further reduced to simple register swapping operations in the bitsliced data representation, here, we investigate the underlying architecture of our proposed bitsliced LFSR implementation on GPU which will be employed for the MICKEY 2.0 algorithm. Moreover, we indicate the properties and advantages of our proposed bitslicing technique accompanied by the columnmajor data representation. As shown in Figure 8 and explained earlier the LFSR conventional implementations suffer from heavy bit-level shift and mask operations. For fair comparison and for the sake of simplicity, consider the naïve implementation of 32 parallel LFSRs governed by the primitive polynomial g(x) shown in Equation 6. Note that there are at least k number of feedback paths in the LFSR algorithm.
As is illustrated in Figure 4 , each of these parallel LFSRs are handled by a single thread which results in the execution of 32 parallel threads. Hence, to generate a total number of M pseudo random bits, each LFSR module should be shifted for M/32 times where a number of 32×k bit-level XOR operations are needed. It is worth noting to know that to use parallel LFSRs in this manner, the shift-registeres should be carefully initialized to eliminate any statistical correlation between the LFSR state machines when the output is not mixed (it is highly recommended to use non-linear mixing before generating the bit stream). Moreover, from the cryptanalysis point of view, the secure threshold for the repeat period (not 2 n − 1 in this case) of the employed parallel system should be estimated. Considering the same scenario for pseudo-random bit generattion by the use of LFSR, in Figure 5 the bitsliced LFSR implementation with our proposed column-major representation is demonstrated. Compared to the previous conventional model, to generate M bits in this proposed methodology, the same number of M/32 LFSR shift cycles are required. However, this procedure could be executed by a single thread. Also, it is worth mentioning that the 32 × k number of costly bit-wise XOR operations (needed at each cycle) is reduced to k number of fullwidth XOR operations. These operations maximize the datapath utilization. Moreover, as shown in Figure 5 , the costly bit-level shift operations are replaced by cheaper and more trivial register swapping operations which can be easily done by changing the references of the registers in the software code. Although, changing references in code might be a burdensome task, it greatly reduces the number of needed instructions in the code. Similarly, in this case the registers should be safely initialized from the perspective of cryptanalysis and the period of the usage should be considered. Note that to maximize the repeat period of the LFSR algorithm for PRNG, it is recommended to choose an LFSR with a higher n value.
C. MICKEY 2.0 Bitsliced Implementation
As explained, MICKEY 2.0 stream cipher is comprised of two 100-bit registers, namely S and R registers. By in- corporating the bitslicing technique into the implementation of the MICKEY 2.0 algorithm, instead of two 100-bit registers, the data representation is altered into columnmajor order and 200 registers each containing 32 bits are employed. Note that our implementation utilizes single precision computation which occupies 32-bit registers. By doing so, 32 parallel Mickey stream ciphers are executed simultaneously. Figure 6 demonstrates the parallel bitsliced Mickey architecture. Rreg i and Sreg i represent the i th bits of the R and S registeres in the bitslicing manner, respectively. Each of these registers stores 32 different bits of the same significance for 32 parallel LFSRs modules. Hence, in our implementation each GPU thread is capable of executing 32 parallel Mickey 2.0 ciphers and 32 random bits are generated by each thread at each clock cycle. Also, note that here, the XOR operation is executed on two 32bits registers and the register-based operations are fully utilized compared to the naïve implementation.
To securely and properly initialize our bitsliced Mickey algorithm, we employ a non-linear function to expand a carefully selected pre-stored random number set which generates an 80-bit Initialization Vector (IV) for each MICKEY module (32 × 80 bits of IV for each thread). It is worth noting that the controller bit functions are designed in the bitsliced representation to calculate all the 32 bits of the controller bits for the feedback procedure. This bitsliced controller is fully optimized to compute the underlying parameters given in Equation 3 . Moreover, the input handler is also executed in 32-bit width mode with no additional overhead and is fully optimized in terms of datapath width. 
V. Evaluation
In this section, we will present the evaluation results of the performance and the performance per cost metrics achieved from the execution of our proposed CPRNG implementation on a number of different CUDA-enabled GPUs. In this study, five Nvidia GPUs are deliberately selected for evaluation purposes. The employed GPUs each have different structural characteristics such as different single and double precision throughputs and memory bandwidths. These features are carefully selected to represent a wide range of execution platforms. We selected these GPU platforms because of the fact that firstly, the range of the selected GPUs completely represents the platforms available to a wide range of users spanning home and enterprise users. Secondly, these GPUs give us a fair comparison with the previously proposed methods. Moreover, we demonstrate the randomness robustness and reliability of the generated bits by discussing the NIST statistical test results for our implementations.
A. Setup
This section elaborates on the specification of the hardware platforms used for the evaluation of our proposed method. GPU platforms GTX 980, GTX 980 Ti, and GTX 1050 Ti on systems with two Intel XEON E5 2697 V3 CPUs clocked at 2.6 GHz and 128 GB of DDR3 RAM were used for the evaluation process. Moreover, to prove the scalability of the proposed method Kepler K20 and GTX 480 GPU are also utilized for evaluation which are employed in a Virtual Machine environment with the same virtually dedicated specifications. Table II shows the specification details of the aforesaid GPU platforms in terms of the processing power and the memory bandwidth. B. Performance Figure 7 illustrates the achieved performance for our proposed method on different GPU platforms. The best result obtained on GPUs 980 Ti is acquired in the following manner of executing the implemented CUDA kernel code with fixed parameters of thread blocks and thread per block set to 64 and 256, respectively. The loop size of the code is varying between 4,400 to 13,000, yielding to a different performance throughput. The cuRand results here denoted by cuR 480 and cuR 980Ti are evaluated using the Mersenne Twister algorithm as the defualt cuRand method for RNG. Also note that the implementation results on Kepler K20 is based on a 64-bit bitsliced architecture. This is done due to the fact that K20 is far more efficient in the double precision computation while other employed GPUs are designed for applications with single precision operations. 
C. Normalized Performance Evaluation
Due to the lack of access to some of the GPU platforms used in previous works that are currently outdated platforms and to deliver a fair comparison, we follow the conventional method of normalizing the results of our proposed method and related works on parameters of performance per processing power which is shown in Figure 8 .
D. Statistical Tests
We use the NIST SP 800-22 version sts-2.1.2 for testing the statistical and cryptographical properties of our generated random sequences. Bitstreams generated by our implementation successfully pass all statistical tests. As recommended by the NIST tests, the items are executed using 1,000 instances of 1Mb of random bits generated by our solution. Note that the significant value here is considered to be α = 0.01, and the results of P − V alue verify the randomness of the input bitstream. 
VI. Discussion & Conclusion
In this work, we propose a high-throughput fully parallel cryptographically secure pseudo-random number generator using the bitslicing technique. In this technique, the data from the conventional row-major representation is altered into column-major representation for the purpose of full utilization of the computation datapath of the employed device. Using the bitslicing technique on the LFSRbased cryptographically secure MICKEY 2.0 stream cipher allows of high-performance random number generation by the elimination of the shift and mask operations. Various supplementary techniques such as utilization of shared memory and coalesced memory accesses are also employed to further increase the performance. One of the main concerns of employing GPUs for generation of random numbers is delay, which compared to the similar computational methods including ASIC, FPGA, and physical methods such as optics may be considered as the major drawback of these relatively general purpose computational platforms. The proposed method can prove tremendous advantageous when employed on applications where slight delay is not a matter of great concern and the performance and the cost efficiency of the solution are considered. Our proposed methodology achieves the outstanding throughput of 5.2 Tb/s on Nvidia GTX 980 Ti, outshining the Nvidia's proprietary cuRAND library by 6.6x while striking a notable balance in criteria of performance per cost.
