Abstract-A binary de Bruijn sequence with period 2 n is a sequence in which every tuple of n bits occurs exactly once. De Bruijn sequence generators have randomness properties that make them attractive for pseudorandom number generators and as building blocks for stream ciphers. Unfortunately, it is very difficult to find de Bruijn sequence generators with long periods (e.g., 2 128 ) and most known de Bruijn sequence generators are computationally quite expensive. In this article, we present "OcDeb-k-n" and the first hardware implementation of de Bruijn sequence generators. OcDeb-k-n efficiently computes a composited de Bruijn sequence where k levels of composition are added to a de Bruijn sequence of period 2 n . Numerically, OcDeb reduces the bit operations used for computing the feedback function significantly from Qðk 2 þ nkÞ to Qðk log k þ log nÞ. Furthermore, it enables efficient parallelization and hardware retiming. Comprehensive result analysis is conducted for 65 nm ASIC technology. For example, OcDeb-32-32 has an area of 643 GE with 1.45 Gbps performance, and with parallelization it generates up to 25.4 Gbps at the cost of 4,787 GE. The area of OcDeb-512-32 generating a de Bruijn sequence of period 2 544 is 7,304 GE and the performance is 1.25 Gbps.
Ç

INTRODUCTION
I N cryptography, pseudorandom number generators (PRNGs) are an essential component in cryptoprimitives such as stream ciphers, digital signature algorithms and cryptographic protocols. A good PRNG is characterized by assessing its output sequences' randomness properties such as long period, balance, equal distribution of runs, ideal t-tuple distribution, ideal two-level autocorrelation, low crosscorrelation, and high linear complexity [5] , [11] , [12] , [21] . The NIST test suite SP800-22 for cryptographic applications includes these statistical properties [22] . Feedback shift registers (FSRs)-a technique for generating sequences recursively-are often used to construct PRNGs and stream ciphers due to their potential for efficient hardware implementations. There are two types of FSRs, linear feedback shift register (LFSR) and nonlinear feedback shift register (NLFSR). The theory of LFSRs is well-studied [12] , but few results are known for NLFSRs. LFSRs that generate m-sequences have some good randomness properties, such as long period, balance, ideal run and t-tuple distribution, two-level autocorrelation, and three-valued crosscorrelation; but the unpredictability of LFSRs' outputs measured by the linear complexity is relatively low [12] . On the other hand, an NLFSR that generates a maximum period sequence, known as a de Bruijn sequence, has the following randomness properties: long period, balance, ideal tuple distribution and high linear complexity.
However, some span-n sequences constructed from de Bruijn sequences have very low linear complexity. Previous techniques for generating de Bruijn sequences of large period, based on NLFSRs and algorithmic approaches, are inefficient. This paper presents OcDeb, a family of efficient de Bruijn sequence generators. OcDeb-k-n, generating a sequence with period 2 nþk , is a composited de Bruijn sequence generator, where k levels of composition are added to a de Bruijn sequence generator of period 2 n . OcDeb requires significantly fewer bit operations to output one bit than known techniques and it scales better by being almost linear in k and logarithmic in n. Besides, it can be efficiently parallelized, and by retiming, can achieve almost constant clockspeed regardless of increasing k and n.
Many stream ciphers and PRNGs use FSRs as building blocks [2] , [3] , [4] , [14] , [15] . For instance, all three eStream profile 2 finalists (Grain, Trivium and Mickey) use FSRs as building blocks in their constructions [3] , [4] , [14] , [15] . The long periodicity of an output sequence of a PRNG is a salient property. Currently, LFSRs are often used in designs to ensure the exact or a lower bound of the period, e.g., in [14] and [2] , and NLFSRs are used to introduce nonlinearity in the designs. It is believed that the NLFSRbased designs are resistant to algebraic cryptanalytic attacks such as algebraic attacks and cube attacks. But their output sequences do not guarantee any randomness property. The goal of our work is to develop PRNGs that have the maximum period, and so inherit the balanced property; have high linear complexity, which means their output sequences are difficult to predict; and are efficient to implement. These PRNGs are intended to be combined with a filtering function and used in cryptographic primitives such as stream ciphers, cryptographic PRNGs, digital signature generators, etc.
A binary de Bruijn sequence of length 2 n contains exactly one instance of each n-tuple in one period. Furthermore, every t-tuple (1 t n À 1) occurs equally likely. An n-stage NLFSR can be used to generate a de Bruijn sequence. De Bruijn sequences have the maximum possible period of 2 n for n-bits of internal state and a linear complexity of at least 2 nÀ1 þ n [5] . Finding a simple and easily implementable feedback function with good randomness for a de Bruijn sequence with a long period is computationally infeasible. The largest de Bruijn sequence generator found so far has a period of 2 32 [10] . Previous algorithmic techniques [7] , [8] , [9] , [13] , [24] to construct de Bruijn sequence generators result in PRNGs that are unreasonably large and slow. The composited construction [18] , [20] is a technique for generating de Bruijn sequences based on an NLFSR that can generate sequences of arbitrarily long period and high linear complexity. These features motivated us to base our work on the results in [18] , [20] .
OcDeb-k-n stands for Optimized composited De Bruijn sequence generators parameterized by k and n, which is a family of NLFSR-based PRNGs that are functionally equivalent to composited de Bruijn sequence generators. However, they are significantly more efficient: achieving a 45Â improvement in the performance/area ratio for a generator with period 2 64 , and the improvement is more pronounced for larger instances in the family. Asymptotically, the number of bit operations used to compute the feedback function is reduced from Qðk 2 þ nkÞ to Qðk log ðkÞ þ log ðnÞÞ. Additionally, OcDeb can be efficiently parallelized. A few highlights: OcDeb-32-32 has an area of 643 GE 1 with 1.45 Gbps performance and after parallelization it generates up to 25. 4 Gbps with an area of 4,787 GE. OcDeb-512-32 has an area of 7,304 GE with 1.25 Gbps performance. These results are after logic synthesis for 65 nm ASIC technology.
The rest of the paper is organized as follows. Section 2 provides background on de Bruijn sequences and Mandal and Gong's composited construction. Sections 3 and 4 prove that our OcDeb algorithm is functionally equivalent to the composited construction. Section 5 describes how to efficiently parallelize OcDeb. Section 6 describes and proves the correctness of a retiming optimization. Section 7 analyzes performance and area results on OcDeb and demonstrates that OcDeb passes the NIST SP800-22 tests.
BACKGROUND
Notations
To make our math consistent with our figures, we write vectors and ranges of natural numbers in decreasing order. To minimize notational clutter, we overload operations on elements of F 2 , Booleans and binary (1, 0) .
The integers from j down to i inclusive. v i
The i th element in vector v.
The vector of elements j down to i of vector v. v ½:i
The elements from the end of v down to i. jvj
The length of vector v.
The set of integers from 0 to n À 1.
The set of odd (even) integers from 0 to n À 1.
db-ðk; nÞ
Order-k composited construction on top of a de Bruijn sequence generator of period 2 n . OcDeb-k-n Optimized db-ðk; nÞ. x
Reserved symbol for shift register, x ¼ x ½:0 ¼ x ½nþkÀ1 : 0 for db-ðk; nÞ and OcDeb-k-n.
The j-composition of x i , located at the i th column, j th row in figures. FeedbackIf an n-stage FSR's feedback function is fðxÞ, its expression feedback expression is x n þ fðxÞ. Conversely, given a feedback expression gðxÞ, the feedback function is gðxÞ þ x n .
&
We use gray to mark a set of nodes that make up an input to a function in figures.
Composited Construction of de Bruijn Sequences
We start this section by providing definitions of de Bruijn sequences and span-n sequences.
Definition 1 (de Bruijn sequence). A binary sequence (a)
with period 2 n is a de Bruijn sequence iff every length-n subsequence (a ½iþnÀ1 : i ) occurs exactly once in one period.
Definition 2 (span-n sequence). [11] A binary sequence (a) with period 2 n À 1 is a span-n sequence iff every non-zero length-n subsequence (a ½iþnÀ1 : i ) occurs exactly once in one period. It is also called a modified de Bruijn sequence iff it is obtained from a de Bruijn sequence with period of 2 n by deleting one of the 0s in the subsequence of n 0s.
For a detailed description of composited construction, the reader is referred to [20] . The presentation of composited construction here is made close to hardware implementations and set up for the ease of introducing OcDeb.
The composited construction starts with a de Bruijn sequence generator of period 2 n and then adds k levels of composition and a large sum-of-products term to the original feedback expression in order to construct a de Bruijn sequence generator of period 2 nþk . Assuming Fig. 1 is a span-n sequence generator, we create a de Bruijn-sequence generator with period 2 n (Fig. 2 ) by adding All0ðx ½nÀ1 : 1 Þ, which tests if all bits x ½nÀ1 : 1 are 0 (Definition 3). In the figures, the function G is drawn with all n À 1 bits of x ½nÀ1 : 1 as inputs. In reality, G almost always is a function of a relatively few bits (e.g., 14 bits for the db-(32, 32) example in this paper). The k levels of composition is achieved by compositing the original feedback expression with the function j À k ðÁÞ, which is recursively defined in Definition 4.
Definition 4 (Function j
ÀðxÞ and compositing it with itself). 
After k levels of composition in the feedback expression x n þ Gðx ½nÀ1 : 1 Þ þ x 0 þ All0ðx ½nÀ1 : 1 Þ, all elements of x are replaced by their k levels of composition, e.g., x i is replaced by j À k i . The All0 function becomes All0 k (Definition 5) to express this k levels of composition. Besides, a sum-of-product term (see Definitions 7 and 8) is added. It checks if an odd number of vectors of j À j i ðxÞ, where i 2 ½n þ kÀ 1 À j; 1, j 2 ½k À 1; 0 have an alternating sequence of 1 and 0 s. The feedback expression after k levels of composition is in Definition 8. In the remainder of the article, the input vector to a function is sometimes left out when there is no ambiguity in deducing it from the context. Definition 5. All0 k ðxÞ ¼ 1 iff all of the elements from the shift register x after k levels of composition are 0
We use 'x' exclusively for the shift register. In Definition 6, we use the symbol 'v' for the input vector to X because as shown in Definition 7, an element in the input vector of X may not be a shift register element. For example it could be j Definition 7. X j ðxÞ ¼ 1 iff all elements with even indexes from x after j levels of composition are 1 and all elements with odd indexes after j levels of composition are 0
Definition 8 (feedback expression of db-(k; n)). ccdB k n ðxÞ is the feedback expression of db-ðk; nÞ whose starting de Bruijn sequence generator's feedback expression is
The Mesh Structure
For easy reference, we call the set of j À j i instances which are examined by mesh k n , that is f j À j i j j 2 ½k; 0; i 2 ½n þ k À 1 À j; 1g, "the mesh structure" or "the mesh". We call j À j i a "node" on the j th row and i th column of the mesh. As mentioned before, we can calculate each node in the mesh either recursively or directly. By computing all nodes recursively, we build a union of XOR trees, which will be called an XOR mesh hereafter. For example, Fig. 5 shows the computation of the mesh for db-ð6; 4Þ using an XOR mesh. In Fig. 5 , x 10 has the feedback value and it appears implicitly in the term j À 6 4 ðxÞ. We factor it out by adding x 10 . From the figure, we can also tell mesh 6 4 contains almost all of the overhead from composited construction and is the most expensive part of the feedback function.
We pick the ðk ¼ 6; n ¼ 4Þ composited de Bruijn sequence generator as a running example to show the original construction (i.e., db-(6, 4)) and to introduce OcDeb-6-4. Fig. 5 shows a straightforward implementation of db-ð6; 4Þ, where one of the components is an XOR mesh that computes all the mesh nodes that are inputs to G, All0 and X. We call it the mesh-based implementation. Proof. For db-ðk; nÞ, there are ðn þ k À 2Þ XOR gates at the bottom, n þ k À 2 À j XORs on each row j, and ðn À 1Þ XORS at the top, resulting in a total of ð2n þ k À 3Þk=2 XORS. Other components of this mesh-based implementation, e.g., All0, X and G, do not have higher complexity. Thus, the overall number of bit-level operations is Qðk 2 þ nkÞ. t u
Baseline Implementations
An alternative implementation also includes computing all the mesh nodes, but instead of utilizing an XOR mesh, each node is considered separately and computed in the direct way. For example, in Fig. 4 , we show how to compute j À 4 0 in the direct way. The number of XOR operations used to compute j À j i in the direct way is given in Lemma 2. This alternative implementation requires the minimum number of XORS for each node. But when all mesh nodes need to be computed, our experiments show that it costs more XORS than the mesh-based implementation. Intuitively speaking, it is because, similar to a tree of XORS, an XOR mesh also reuses XORS to compute instances of j À j i . A more general term for this is common subexpression elimination. We choose the mesh-based implementation as a baseline.
Lemma 2. j À j ðxÞ is equal to a sum of 2
HðjÞ elements from the shift register x, where HðjÞ is the Hamming weight of the binary representation of j.
We omit the proof of Lemma 2. The reader is referred to [23] .
In [20] , Mandal and Gong show how to compute mesh k n iteratively, which is essentially a serialization of a meshbased implementation. Each row and its associate function (that is G and All0 for the top row, X for the rest) is computed in one clock cycle. In total, ðk þ 1Þ clock cycles are needed to compute mesh k n . As nodes from a row in the mesh are only dependent on the nodes from the row below, we only need ðn þ k À 1Þ extra flip-flops to store the nodes' values from the row below. The number of bit operations in one clock cycle is Qðk þ nÞ, therefore the overall complexity to compute mesh k n is Qðk 2 þ nkÞ.
INITIAL CONSTRUCTION
The most computationally expensive part of the feedback function is mesh k n ðxÞ. In order to efficiently compute the feedback function, we construct an equivalent but significantly simpler formula for mesh k n ðxÞ. The key insight for constructing a simple formula is to recognize unnecessary nodes in the mesh, which is detailed in Section 3.1. We can decompose a composited function whose input is a partial shift register into a simpler and more meaningful function operating on nodes from the mesh, e.g.,
Þ. In Section 3.3, we introduce transformation matrices that transform partial shift registers to vectors of nodes in the mesh. One of the advantages of having a matrix is that we can use it to represent a vector of nodes more compactly, e.g., j
In Section 3.4, we present one equivalent yet much simpler constructed formula for mesh k n ðxÞ and show that the number of bit operations used to compute it is Qðk log ðkÞ þ log ðnÞÞ, which is much better than the original formula's Qðk 2 þ nkÞ. In Section 4, we improve on it and introduce more formulae that have almost the same efficiency. 2 Collectively, they become a class. Interestingly, this class enables other optimizations, such as efficient parallelization in Section 5 and hardware retiming to build high-speed circuits in Section 6.
OcDeb-k-n is built upon a class of constructed simple formulae for mesh k n . We assume both n and k are even when describing and analyzing our construction. For other cases, the construction and analysis are analogous.
Preview of OcDeb-6-4
Before getting into details of the construction, let us explain the construction at a high level with an example. We will revisit each function in the figure in detail as we move along. From now on, details of the computation of a subset of mesh nodes, such as M ET in Fig. 6 , will be omitted from the pictures for brevity.
Intuition Behind Our Construction
Let us take mesh 6 4 as an example. From Fig. 5 we can tell that based on the original formula of it, we need to compute all the nodes in the mesh, which is costly. This becomes worse when k and n get larger, for example, when k ¼ 40; n ¼ 24, 1,700 XOR gates are used in computing the mesh (according to the proof of Lemma 1).
We can also tell from the same picture that all nodes are linearly related and by knowing the values of any 9 independent 3 nodes, we are able to deduce the others in the mesh. Therefore, it is viable to keep only nine independent nodes and have all redundant nodes removed in a formula for mesh 6 4 . Two criteria for choosing the independent nodes are: i) they are relatively cheap to compute. ii) the formula used to compute mesh 6 4 is simple. As shown in Fig. 6 as will be explained in this article, one of our choices for the independent nodes is the leftmost two nodes on even rows and the top row nodes. The functions that we construct that use the nodes (i.e., SomePatE e , All0 and All1) are efficient.
Transformation Matrix
We have seen the use of matrices in Section 3.1. Because all nodes in the mesh are linearly dependent on the shift register, m nodes make m linear equations, with the left hand sides being the m nodes and right hand sides being expressions of elements from the shift register using only XORs. These m linear equations can be rewritten into a vector equation or a matrix equation. Therefore, we can always find the corresponding matrix for a vector of any nodes in the mesh.
The computation of nodes in the mesh is ubiquitous in functions defined due to compositions with j ÀðÁÞ. By introducing transformation matrices, we make that computation explicit and give it a compact representation, e.g.,
An informal definition of M T along with other constant matrices of interest and a matrix operator are as follows. For db-ðk; nÞ or OcDeb-k-n, recall that the length of the shift register x is ðn þ kÞ,
The leftmost nodes on all rows, referred to as diagonal nodes hereafter:
The leftmost two nodes on even rows plus the leftmost node on the top row:
x ½nÀ1 : 1 M T The top row of nodes in the mesh:
Take M's columns between j 2 and j 1 inclusively, where j 2 ! j 1 and counting from right starting from 0. Now we show by example what these special matrices look like and demonstrate that one matrix is determined by the partial shift register multiplied by it and the nodes to generate in the mesh. 
To calculate the constant matrix M T when k ¼ 6; n ¼ 4, based on Definition 4, we expand the following nodes
Similarly, we can get M E by expanding j À 0 9 ; j À 0 8 ; . . .. One important thing to note about implementing these special matrices is that we use the expanded expressions of the mesh nodes to compute them, e.g., the implementation of M ET in Fig. 6 . In this way, we do not store matrices.
Furthermore, a byproduct of a transformation matrix is that it provides an easy way to test independence in linear equations thus also independence in nodes. Assume that there is a transformation matrix M that maps m bits of the shift register to m nodes in the mesh, then these m nodes in the mesh are independent iff the determinant of M is non-zero. 3. One way to check independence will be shown in Section 3.3.
Our Initial Construction
We present our initial construction of a simpler formula for mesh k n . There are two key insights behind the construction: i) At most one row can satisfy X (have an alternating 1, 0 pattern). ii) We can detect whether there is a row satisfying X by looking at just the top row and the leftmost node of every row. Together these optimizations allow us to look at only the leftmost node of each row and the top row of nodes. Optionally, we can further exploit a trade-off between storage and computation by using a counter to detect if the top row nodes are all 0s and all 1s. In ASIC implementations, this proves to be beneficial when the throughput is 1 bit/cycle but becomes harmful when throughput increases. Looking at Fig. 7 , intuitively we know that if a row k 1 satisfies X, then the row above it is all 1s and all higher rows are all 0 s, which prevents a second row from satisfying X. With Lemma 3, we have reduced the problem of determining whether an odd number of rows satisfy X to determining whether there is any row that satisfies X.
Diagonal Optimization for a Specific Row
We now show that determining whether a row k 1 satisfies X can be done by checking if the top row is all 0 s and if the diagonal from row k 1 and above satisfies a particular pattern (Lemma 4). Lemma 4 is the first step towards removing unnecessary nodes (they are the nodes that are not on the diagonal nor the top row) and their k associated functions XðÁÞ. Theorem 1 will complete it.
We define PatDðÁÞ (Definition 9) to examine its input vector for a particular pattern. But first we introduce colðÁÞ which returns the column at which an element from a vector actually resides in the mesh colðx i Þ ¼ i; if x i is a bit from the shift register. 
To prove the forward direction of Lemma 4, (with the help of Fig. 7 ), we know that if row k 1 satisfies X, then row ðk 1 þ 1Þ is all 1s and all higher rows are all 0s. To prove the reverse direction, from All0 k 2 ðv ½n 2 : 1 Þ ¼ 1 and PatDðx ½:nÀ1 M D½n 1 Ànþ1 : n 2 Ànþ1 Þ ¼ 1, we are able to deduce the value of every node in the mesh between row k 1 and k 2 inclusive, after which we can verify the nodes on row k 1 satisfy X k 1 ðx ½n 1 : 1 Þ ¼ 1.
Diagonal Optimization for the Complete Mesh
We now combine Lemmas 3 and 4 to prove Theorem 1, which reduces the problem of determining if the mesh satisfies mesh k n to just examining the top row and the diagonal. Before presenting Theorem 1, we define SomePatD k n ðxÞ, which checks the existence of a diagonal segment that satisfies PatD.
Definition 10. SomePatD k n ðxÞ ¼ 1 iff there is a diagonal segment starting at a row between 0 and ðk À 2Þ inclusive, let us say k 1 , and ending at row k such that PatDðx ½:n M D½n 1 Àn : 0 Þ ¼ 1, where 
The top row (row k) is all 0s and none of the rows between 0 and ðk À 2Þ inclusive is the starting point for a diagonal segment that satisfies PatDðÁÞ, that is SomePatD k nÀ1 ðxÞ ¼ 0. The top row is all 1s and the head node on row k À 1 (i.e., j
We now sketch the proof of Theorem 1. Using Lemma 3, which says that at most one row satisfies X
ðxÞ is 0, which matches the top line of the right-side of Theorem 1. For the second case, the only possible row that can satisfy X without causing the top row to be all 0s is the row just below the top row. It is not hard to see this row satisfies X iff the top row is all 1s (All1 k ðx ½nÀ1 : 1 Þ ¼ 1) and the head element of this row, i.e., j À kÀ1 n , is 0 under the assumption that n is even.
Using a Counter for All0, All1
Up to this point, we have eliminated the need to examine any of the interior nodes (nodes that are not on the diagonal nor the top row) in the mesh. Here we provide a different way to compute All0 and All1 besides computing it explicitly. Because the bottom of the mesh is a shift register, each node on a row of the mesh will have its left neighbor's current value in the next cycle. We can replace the explicit test that the ðn À 1Þ nodes on the top row are all 0s (or all 1s) with a simple counter that counts the number of consecutive cycles that the head node in the top row is 0 (or 1) and is compared with ðn À 1Þ. This optimization demonstrates a trade-off between storage and computation. As will become clear in Section 5, we should resort to explicitly computing All0 and All1 when the composited de Bruijn sequence generator is parallelized. Based on Theorem 1 and Definition 10,
Then if we use a counter in the computation of All0 and All1, we can further rewrite the above formula into
Fig
The Complexity
Recall from the previous section that the simple formula for mesh
Because every software/hardware compiler has the capacity for common subexpression elimination, in order to have a better estimation of the actual implementation complexity, we study how to apply this technique on top of our simple formula (1), and take it into account while calculating the complexity of the feedback function. We generalize the above example and without loss of generality, we assume that we compute the diagonal nodes from the lowest row to the highest. For each diagonal node, the net XORs required is (total XORs) À (shared XORs with diagonal nodes below it). Let us say we want to compute the diagonal node on row k 1 (i.e., j À k 1 n 1 ) and k 1 can be written as a sum of powers of 2
where Hðk 1 Þ denotes the Hamming weight of k 1 . j
shares common subexpressions with the diagonal nodes below it on the rows k 1 À 2
. . . ; k 1 À 2 i Hðk 1 Þ , which are:
Based on the proof of Lemma 2 in [23] , there is one-toone correspondence between indices of elements from the shift register and the objects from the power set of f2 i 1 ; 2 i 2 ; . . . ; 2 i HðkÞ g. The one-to-one mapping, denoted as set2index, is defined as: i) set2indexðfnum1; num2; . . .gÞ ¼ num1 þ num2 þ . . . and ii) set2indexðfgÞ ¼ 0. 
Similarly, to make up for the difference in the subscripts, we look for objects from the power set that have 2 i 2 in them. We do not want 2 i 1 in these objects, because we just considered this case. There are 2 HðkÞÀ2 such objects of the power set, corresponding to the 2 HðkÞÀ2 shift register elements that are shared by j À 
Theorem 2. The number of bit operations used to compute the feedback function of OcDeb-k-n is bounded by Qðk log ðkÞþ log ðnÞÞ.
Proof. mesh k n in the feedback function of OcDeb-k-n is shown in (1) and it dominates the complexity.
We first compute the cost for computing the diagonal nodes (i.e., implementing M D ), denoted as costðdiagÞ. Let lenðkÞ be the minimum number of bits required to represent k. Then, due to (2), The equality holds only when k þ 1 is a power of 2, i.e., there are exactly ð lenðkÞ i Þ rows with Hamming weight i for 0 i lenðkÞ and recall that i net XORs are required to compute the diagonal node on a row with Hamming weight i. We continue to transform and simplify the above inequality to get an upper bound on the cost of computing diagonal nodes costðdiagÞ X lenðkÞ i¼0 lenðkÞ i i
The lower bound can be derived similarly from
Therefore, costðdiagÞ is tightly bounded by Qðk log ðkÞÞ. The state machine that computes All0 fsm and All1 fsm includes a counter and comparators. All of the components of the state machine have a number of bit operations that is linear in the length of the counter, which is log ðnÞ. The number of bit operations in computing SomePatD is linear in k. The function G in the feedback function has a constant length input vector. Its cost is bounded by OðkÞ.
In summary, the number of bit operations needed to compute the feedback function of OcDeb-k-n is bounded by Qðk log ðkÞ þ log ðnÞÞ. t u
IMPROVED CONSTRUCTION
The improvements come in two flavors but all focus on the function SomePatD. One is further simplifying SomePatD k nÀ1 so that in the end, effectively it becomes a formula, named SomePatE k nÀ1 , that checks patterns on the leftmost two nodes on even rows and the leftmost node on the top row. The other is introducing variations of SomePatD k nÀ1 which check patterns on interior diagonals. In the end, we combine these two improvements and it results in a class of formulae for mesh k n ðxÞ that are even simpler than the one in Theorem 1.
Further Reducing Bit Operations in SomePatD k nÀ1
Recall that SomePatD
SomePatDðÁÞ is a disjunction of PatDðÁÞ with decreasing size of input vector. There are simplification opportunities between two neighbor disjuncts (e.g., PatDð j 
Next, we replace each diagonal node on every odd row with an addition of the leftmost two nodes on the even row below (e.g., j ). One obvious benefit in doing so is that many XORs used to compute the diagonal nodes on odd rows are saved because: i) per Lemma 2, a node on an even row always requires only about half the XORs used to compute a node on the odd row above it. ii) one of the even row nodes in the substitute expression does not incur any overhead because it is on the diagonal (e.g., j À 4 5 is on the diagonal and needs to be computed anyway). Additionally, All0ðÁÞ is very amenable to this replacement as shown by the first simplification step in the following example:
We apply substitutions of diagonal nodes on odd rows to (4) and it becomes
The first simplification step & removes 1 XOR for each substitution that happens in the input vector to All0ðÁÞ. The second step & uses the Karnaugh map technique on to give us the minimal formula in terms of those variables. Formula (5) can be thought of as looking at the leftmost two nodes on rows 0, 2, 4 and the leftmost node on row 6 for patterns that are consistent with the patterns on diagonal nodes between 1 and 6 or between 2 and 6 that are accepted by PatDðÁÞ. In other words, if patterns accepted by (5) on the relevant nodes exist, then the pattern accepted by PatD must exist on the diagonal segment between row 1 to 6 or the segment between row 2 to 6 and vice versa.
Originally, based on the net XOR consumption result in (2), 9 XORs are needed to compute the diagonal nodes j À (3). 4 After the aforementioned simplifications, we get (5), where 6 XORs are needed to compute the leftmost two nodes on even rows (except the highest row where only the leftmost node is computed), 1 XOR, 4 ANDs, 5 NOTs and 1 OR are needed for the rest. Therefore, 2 out of 9 XORs and 3 out of 7 ANDs are saved. The same approach can be followed for simplifying the other pair in SomePatD ; . . .Þ, replacing diagonal nodes on odd rows is still useful. All in all, these simplifications give considerable savings in bit operations as will be shown in Table 1 for general cases.
We proceed and generalize the simplification result (5) for any even n and even k.
Definition 11.
PatE e ðv ½h :
PatEðv ½h : l Þ ¼ PatE e ðv ½h : l Þ; when colðv hÀ1 Þ is even PatE o ðv ½h : l Þ; when colðv hÀ1 Þ is odd:
( Now we lift the simplification result for a pair of disjuncts (PatDðÁÞ _ PatDðÁÞ) to SomePatDðÁÞ. 
Introducing a Class of Formulae for mesh k n
Now we introduce the other instances in the class of simple formulae for mesh k n . In Fig. 9 , all nodes in the mesh of db- (6, 3) are shown with dots. If we apply Lemma 4, the top row nodes being all 0s and the diagonal nodes between the top row and row 2 satisfying PatD implies row 2 satisfies X. Interesting enough, as shown in the figure, if the top row's nodes are all 0s and the first interior diagonal between top row and row 2 satisfies PatD, row 2 also satisfies X. This is not a coincidence. In fact, this is true for PatD applied on any interior diagonal in Fig. 9 . There are only two interior diagonals in the figure; the second one is ð j The above equivalence extends Lemma 4, and as a result also extends Theorem 1. Specifically, there are two formulae that are equivalent to All0 6 ðx ½3 : 1 Þ^SomePatD 
4. These numbers are after considering common subexpression elimination at two places i) the diagonal nodes. ii) between ðxÞ) in Corollary 1 with its equivalent expressions from Corollary 2 gives rise to Corollary 3, which shows a class of simple formulae for mesh k n and all formulae have the same asymptotic complexity. Corollary 3 will prove to be crucial for efficient parallelization and hardware retiming. Fig. 10 shows two formulae from the class for mesh 6 4 and they will be described more later. Corollary 2. For any 1 i n À 1,
Corollary 3. For all 1 i n À 1,
Recall that x ½:nÀ1 M D generates the diagonal nodes. Interior diagonals are generated by shifting x ½nþkÀ1 : nÀ1 to the right and multiplying it with the same matrix M D . For example, the i th interior diagonal is generated by x ½nþkÀ1Ài : nÀ1Ài M D and it is checked by SomePatD k nÀ1 ðxÞ. Similarly, for one instance from Corollary 2, say i ¼ i 1 , SomePatE k nÀi 1 ðxÞ checks patterns on a vector made of two neighbor nodes that are ði 1 À 1Þ away from the leftmost nodes on even rows and one node that is ði 1 À 1Þ away from the leftmost node on the top row, which is x ½nþkÀi 1 : nÀi 1 M E , that is x ½nþkÀ1 : nÀ1 M E shifted to the right ði 1 À 1Þ positions.
Figs. 10a and 10b are two instances using SomePatE 6 3 and SomePatE 6 2 respectively. According to Definitions 12 and 13,
Nodes in the figures with no incoming lines are there simply to provide relative positions so that it is clear the input nodes to SomePatE e are the ones to SomePatE o shifted to the right one position.
EFFICIENT PARALLELIZATION
In general, to achieve a parallelism degree of d, we make d copies of feedback functions to compute d future values by moving inputs to the feedback functions to the left and decompose an n-stage shift register to an array of ðn=dÞ-stage shift registers. Identifying and utilizing common subexpressions among the copies of feedback functions to save computation resource is important for efficient parallelization. For example, Fig. 11 shows how to double the throughput of an LFSR with the feedback function fðx ½3 : 0 Þ ¼ x 1 þ x 2 þ x 3 and to save 1 XOR this parallelization reuses the computation of ðx 2 þ x 3 Þ shared by f and Nf, where Nf denotes f is evaluated after 1 clock cycle
The formula of mesh k n for db-ðk; nÞ is defined in Definition 8 and can not be efficiently parallelized, because of limited common subexpressions between X and NX, which is the function that extensively appears in mesh k n . The inefficiency in parallelizing X is because shifting its input vector may toggle the parity of the column where an element from the vector is positioned. For example,
Recall that OcDeb removes X and introduces SomePatE. Note that x nþk denotes the current feedback value and x (shorthand for x ½nþkÀ1 : 0 ) is the ðn þ kÞ-stage shift register in OcDeb-k-n. All0 and All1 are amenable to parallelization as shown in Table 2 , once they are computed straightforwardly instead of using a counter (Section 3.4.4). Let us focus on maximizing common subexpressions between SomePatE k nÀ1
and NSomePatE k nÀ1 . In Table 2 ; rewriteðNmesh k n ðxÞÞ is critical and it replaces SomePatE k nÀ1^A ll0 k with SomePatE k nÀ2Â ll0 k based on Corollary 2. According to Definition 13,
Thus SomePatE k nÀ2 ðx ½nþk : 1 Þ can be rewritten as in Table 2 . Note that most subexpressions in
also appear in All0ðx ½nþkÀ1 : nÀ1 M E ½kÀ2 : 0 Þ, which is part of
All in all, in the feedback function of OcDeb, most of mesh k n and Nmesh k n are shared and therefore, OcDeb can be efficiently parallelized.
HARDWARE RETIMING
In hardware non-parallel implementations, All0 and All1 are computed by a state machine while SomePatE e and G are purely combinational. The critical path in the design from Fig. 8 goes through SomePatE e . When we insert a register after SomePatE e on the critical path with the intention to increase clockspeed, we also need to shift the input to SomePatE e to the left so that it calculates values 1 clock cycle ahead to overcome the 1 clock cycle delay from the register. However, had we chosen the leftmost two nodes on each even row, shifting the input to SomePatE e to the left would cause the combinantional logic that we intended to separate to become connected in series again. It voids the performance benefit that could be brought by inserting a register. Therefore, we need to shift the input nodes to the right first, then add a register, and finally shift left in order to make retiming successful (Figs. 12a and 12b) . Shifting the input nodes to the right is valid due to Corollary 2.
In Fig. 12b , we successfully divide the critical path that goes through SomePatE e to two paths with shorter delay. For instance, by applying retiming as described above, the clockspeed of OcDeb-40-24 is increased from 1.02 to 1.35 GHz.
RESULTS
In this section, we compare the asymptotic complexity of OcDeb against other algorithms for generating de Bruijn sequences (Section 7.1), do a detailed analysis of area and performance for OcDeb with periods up to 2 544 (Sections 7.2 and 7.3), present implementation results for parallel OcDeb-32-32 (Section 7.4), and finish with a summary of the NIST SP800-22 randomness tests (Section 7.5).
We use two starting span-n sequence generators: Gammel et al.'s 32-stage NLFSR from [10] for its long period and a 24-stage primitive NLFSR from [19] for their near optimal linear complexity. The area and performance results are 
TABLE 2 A List of Functions
Function Expression
after logic synthesis for an STMicroelectronics 65 nm cell library using nominal delay values. Logic synthesis was done by Synopsys Design Compiler with ultra high mapping and optimization effort unless otherwise specified. [1] 's method has exponential memory usage. Jansen et al. [16] 's method includes verifying cycle representative, which is practically impossible to do when the sequence being constructed has long period. The implementation complexity of de Bruijn sequence generators constructed by this method is not provided because it is highly dependent on the feedback function's cycle structure. The other three methods in Table 3 are very expensive based on our estimates or experiments. The area for Mandal and Gong [21] 's approach at 1 bit/cycle is 4,876 GE, based on our experiments.
Complexity of Constructing de Bruijn Sequences
Effectiveness of Optimizations in Hardware
In Fig. 13 , we compare OcDeb with two other implementations of composited construction: the mesh based implementation (Fig. 5) and a serialized implementation that computes one row of the mesh in each clock cycle (i.e., the method by Mandal and Gong from [20] in hardware context), denoted as "MG". We fix n at 32 and let k increase from 4 to 512.
In Fig. 13 , the data points were acquired by setting a large clock period target, so that Design Compiler can minimize the area as much as possible by utilizing abundant slack time in achieving the clock period target. We observe that our OcDeb implementation is dramatically smaller than the mesh-based implementation, and it scales much better with k, which aligns with our complexity analysis. For OcDeb, the fluctuations seen are results of the variations of the cost to compute inputs to G after k-degree composition. The cost depends on the Hamming weight of k (Lemma 2). Because 4; 8; 16; 32; 64; 128; 256, and 512 have Hamming weight 1, local minimums appear at these locations. The MG implementations also scale well with k and have an area that is smaller than 2Â OcDeb's area. However, recall that MG's throughput (bits/cycle) is 1 kþ1 , therefore MG implementations have very low performance (bits/second).
In Fig. 14 , we measure the area and performance of OcDeb when the clock period target is set to 0.8 ns for all k. We also copy the minimum area of OcDeb from Fig. 13 to here. We can see that only a small overhead in area is incurred after Design Compiler succeeded in meeting the specified high clock speed target (1.25 GHz). The internal Qð2 nþk Þ À Jansen et al. [16] 3ðn þ kÞ À À Fredricksen [7] n þ k Qððk þ nÞ 2 Þ (25,000) Chang et al. [6] n þ k Vðk 2 Þ, Oðk 3 þ nkÞ (19,000) Mandal, Gong [20] n þ k Qðk 2 þ nkÞ 4,876 Ã OcDeb n þ k þ log ðnÞ Qðk log ðkÞ þ log ðnÞÞ 774
Results are given in terms of n and k where the period of the de Bruijn sequence is 2 nþk to enable comparison between the composited construction and other techniques. Act. : actual Est. : estimate. state size of OcDeb-k-n is ðn þ kÞ bits and it is an indicator of OcDeb's security level. Fig. 14 shows that it attains its high performance even as the security level increases.
Table of Detailed Results
Here we report more details of results on mesh-based, MG and our implementation of db-(32, 32) as well as two other larger instances of OcDeb, which are OcDeb-128-32 and OcDeb-512-32 in Table 4 . On the comparison of different implementations of db-(32, 32), we can see the module that computes mesh k n is not dominant anymore in our implemention. At a low optimization level, the area of OcDeb-32-32 is 6:2Â smaller than the mesh-based implementation, and its optimality is 56Â better than the MG implementation. Optimality (defined in Table 4 ) is a single metric we use that captures both area and performance. If we take a look at OcDeb-32-32, OcDeb-128-32 and OcDeb-512-32, we will notice that as k increases, the shift register becomes longer but the area cost of mesh k n increases faster than that of the shift register.
Parallel OcDeb-32-32 Results
Recall in Section 5, we elaborated that OcDeb is amenable to parallelization. Here, we demonstrate it with results of parallel OcDeb-32-32 at degree 2, 10, 15, 25 and 30 in Table 5 . For each parallelism, we experiment with Design Compiler by altering the clock period target and present the results that show the best area or best performance or best optimality.
In Table 5 , the highest performance we achieved is 25.4 Gbps while the throughput is 30 bits/cycle and the area is 4,787 GE. The highest optimality is 10.6 Mbps/GE which happens when the throughput is 10 bits/cycle, area is 1,570 GE and the performance is 16.7 Gbps. Because of the functional equivalence, OcDeb inherits their strong randomness properties. Our OcDeb family has two adjustable parameters, n and k. For an instance, OcDeb-k-n, it has an internal state of size ðn þ kÞ and generates a 2 nþk bit long de Bruijn sequence. The complexity of bit operations to generate one bit is Qðk log ðkÞ þ log ðnÞÞ while the previous best result is Qðk 2 þ nkÞ. We implemented the OcDeb family in VHDL and benchmarked it for ASIC 65 nm technology. We have shown OcDeb has significantly better results in hardware cost and performance than other implementations of the composited construction. OcDeb's area increases slowly and almost linearly as k increases and furthermore its performance is very insensitive to k. OcDeb-32-32 is an instance suitable for lightweight applications and it has area of 643 GE and performance of 1.45 Gbps, which is a 45Â better performance/area ratio than the serialized implementation proposed by Mandal and Gong. OcDeb-128-32 and OcDeb-512-32 are larger instances but more secure. They can have 1,886 GE, 1.25 Gbps and 7,304 GE, 1.25 Gbps, respectively. Our optimizations also enable efficient parallelization. As an example, the area of OcDeb-32-32 only increases from 643 to 1,612 GE when the throughput increases from 1 bit/cycle to 30 bits/cycle. By parallelization, OcDeb-32-32 can generate up to 25. 4 Gbps at the cost of 4,787 GE hardware. Guang Gong received the BS degree in mathematics, in 1981, the MS degree in applied mathematics, in 1985, and the PhD degree in electrical engineering, in 1990, all from Universities in China, a postdoctoral fellowship from the Fondazione Ugo Bordoni, in Rome, Italy, and spent the following year there. She was promoted to an associate professor with the University of Electrical Science and Technology of China. During 1995-1998, she worked with several internationally recognized, outstanding coding experts and cryptographers, including Dr. Solomon W. Golomb " For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
NIST Randomness Results
