The enumeration of two-dimensional Costas arrays is a problem with factorial time complexity and has been solved for sizes up to 29 using computer clusters. Costas arrays of higher dimensionality have recently been proposed and their properties are beginning to be understood. This paper presents, to the best of our knowledge, the first proposed implementations for enumerating these multidimensional arrays in GPUs and FPGAs, as well as the first discussion of techniques to prune the search space and reduce enumeration run time. Both GPU and FPGA implementations rely on Costas array symmetries to reduce the search space and perform concurrent explorations over the remaining candidate solutions. The fine grained parallelism utilized to evaluate and progress the exploration, coupled with the additional concurrency provided by the multiple instanced cores, allowed the FPGA (XC5VLX330-2) implementation to achieve speedups of up to 30× over the GPU (GeForce GTX 580).
Introduction
A two-dimensional Costas array (2DCA) of size N is a permutation f : N → N such that for every integer h, i, and j, where 1 ≤ h ≤ N −1 and
, implies that i = j. Thus, informally, a size N Costas array is defined as N × N matrix containing exactly N dots, where every row and column contain exactly one dot and the vectors joining each pair of dots are distinct. Figure 1 illustrates a Costas array of size N = 6, both as a matrix and a permutation. The figure also shows the array's difference triangle, which organizes the differences between the various permutation digits in N − 1 rows where each row corresponds to a fixed h. By definition, each row in the difference triangle of a Costas array must consist of unique numbers.
Their definition implies that Costas arrays have perfect autocorrelation (autocorrelation = 1), which makes them useful in communications where signals must be recoverable even in the presence of considerable noise. Costas arrays are useful in many security and communication applications, such as remote object recognition and optical communications [1] . Furthermore, some special cases of Costas arrays can be used for digital watermarking [2] .
Costas arrays with dimensions higher than two were introduced in 2008 by Drakakis [3] . These arrays maintain perfect autocorrelation, which broadens their applicability in optical communications, for example, 3D optical orthogonal codes [4] . Multidimensional periodic Costas arrays (MPCAs) over elementary Abelian groups, introduced by Moreno and Tirkel , add the property of being periodic over all dimensions. This extends their applicability to digital watermarking of video and combined video and audio, where higher-dimensionality codes are desired [2] . A formal definition and some of their properties are presented in Section 3. In this paper, we focus on the latter kind of multidimensional periodic Costas arrays (MPCAs) due to their richer application range.
The enumeration of 2D Costas arrays has been a topic of interest since their discovery by Costas in the 1960s [5] . With each new size enumerated, new properties and generation techniques may be discovered [6] . Ortiz-Ubarri et al. presented MPCA transformations and their first enumeration in [7] . Given their relatively new discovery, it is expected 2 International Journal of Reconfigurable Computing that the enumeration of MPCAs will, as with 2DCAs, help researchers improve their understanding.
Both 2DCAs and MPCAs can be generated using algebraic constructions based on finite fields like the Welch and Lempel-Golomb constructions [7] . Small sizes can be enumerated using hand computation, yet the only known method to guarantee complete enumeration is by exhaustive exploration. The search space for complete enumeration of 2DCAs grows factorially with N, thus computer-based exploration is the only practical approach for medium and large sizes. The most common approach for the enumeration of 2DCAs is to use a backtracking algorithm that incorporates symmetries and other observations to further prune the search space. This paper presents the first discussion of search space pruning techniques for MPCAs. Both the FPGA and GPU implementations utilize these methods to significantly reduce the search space. Nevertheless, the worst case timecomplexity is still factorial, requiring tremendous run times even for small cases of N and m (the dimension).
This paper discusses our implementations for the enumeration of (m + 1)-dimensional Costas arrays in GPUs and FPGAs and constitutes the first description of such an enumeration. We present the techniques chosen to prune the search space as well as the organization of our designs. Our FPGA implementation achieved speedups of up to 30 times faster than the GPU. Furthermore, the modules that were created as part of the design process can easily be adapted to other constraint satisfaction problems that use backtracking.
The rest of this paper is organized as follows. Section 2 presents the relevant previous work, while Section 3 defines (m + 1)-dimensional Costas arrays and some of their symmetries. Section 4 describes the backtracking algorithm and its use for the enumerations. Section 5 introduces several techniques that allow us to prune the search space during the enumerations. Sections 6 and 7 discuss the GPU and FPGA designs, respectively. Section 8 reports and discusses the results and Section 9 provides our conclusions.
Previous Work
Most recent enumerations of 2DCAs have been completed using general purpose processors [8, 9] . The latest enumerations have been achieved by deploying many computer clusters (in all, thousands of cores) to concurrently explore disjoint parts of the search space. For N = 28 and N = 29, the time per single CPU was determined to be 70 and 366.55 years, respectively.
To the best of our knowledge, the only reported FPGAaccelerated 2DCA enumerations have been [10, 11] . Devlin and Rickard implemented sizes N = 13 through 19 on a Xilinx Spartan-3 XC3S1000 running at 25 Mhz and extrapolated their results to sizes up to N = 32 using the same device as well as the Virtex-5 XC5VLX110 [10] . The extrapolated execution times for N = 28 and 29 were 9.26 and 48.45 years, respectively. Arce-Nazario and Ortiz-Ubarri compared the execution of 2DCA enumeration in an FPGA (Virtex 5-XC5VLX330-2) and a GPU (GeForce GTX 580). The FPGA implementation achieved speedups of up to 40× over the GPU and 4.44× over the fastest reported software implementation [9] . The first generalizations of Costas arrays were given by Drakakis in [12] . He defined various classes of multidimensional arrays by modifying the Costas property constraints and presented examples. Moreno (1)
International Journal of Reconfigurable Computing 3 As a shorthand method, we may also enumerate the elements in a Costas array (Z p ) m using the index mapping
The distinct difference property can be verified, in a manner similar to 2DCAs, by using difference matrices. The differences for a (Z p ) m array are organized into p m − 1 matrices. For instance, for m = 2, each difference matrix h = (h 0 , h 1 ) is p× p and contains at each cell (i, j) the result of w i+h0, j+h1 − w i, j . The cells for differences that involve position (0, 0) are represented using * . 
Addition and Multiplication (Modulo p m −1) Symmetries.
Two algebraic symmetries introduced by Moreno et al. can be used to significantly reduce the search space for MPCA enumeration [13] . 
Backtracking
Backtracking is a general algorithm for solving a computational problem by incrementally generating all possible solutions. The execution of a backtracking algorithm can be modelled as a search tree where every node is a partial solution. Moving forward corresponds to approaching a valid solution, and going backward corresponds to abandoning a partial candidate that cannot possibly generate a valid solution. * * * h = (1, 0) 
For the purpose of this discussion, we define a subper- m given a seed permutation P init . Figure 3 illustrates the steps in the computational tree of the backtracking approach, given the seed ( * , 0, 4, 1, 2, 7) for (Z 3 ) 2 .
Techniques for Pruning the Search Space and Efficient Evaluation of Candidate Arrays
MDCA symmetries can be leveraged to reduce the search space in their enumeration. For instance, it can be deduced from Theorems 4 and 6 that backtracking exploration must proceed only through permutations lexicographically smaller than ( * , 0, p m /2 , p m /2 + 1, . . .).
(1) MPCAs with the * in any position other than (0, 0) are generated by periodic shifts of the arrays with the * in position (0, 0). These include all the geometric symmetries (horizontal flip, vertical flip, and 90 0 rotations) of the MPCAs. (3) Any permutation starting with * , followed by a nonzero element can be obtained by adding to a permutation that starts * , 0, that is, using Theorem 6. For example, ( * , 1, 5, 4, 0, 7, 6, 3, 2) is obtained from the addition of 1 to ( * , 0, 4, 3, 7, 6, 5, 2, 1).
Thus, the required exploration is reduced from (p 2 − 1)! to approximately (p 2 − 2)!/2 permutations in the worst case.
Evaluation of Candidate Arrays.
Computationally, we determine if a permutation is an MPCA by using the distinct difference property. In the backtracking algorithm, every time the algorithm moves forward by adding a new element p k to the permutation, the new differences generated by subtracting p k and p 1 , . . . , p k−1 are added to the corresponding arrays. Thus, the difference arrays fill up as the permutation in the backtracking tree grows and deplete as the backtracking algorithm moves backward.
From the MPCA definition it can be deduced that only half of the difference matrices must be maintained. To understand this, let us define the negative of a distance vector. The difference matrix for a distance vector −h accumulates the negatives of the differences collected by the matrix for the distance vector h. This is illustrated in Figure 4 . The current subpermutation being evaluated ( * , 0, 4, 1, 2, 7) has produced the differences shown in the matrices. Notice how the differences in h = (0,
Furthermore, using the index mapping 
Proof. Without loss of generality, we consider m
which implies one of two cases.
(1) j > j + h 1 . This implies that h 1 ≥ p − j, in which case, for the negative, h 1 < p − j and j < j + h 1 . Thus, the negative of this case can be found using a difference covered by b > a.
(2) j = j + h 1 and h 0 > i + h 0 . This implies that h 0 ≥ p − h 0 , in which case, for −h, h 0 < p − i and i < i + h 0 . Thus, the negative of this case can be found using a difference covered by b > a.
Example 13. For the difference matrix h
In our implementations, the difference matrices are managed as follows.
(i) A hash table of size p m − 1 is used for each of the (p m − 1)/2 matrices to keep track of its differences.
(ii) Whenever the permutation length increases (P X to P X +1 ), the differences between the last added digit and the rest of the digits are computed and compared to the contents of the corresponding hash tables. A hit in any of the tables indicates a repeated difference and hence a non-Costas permutation. Otherwise, the differences are registered in the corresponding tables. * * * h = (1, 0) (iii) Whenever the permutation length decreases (P N to P X −1 ), the differences between the dropped digit and the rest of the digits are deleted from the hash tables.
GPU Design
We perform our computations in a GeForce GTX 580 with 16 multiprocessors, each with 32 cores at 1.54 GHz clock rate, 1.5 GB of global memory, and 48 K of shared memory using the CUDA parallel computing architecture. Similar to many other computational problems where GPUs are used to accelerate algorithms in parallel, our implementation essentially decomposes the enumeration into many disjoint subspaces, which are deployed as threads to the GPU. Figure 5 illustrates the workflow of the GPGPU implementation. The Host (CPU) generates a set of K subpermutations P N m of size m < N that comply with the Costas property. The set is then passed to the Device (GPU), where for each subpermutation a thread is generated to complete the exploration, that is, each thread determines all (if any) Costas arrays that begin with its given subpermutation. While the threads are executing, the Host is generating the next set of K subpermutations. When all threads complete, the results are passed to the Host, and the next set of K subpermutations is transferred to the GPU.
Two quite similar versions of Algorithm 1 run in the Host and each of the threads of the GPU. In the Host, Algorithm 1 is used to generate all subpermutations of length M compliant with the Costas property. As each subpermutation is found it is added to an array of size K of the data type shown in Algorithm 2 . When the array is full it is copied to the GPU global memory and the K CUDA threads are deployed to process the copied subpermutations.
Each CUDA thread runs a version of Algorithm 1 that takes one of the subpermutations as P init and copies any found Costas arrays back to the GPU global memory. When all the GPU threads are done, the found Costas arrays of size N are copied to the Host. The number of subpermutations (K) to be passed to the GPU is determined by the number of cores of the GPU. In our experiments we obtained the best performance with K = number of cores × 128.
FPGA Design
Many FPGA implementations obtain their impressive performance by exploiting fine-grained parallelism through deep, custom pipelines. However, backtracking is in essence a serial process, that is, generates permutation, then evaluates, then accepts or backtracks. Given this scenario, we opted to implement a highly tuned, low-resource serial MPCA enumeration (MPCAEn) core that can be instantiated many times inside the FPGA. Hence, the acceleration provided by our design comes mainly from two factors: (a) the rapid generation of candidate permutations and their evaluation within each core and (b) the high number of cores working in parallel on subsets of the enumeration. Figure  6 illustrates the basic blocks of our MPCAEn core. A shift register is used for constructing or reversing the candidate permutation. The candidate permutation is constructed by shifting left and concatenating a new digit in the right-most position. As a permutation is generated, its compliance with the Costas array definition is verified by the Costas evaluation block. If the candidate complies, the next permutation P N m+1 is generated by shifting left and concatenating the lowest available digit to the right-most position. If not, then one of two cases may occur.
Backtracking Functionality and Array Evaluation.
(1) There is an available digit d that is higher than the rightmost digit. In this case, the rightmost digit is substituted by d. For example, ( * , 0, 4, 1, 2, 6) is evaluated and does not comply. Since the available digits are 3, 5, and 7, then 6 is substituted by 7 to compose the next candidate permutation ( * , 0, 4, 1, 2, 7).
(2) There are no digits available higher than the rightmost digit. In this case the shift register is shifted right and the digit that is shifted out is added to the available digits. This is repeated until there is an available digit that is higher than the current rightmost digit, at which case we perform the substitution described in the first case. For example, ( * , 0, 4, 3, 7) is determined to not comply with the MPCA difference property; the available digits are 1, 2, 5, and 6 but none of them is higher than 7 so the permutation must backtrack to ( * , 0, 4, 3). The available digits are now 1, 2, 5, 6, and 7, thus 3 is substituted by 5 to form the next permutation ( * , 0, 4, 5). Figure 7 illustrates the operation of the MPCA evaluation block. When a new digit is added to the current permutation, for example, p 4 in the figure, the differences between the new digit and the rest are computed. The negatives of the differences are also computed, since they might be used to update some of the difference matrices (as explained in Section 5.1). Depending on the index of the newly added digit, the encoder/mux block routes the results of the differences to the corresponding hash tables. For the example in the figure, the result from Diff 2 corresponds to p 4 In the illustration, the permutation ( * , 0, 4, 1, 2, 7) has just been generated. The hash table keeps record of the digits that are in use in the current permutation (indicated by 1) and outputs two digits, the lowest available and the lowest available higher than the current input.
For 2DCAs, the deeper rows of the difference triangle are responsible for the detection of only a small percentage of the rejected arrays, as confirmed in [10, 11] . Our enumeration core for MPCAs is parameterizable in the number of distance vectors that are used to evaluate whether an array is an MPCA. The designer may choose to implement less than (p 2 − 1)/2 distance vectors to save FPGA resources. If so, the few false positives that will come out from the FPGA enumeration can be eliminated in software once they are communicated to the general purpose processor (GPP). Figure 8 shows the amount of FPGA LUTs required by the MPCAEn core for (Z 5 ) 2 and (Z 7 ) 2 , implementing various amounts of vector differences. LUTs are the most strained resource in the MPCAEn implementation (versus registers). The resource utilization results were obtained from the synthesis process using Xilinx ISE 11.5 targeting a Virtex5-XC5VLX330-2 FPGA. It was found through experimentation that keeping track of more than 7 differences for (Z 5 ) 2 and 14 differences for (Z 7 ) 2 did not significantly reduce enumeration times. The enumerators that were implemented for obtaining the results used those amounts of differences, for example, 7 for (Z 5 ) 2 and 14 for (Z 7 )
Resource Utilization and Core Organization.
2 . Since resource utilization per enumerator is low, multiple MPCAEn modules were instantiated in the FPGA as illustrated in Figure 9 . The transfer of subpermutations and collection of results is performed by the transfer/collector module, which is connected through low-width data lines to the enumerators in order to save connection resources. MPCAs are so scarcely found during the enumeration that, regardless of the bandwidth between enumerators and transfer/collector module, these connections never become a bottleneck.
Results and Discussion
To compare GPU and FPGA performance, we implemented sizes 3 × 3, 5 × 5, and 7 × 7 of our MPCA enumeration designs in GPU (GeForce GTX 580) and FPGA (one Xilinx XC5VLX330-2 device of the four provided in a Convey HC-1 Server). Table 1 shows the number of found MPCAs starting with ( * , 0), the execution times for both designs as well as the growth rate as a function of p, and the speedup of FPGA versus GPU. Results for 3 × 3 and 5 × 5 are wall clock times while 7 × 7 is the worst case estimation of the run time based on sample runs.
We attribute the speedup mainly to the fact that the FPGA implementation was able to exploit the following two levels of parallelism, whereas the GPU could only make use of the highest level:
(1) coarse-grained level parallelism, which is achieved by splitting the solution space into multiple disjoint sets;
(2) fine-grained parallelism at the level of individual candidate evaluations, that is, the operations for the evaluation of each (sub)permutation are performed in parallel (as discussed in Section 7 and illustrated in Figure 7 ). An analogous, low-level technique could be used in general purpose processors by utilizing SIMD instructions. However, CUDA programs compile to the PTX instruction set, which does not contain SIMD instructions.
The GPU (Z 3 ) 2 was greatly overshadowed by the data transfer times between FPGA/GPP; therefore, we only consider for fair comparisons the cases (Z 5 ) 2 and (Z 7 ) 2 . For these larger cases we observe FPGA versus GPU speedups similar to those reported for 2DCAs in [11] . The enumeration of MPCAs exhibits a slower growth rate (approximately 3) per additional permutation digit as compared to the reported for two-dimensional Costas arrays (approximately 5) [8, 11] . We conjecture that the reason for the slower growth rate is that the conditions imposed in the MPCA definition are more strained, thus eliminating more candidate subpermutations earlier than the case of 2D Costas arrays.
All enumerated MPCAs turned out to be either Welch Costas constructions as presented in [13] or their symmetries introduced in [7] , that is, no spurious MPCAs were found similar to some sizes of 2DCAs. These results support the conjecture that MPCAs (of all sizes and dimensions) are fully characterized by multidimensional Welch Costas arrays along with their symmetries.
Conclusions
In this work, we presented designs for the enumeration of multidimensional periodic Costas arrays in GPUs and FPGAs. Also, we introduced several MPCA symmetries and showed how they are used in our designs to significantly prune the search space. Both GPU and FPGA implementations rely on the concurrent exploration of multiple disjoint areas of the search space. In the GPU implementation, hundreds of threads are deployed to complete the search using the many GPU cores. For the FPGA, a multidimensional periodic Costas arrays enumeration core was designed taking into consideration pruning techniques while maintaining a low use of logic resources. Multiple cores were instantiated Figure 7 : Detail of the Costas evaluation block. multiple cores allowed the FPGA implementation to achieve speedups of up to 30× over the GPU. The implementations completed the first reported enumeration for MPCAs. Furthermore, the MPCA properties and symmetries discovered throughout the process help improve our understanding of these new structures.
