We propose parallel constructs for circular correlation that are suitable for use in GPS receivers. We have preliminary Simulink models that yield the correct results for the proposed architectures. Using field-recorded GPS satellite data, we also developed and validated a MATLAB implementation of the MIT Quicksynch algorithm for fast circular correlation, which exploits the sparsity of the circular correlation output. The corresponding Simulink models are under development. We developed a new parallel one-dimensional FFT algorithm and implemented new parallel cyclic convolution algorithms.
Polytechnic University of Puerto Rico, 377 Ponce de Leon Ave.
Since the DFT can be written in the form of a cyclic convolution, we used our sectioned cyclic convolution algorithm for parallel implementations of a one-dimensional DFT. Such implementations are slower than serial FFT implementations but can access the larger memory available in clusters. Although less efficient than the traditional row-column implementation, and also length-constrained, our method was able to handle signal lengths that are not suitable for the row-column algorithm.
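One well-known way to see that a DFT can be written as a cyclic convolution is the chirp (Bluestein) identity nk = (n^2 + k^2 - (k - n)^2) / 2. The sketch below is illustrative only and is not the sectioned algorithm developed in this project. It assumes an even length N, for which the chirp sequence is N-periodic and the convolution is exactly cyclic of length N. All function names are hypothetical.

```python
import cmath


def cyclic_convolve(a, b):
    """Direct (time-domain) cyclic convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def dft_direct(x):
    """Reference DFT computed straight from the definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_via_cyclic_convolution(x):
    """DFT of even length N expressed as one length-N cyclic convolution."""
    n = len(x)
    if n % 2:
        raise ValueError("this sketch assumes an even length")
    # chirp[m] = exp(-i*pi*m^2/N); m^2 is reduced mod 2N to keep the angle small.
    chirp = [cmath.exp(-1j * cmath.pi * ((m * m) % (2 * n)) / n) for m in range(n)]
    pre = [x[m] * chirp[m] for m in range(n)]      # pre-multiply the input
    kernel = [c.conjugate() for c in chirp]        # convolution kernel
    conv = cyclic_convolve(pre, kernel)            # the cyclic convolution step
    return [chirp[k] * conv[k] for k in range(n)]  # post-multiply the output
```

Any cyclic convolution routine, serial or parallel, could in principle replace `cyclic_convolve` in this formulation.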
Time domain-based implementations of cyclic convolution are dramatically slower but, unlike frequency domain-based algorithms, give the exact integer output when the input is a sequence of integers. 
Objectives
Original Project Objectives
a) Determine the relative performance of the newly developed parallel cyclic convolution algorithms described in our proposal for computing very long cyclic convolutions [1], [2], [5], [15]. (Completed)
b) Establish performance benchmarks for the newly introduced architectures for parallel computation of long one-dimensional cyclic convolutions within different cluster and multicore configurations [4], [11], [12]. (Completed)
c) Identify the limiting computational factors (length, memory, inter-processor communication and the like). As the problem length increases, monitor which factor becomes the computational bottleneck [1]. (Ongoing)
d) Introduce, at the HPC laboratory, the use of MATLAB, the Parallel Computing Toolbox and the MATLAB Distributed Computing Server. These software packages were selected to provide a rapid prototyping and entry-level research environment. Implementations in C++ will follow [1], [2], [5], [13]. (Completed)
e) Give a proof for two operator-oriented formulas suitable to unfold the data flow graphs of cyclic and non-cyclic (linear) shifts. These formulas were developed and published under a previous ARO grant [16]. (Completed)
f) Study and formulate parallel one-dimensional Discrete Fourier Transform implementations based on the proposed parallel cyclic convolution architectures. Compare and benchmark against traditional approaches [4], [11], [12]. (Completed)
g) Increase the number of minority students who are involved in, or aware of, issues relating to DSP algorithms and parallel processing. (Ongoing)
Future Work Objectives
a) All of our structures for parallel cyclic convolution can be applied to parallel circular correlation. Circular correlation is used, among other applications, in GPS receivers. We are proposing several structures that complement the ones presented in 2012 by a group from the École Polytechnique (Lausanne), several of which are essentially the same as those we had originally proposed. These parallel algorithms have the potential for a performance increase of 2-fold or more [8], [9], [10]. (Ongoing Work)
b) We are involved in the implementation and validation of the MIT-developed Quicksynch algorithm for fast circular correlation in GPS SDR systems. The algorithm is based on the sparsity of the correlation output. We developed MATLAB code validated with actual GPS satellite signal records. Simulink models are under development. As a byproduct of our project, one of our student research assistants proposed, and was selected, to implement this same algorithm on an open source platform under the auspices of the Google Summer of Code 2014 program [8], [9], [10]. (Ongoing Work)
Significance
Large Data Sets: Today's technology demands the processing of ever larger data sets. One particular problem when processing large signals is how to split a large processing operation, such as cyclic convolution, into smaller subtasks in order to access the much larger memory available in clusters. Our sectioned cyclic convolution algorithm can provide parallel frequency domain-based implementations as well as parallel time domain-based implementations of cyclic convolution. Time domain-based implementations can be further accelerated by using our serial-recursive approach in the parallel subsections. Parallel implementations of FFT-based cyclic convolution are likely to be slower than the serial FFT-based implementation but have the potential to access the much larger memory available in clusters. Parallel implementations of cyclic convolution using time domain-based algorithms, which guarantee the exact integer result when performing purely integer convolution, are likely to be faster than their serial time domain-based counterparts.
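The idea of splitting one long cyclic convolution into shorter, independent sub-convolutions can be sketched with a standard stride (polyphase) decomposition. This is an assumption made for illustration, since the report does not spell out the sectioning used in the project, and the function names are hypothetical. For N = P * M, the P subsequences of each input taken at stride P yield P * P cyclic convolutions of length M, which can be computed independently and then combined with at most a one-sample cyclic shift.

```python
def cyclic_convolve(a, b):
    """Direct (time-domain) cyclic convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def sectioned_cyclic_convolve(x, h, p):
    """Length-N cyclic convolution built from p*p sub-convolutions of length N/p."""
    n = len(x)
    if len(h) != n or n % p:
        raise ValueError("inputs must have equal length divisible by p")
    m = n // p
    x_sub = [x[i::p] for i in range(p)]  # stride-p subsequences of x
    h_sub = [h[j::p] for j in range(p)]  # stride-p subsequences of h
    y = [0] * n
    for r in range(p):                   # output subsequence y[r::p]
        acc = [0] * m
        for j in range(p):
            i = (r - j) % p
            # Each short convolution is an independent task that a
            # separate worker could compute.
            sub = cyclic_convolve(h_sub[j], x_sub[i])
            shift = 1 if j > r else 0    # index wrap-around costs one sample
            for t in range(m):
                acc[t] += sub[(t - shift) % m]
        y[r::p] = acc
    return y
```

With integer inputs every sub-convolution stays in exact integer arithmetic, so the combined result matches the direct computation exactly.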
Round-off Errors in Frequency Domain-Based, Purely Integer, Cyclic Convolution: Another significant problem, especially when dealing with very long sequences, is that floating-point errors affect FFTs to the point that an FFT-based cyclic convolution of integer sequences may not give the exact integer results. This is a problem in applications such as multiplication of large integers, computer algebra packages, computational number theory and others. Our parallel, or serial-recursive, sectioned implementation of cyclic convolution can use time domain-based algorithms for the parallel, or serial-recursive, sub-convolution stages and therefore guarantees the exact integer output when performing purely integer circular convolution. If such time domain-based algorithms are sufficiently accelerated, they may be considered as an alternative for lengths where other error-avoiding fast cyclic convolution techniques, such as the ones using Number Theoretic Transforms, are not applicable or are slower. Furthermore, if our techniques are used for a frequency domain-based implementation of cyclic convolution, the round-off errors are reduced because of the shorter sequences in the subsections.
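The contrast between exact time-domain arithmetic and floating-point transform arithmetic can be illustrated with a small sketch. It is not the project's benchmark code: a plain O(N^2) DFT stands in for an FFT, the function names are hypothetical, and the inputs are chosen so that one exact output exceeds the 53-bit significand of a double.

```python
import cmath


def cyclic_convolve_exact(a, b):
    """Time-domain cyclic convolution; exact for Python integers of any size."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def dft(x, sign=-1):
    """Plain DFT (sign=-1) or unnormalized inverse DFT (sign=+1) in doubles."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def cyclic_convolve_float(a, b):
    """Frequency-domain cyclic convolution, rounded back to integers."""
    n = len(a)
    spectrum = [u * v for u, v in zip(dft(a), dft(b))]
    return [round(v.real / n) for v in dft(spectrum, sign=+1)]


small_a, small_b = [1, 2, 3, 4], [5, 6, 7, 8]
big = [3**20, 1, 0, 0]  # the exact output contains 3**40, an odd 64-bit integer

# For small values the rounded floating-point result agrees with the exact one.
small_ok = cyclic_convolve_float(small_a, small_b) == cyclic_convolve_exact(small_a, small_b)
# 3**40 is not representable as a double, so the rounded result cannot match.
big_ok = cyclic_convolve_float(big, big) == cyclic_convolve_exact(big, big)
```

Here `small_ok` is true and `big_ok` is false, which is the failure mode that time domain-based sub-convolutions avoid.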
Parallel implementation of one-dimensional DFTs:
The most common technique used to implement a one-dimensional DFT in clusters or distributed environments has certain length constraints. We are proposing a new technique, which is not as efficient but can handle certain signal lengths that are not suitable for the standard row-column approach.
Applications to Parallel Circular Correlation (New Research Direction):
Circular correlation is a widely used operation, most conspicuously in GPS receivers. Parallelization can result in a performance increase of 2-fold or more in FPGA implementations. It turns out that our structures for cyclic convolution are readily applicable to this task. We intend to propose several structures that complement the ones presented in 2012 by a group from the École Polytechnique (Lausanne).
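The link between the two operations is that the circular correlation of a received signal with a code equals the cyclic convolution of the signal with the time-reversed code. The sketch below illustrates this under simplifying assumptions: a 31-chip maximal-length sequence stands in for a real GPS code, there is no noise or Doppler shift, and all function names are hypothetical.

```python
def m_sequence_31():
    """31-chip +/-1 code from the LFSR recurrence s[n+5] = s[n+2] XOR s[n]."""
    bits = [1, 0, 0, 0, 0]
    while len(bits) < 31:
        bits.append(bits[-3] ^ bits[-5])
    return [1 - 2 * bit for bit in bits]  # map bit 0 -> +1, bit 1 -> -1


def cyclic_convolve(a, b):
    """Direct (time-domain) cyclic convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def circular_correlate(x, c):
    """r[lag] = sum over k of x[k] * c[(k - lag) mod N]."""
    n = len(x)
    return [sum(x[k] * c[(k - lag) % n] for k in range(n)) for lag in range(n)]


def correlate_via_convolution(x, c):
    """Same correlation, written as a cyclic convolution with the reversed code."""
    n = len(c)
    c_reversed = [c[(-m) % n] for m in range(n)]
    return cyclic_convolve(x, c_reversed)


code = m_sequence_31()
delay = 11
received = [code[(k - delay) % 31] for k in range(31)]  # delayed copy of the code
corr = circular_correlate(received, code)
estimated_delay = corr.index(max(corr))                 # position of the peak
```

Because the correlation reduces to a cyclic convolution, any structure that parallelizes cyclic convolution can be reused for the correlator.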
Accomplishments
Theoretical Work, Code Development and Benchmarking
Parallel One-Dimensional DFT: Developed a novel algorithm for parallel implementation of a one-dimensional Discrete Fourier Transform. This algorithm can tackle signal lengths that are complementary to those amenable to an implementation using the more efficient "row-column" approach (standard method). Parallel implementations of Fast Fourier Transforms (FFTs) are in general slower than the serial counterparts but they allow the processing of larger sequences [4] , [11] , [12] . (To be submitted for publication)
Reduction of Round-off Errors in Frequency Domain-Based Cyclic Convolution:
Using our cyclic convolution structures we were able to lower, or eliminate, round-off errors when using Fast Fourier Transforms (FFTs) to compute cyclic convolution of very large sequences. C++ and MATLAB implementations were benchmarked [2], [5]. (Published)
Time Domain-Based Parallel Cyclic Convolution: We implemented and benchmarked parallel cyclic convolution algorithms based on our novel structures. These implementations substantially accelerated time domain-based cyclic convolution algorithms. FFT-based implementations of cyclic convolution cannot be accelerated using this method. Time domain-based cyclic convolution algorithms are slower than FFT-based algorithms but guarantee the exact integer result for purely integer cyclic convolution, which is important in certain areas such as multiplication of large integers, computer algebra, cryptology, computational number theory, experimental mathematics and others [1], [2], [5], [15]. (To be submitted for publication)
Mixed Mode MPI-OpenMP versus MPI Implementations:
We benchmarked a mixed-mode MPI-OpenMP C++ implementation of parallel cyclic convolution against a direct MPI implementation. In this case study, even though we matched the algorithm structure in the mixed-mode approach to the multicore node architecture of the cluster, the direct MPI approach proved to be slightly better [1], [5]. (Published)
New Research Direction (Parallel Correlators for GPS Receivers):
All of our structures for parallel cyclic convolution can be applied to parallel circular correlation. Circular correlation is used, among other applications, in GPS receivers. We developed parallel algorithms that are complementary to the ones presented in 2012 by a group from the École Polytechnique, Lausanne (several of which are the same as those we had originally proposed). Specifically, we can propose a variant that avoids the use of numerically controlled oscillators, which they use in their constructs, and we will propose further complementary architectures. We have preliminary Simulink models validated with modeled data, and we are developing Simulink models that will work with field-recorded GPS signal data [8], [9], [10]. (To be submitted for publication)
Further Code Development (GPS Correlators):
We developed and validated MATLAB code, Simulink models and libraries that implement GPS correlators using the standard baseline GPS algorithm. These serve as a testbed and as a reference for comparing accuracy and performance with the parallel correlator constructs that we are proposing. We also wrote a MATLAB implementation of an algorithm developed at MIT (Quicksynch) for GPS SDR receivers, which is based on the sparse nature of the correlator output [8], [9], [10].
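The way a sparse correlator output can be exploited is sketched below. Folding (aliasing) both sequences into N/p samples and correlating the short sequences gives exactly the folded version of the full correlation, so a single dominant peak survives and fixes the delay modulo N/p. The remaining p candidates are then checked directly. This is a toy illustration and not the project's MATLAB code: a hard-coded 15-chip maximal-length sequence stands in for a real GPS code, noise and Doppler are ignored, and the function names are hypothetical.

```python
# 15-chip m-sequence (recurrence s[n+4] = s[n+1] XOR s[n]), mapped to +/-1.
BITS = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
CODE = [1 - 2 * bit for bit in BITS]


def circular_correlate(x, c):
    """r[lag] = sum over k of x[k] * c[(k - lag) mod N]."""
    n = len(x)
    return [sum(x[k] * c[(k - lag) % n] for k in range(n)) for lag in range(n)]


def fold(seq, p):
    """Alias a length-N sequence to length N/p by summing its p blocks."""
    m = len(seq) // p
    return [sum(seq[t + a * m] for a in range(p)) for t in range(m)]


def estimate_delay_folded(x, c, p):
    """Find the delay from a short folded correlation plus p direct checks."""
    n = len(x)
    m = n // p
    folded_corr = circular_correlate(fold(x, p), fold(c, p))
    residue = folded_corr.index(max(folded_corr))  # delay modulo N/p
    candidates = [residue + a * m for a in range(p)]

    def lag_value(lag):
        # One full-length correlation value, computed for a single lag only.
        return sum(x[k] * c[(k - lag) % n] for k in range(n))

    return max(candidates, key=lag_value)
```

For example, `estimate_delay_folded(received, CODE, 3)` correlates 5-sample folded sequences and then evaluates only 3 of the 15 possible lags at full length.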
Improved Digit Reversal Routine for MATLAB: As part of the code development effort, we wrote fast, efficient stride permutation routines in MATLAB. We used these routines to implement a digit reversal that, for large sequences, runs faster than the native MATLAB digit reversal command. It also handles larger sequence lengths and larger radices than the MATLAB native implementation allows [14]. (To be submitted to a student conference)
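The relation between stride permutations and digit reversal can be sketched as follows. This is a generic recursive construction written in Python for illustration, not the MATLAB routine developed in the project, and the function names are hypothetical. A stride-r permutation moves the least significant base-r digit of each index to the most significant position, and repeating this inside each of the r resulting blocks reverses all the digits.

```python
def stride_permute(x, s):
    """Stride-s permutation: gather x[0::s], then x[1::s], and so on."""
    return [value for start in range(s) for value in x[start::s]]


def digit_reverse(x, radix):
    """Reorder x (length radix**t) into base-radix digit-reversed order."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % radix:
        raise ValueError("length must be a power of the radix")
    block = n // radix
    strided = stride_permute(x, radix)
    out = []
    for b in range(radix):
        # Recurse on each block to reverse the remaining digits.
        out.extend(digit_reverse(strided[b * block:(b + 1) * block], radix))
    return out


def digit_reverse_index(i, radix, digits):
    """Reference: reverse the base-radix digits of a single index."""
    reversed_index = 0
    for _ in range(digits):
        i, d = divmod(i, radix)
        reversed_index = reversed_index * radix + d
    return reversed_index
```

For radix 2 and length 8, `digit_reverse(list(range(8)), 2)` gives the familiar bit-reversed order `[0, 4, 2, 6, 1, 5, 3, 7]`.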
Further Theoretical Work: Development of an induction proof for two operator-oriented formulas suitable to unfold data flow graphs of cyclic and linear shifts. This proof supports the correctness of such formulas, developed and published while working in a previous ARO project [16] .
Maintenance and Upgrade of our Hardware/Software Testbed
Hardware (Cluster): Updated and made operational a 64-processor, 16-node Dell cluster purchased under a previous ARO grant. The cluster is dated but adequate for benchmarking and scalability studies. An operations manual was developed [13].
Hardware (GPU):
The cluster was further upgraded by installing four additional nodes with GPU processors. This particular architecture will be used for further research [13] .
Software (MATLAB, C++):
A 64-worker MATLAB Distributed Computing Server was purchased and installed in the cluster. We can also use C++ in the cluster [13] .
Software (MATLAB):
The MATLAB Parallel Computing Toolbox was installed on two multi-core Dell servers (4 and 12 processors), the cluster master node, and 3 multi-core laptops. These computers are to be used as MATLAB clients for the cluster and as multi-core testing platforms. We are gaining expertise in the use of MATLAB for parallel DSP algorithm development.
Collaborations
We have an internal collaboration, within our own institution, with a separately funded group of researchers, to whom we make available our expertise, hardware resources and MATLAB-based parallel processing capabilities. The research is part of a larger project in our Plasma Physics Lab, funded by NRC, and concerns the development of a model to study electric particle containment within the plasma chamber.
Conclusions
Time Domain-Based Parallel Cyclic Convolution: Our benchmarks show that our parallel implementation of cyclic convolution, combined with our serial-recursive implementation, is suitable for accelerating time domain-based cyclic convolution algorithms. The serial-recursive implementation on its own can also accelerate time domain-based cyclic convolution implementations. Time domain-based implementations guarantee the exact integer result when performing purely integer convolution, which is important in certain areas such as multiplication of large integers, computer algebra, cryptology, computational number theory, experimental mathematics and others.
Reduction of Round-off Errors in Frequency Domain-Based Cyclic Convolution:
Frequency domain-based cyclic convolution algorithms, as expected, cannot be accelerated using our parallel, recursive or combined approach. If these approaches are used anyway, we found that the round-off errors introduced by the FFTs are reduced (because of the shorter sequences in the subsections) in exchange for a reasonable decline in overall performance. This could be important in certain applications where large round-off errors in purely integer circular convolution are not acceptable.
Parallel One-Dimensional DFT: We developed a novel algorithm for the parallel implementation of a 1-D DFT in cluster architectures or distributed environments. We coded and benchmarked this algorithm. Although it is not as efficient as the established technique, it expands the range of signal lengths that can be handled. It can therefore be considered complementary to the traditional procedure, which maps a 1-D DFT to a 2-D DFT followed by a row-column decomposition.
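For reference, the traditional row-column procedure can be sketched as follows. This is the textbook mapping for a composite length N = N1 * N2, shown for comparison only and not the new algorithm developed in this project. A plain DFT stands in for the short FFTs, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Plain DFT from the definition; stands in for a short FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_row_column(x, n1, n2):
    """Length n1*n2 DFT via n2 DFTs of length n1, twiddles, then n1 DFTs of length n2."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Step 1: independent length-n1 DFTs over the stride-n2 subsequences.
    cols = [dft(x[c::n2]) for c in range(n2)]  # cols[c][k1]
    # Step 2: twiddle factors exp(-2*pi*i*c*k1/N).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: independent length-n2 DFTs across the columns for each k1.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[c][k1] for c in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

The DFTs in steps 1 and 3 are mutually independent, which is what makes this mapping attractive on clusters. The requirement that N factor conveniently as N1 * N2 is one example of a length constraint.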
New Research Direction (Parallel Correlators for GPS Receivers):
All of our structures for parallel cyclic convolution can be applied to parallel circular correlation. Circular correlation is used, among other applications, in GPS receivers. We developed parallel algorithms that are complementary to the ones presented in 2012 by a group from the École Polytechnique, Lausanne (several of which are the same as those we had originally proposed). Specifically, we can propose a variant that avoids the use of numerically controlled oscillators, which they use in their constructs, and we can provide further complementary architectures.
Further Code Development (GPS Correlators):
We developed MATLAB code, Simulink models and libraries to implement GPS correlators. These serve as a testbed to validate the parallel correlator constructs that we are proposing. We also wrote a MATLAB implementation of an algorithm developed at MIT for GPS software-defined radio (SDR) receivers, which is based on the sparse nature of the correlator output.
Preservation and Upgrade of our Hardware Assets:
We updated a cluster from a previous DoD grant and installed the MATLAB Distributed Computing Server. In addition, we added four GPU nodes to this cluster for future research. The cluster itself is dated but useful for benchmarking, scalability studies and educational purposes. We also designated several multi-core computers for benchmarking and research using the MATLAB Parallel Computing Toolbox.
Students:
We have had several graduate and undergraduate research assistants who have gained valuable experience. Two of our undergraduate research assistants were awarded summer internships at NIST, and one was awarded an internship with the Google Summer of Code to develop and expand, on an open source platform, the same GPS algorithms that we implemented in MATLAB and Simulink. All graduated students are working or pursuing graduate studies.
Future Work
Complete our research regarding the development of novel parallel circular correlators with applications to GPS receivers and other potential uses.
Continue working on the development of MATLAB code, Simulink models and libraries to test our proposed circular correlation constructs with application to SDR GPS receivers.
Assess the viability of implementing an MIT-developed circular correlator, based on the sparse characteristics of the correlator output, in combination with our parallel circular correlator structures, possibly using FPGA co-processors for added performance.
Use our cluster architecture that includes 4 nodes with GPUs for further experimentation.
Keep involving students and interact with faculty to promote algorithm development and parallel processing in our institution.
