ABSTRACT The use of practically non-repeating spreading codes to generate sequence-based spread spectrum waveforms is a strong method to improve transmission security, by limiting an observer's opportunity to cross-correlate snapshots of the signal into a coherent gain. Such time-varying codes, particularly when used to define multi-bit resolution arbitrary-phase waveforms, present significant challenges to the intended receiver, who must synchronize acquisition processing to match the time-varying code each time it changes. This paper presents a series of options for optimizing the traditional brute-force matched-filter preamble correlator for burst-mode arbitrary-phase spread spectrum signals, achieving significant computational gains and flexibility, backed by measurable results from hardware prototypes built on an Intel Arria 10 Field Programmable Gate Array (FPGA). The most promising option requires no embedded multipliers and reduces the total hardware logic by more than 76%. Extensions of the core fallthrough correlator techniques are considered to support low-power asynchronous reception, underlay-based physical layer firewall functions, and Receiver-Assigned Code Division Multiple Access (RA-CDMA) protocols in Internet of Things (IoT)-caliber devices.
I. INTRODUCTION
The design of burst-mode communication systems presents additional challenges over that of a standard continuous data link, in particular due to the need to re-acquire the signal on a burst-by-burst basis. In low-power devices, such as those suitable for Internet of Things (IoT), burst-mode waveforms traditionally employ techniques to make the acquisition preamble as easy to receive as possible, typically by embedding pilot tones [1], repeated cyclic prefixes [2], cyclic autocorrelation functions [3], soft-handoff between spreading codes [4], Barker-sequence / short preamble repetition [5], maximal-likelihood estimation [6], and/or variations of matched-filter techniques [7], [8]. Virtually all of these approaches rely on an inherent cyclostationary signal feature of the preamble bursts, facilitating blind detection and/or exploitation by an unintended receiver.
All signals considered in this paper are digital chaotic sequence-based arbitrary-phase spread spectrum waveforms with optional chip amplitude shaping, most of which use practically non-repeating spreading codes designed to eliminate cyclostationary signal content. Reception of these signals is more complicated, using methods that adapt some aspects of the matched-filter/coherent receiver processing architectures for specific waveforms and/or use cases [9]-[12]. Further, most of these techniques are computationally intensive, making them difficult to implement in a low-power device.
The associate editor coordinating the review of this article and approving it for publication was Byung-Seo Kim.
Starting with the traditional brute-force matched-filter correlator, this paper presents computational efficiency improvements for a generic coherent receiver architecture where the matched-filter coefficients change on a burst-to-burst basis, offering lower-power, computationally efficient methods that achieve the same purpose. Similar analyses have evaluated the reduced-computation processing of the semi-coherent chaotic carrier shift keying (CSK) waveforms [13]. There, however, the timing and phase are effectively coherent.
The core fallthrough correlator design model is provided in Section II. Enhancements for reduced-precision correlations, optimally pruned coefficients, and variable-length operations are all considered in Section III. Measurable results from hardware prototypes built on an Intel Arria 10 field-programmable gate array (FPGA) are presented in Section IV, offering simpler methods for asynchronous reception, underlay-based watermark validation [14], and receiver-assigned code-division multiple access (CDMA) operations [15] in IoT-caliber devices. Finally, conclusions can be found in Section V.
II. COMPUTATIONAL MODEL
The time-based evolution of matched-filter coefficients eliminates many of the standard methods for collapsing the digital logic structure to take advantage of a priori known cyclostationary preamble signal features. In addition, the multi-bit resolution spreading codes used to generate the arbitrary-phase spread spectrum waveforms, such as the chaotic sequence spread spectrum (CSSS) [9], [10] or high-order PSK signaling (HOPS) [16] waveforms, increase the overall complexity of the complex-conjugate multiplications (correlations) in contrast to 2-ary or 4-ary chip phase direct sequence spread spectrum (DSSS) signals [17]. To support the discussion, consider the matched-filter correlator model shown in Fig. 1, where each x is a complex-valued received signal sample and each y is a matched-filter coefficient taken from the internally generated preamble signal replica.
This model is similar to a direct form finite impulse response (FIR) filter, where a fully pipelined set of outputs is derived from incoming samples as they progress through the delay line structure. With FIR filters, significant improvements may typically be made by (a) exploiting symmetry of wisely chosen coefficients (pre-additions), (b) eliminating sufficiently small / zero coefficients (pruning), (c) mapping static coefficients to shift-adds via canonic signed digit (CSD) representation [18], (d) employing computationally efficient multi-rate processing structures [19], and (e) variably controlling the coefficient word widths.
Within the correlation calculation, these traditional simplifications are limited: (a) the correlation of a preamble without any cyclostationary features cannot have easily exploitable symmetries, (b) coefficients may be pruned, though the correlation taps generally contribute a similar amount of energy to the composite correlation value, (c) CSD mappings may also be applied, but must be dynamically addressable with variable barrel shifters, (d) the notionally fixed sample rate hinders any multi-rate signal processing benefit, and (e) the coefficient word width in hardware will need to support the largest width that the coefficient may ever be, burst-to-burst.
In addition to these distinctions from a standard FIR filter, the logic within the fallthrough correlator must support clocking in of new coefficients on a burst-by-burst basis, so that they are in place and ready for correlation processing when the next sample arrives. Under the assumption of normalized inputs, the correlator output response Z ideally triggers based on a defined correlation peak having magnitude equal to the average chip energy times the length of the correlation. The coherent preamble signal is trivial to normalize, while the incoming received signal is variable and highly dependent on any system gains that may occur prior to the correlator. This is particularly important for spread spectrum systems, since the signal often operates at or below the ambient noise floor of the receiver and allows for power level estimates of the incoming signal and/or its multipath components based on the magnitude of the resulting correlation peak(s).
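The correlator model and its normalized peak can be illustrated with a small software sketch. This is only a behavioral model under stated assumptions, not the hardware design: `make_preamble` is a hypothetical stand-in that draws unit-magnitude chips from 2^k equally spaced phases using a seeded pseudo-random generator (not the digital chaotic generator of the paper), and the 64-chip length and zero padding are arbitrary illustration values.

```python
import cmath
import random
from collections import deque


def make_preamble(n_chips, k=8, seed=1):
    """Hypothetical replica generator: unit-magnitude chips with phases
    drawn from 2**k equally spaced points (illustration only)."""
    rng = random.Random(seed)
    m = 2 ** k
    return [cmath.exp(2j * cmath.pi * rng.randrange(m) / m) for _ in range(n_chips)]


def fallthrough_correlate(samples, coeffs):
    """Return |Z| after each incoming sample.

    The delay line holds the most recent len(coeffs) samples (oldest first);
    every tap forms y * conj(x) and all taps are summed.
    """
    n = len(coeffs)
    delay = deque([0j] * n, maxlen=n)
    out = []
    for x in samples:
        delay.append(x)  # the oldest sample falls out of the delay line
        z = sum(y * xk.conjugate() for y, xk in zip(coeffs, delay))
        out.append(abs(z))
    return out


preamble = make_preamble(64)
rx = [0j] * 10 + preamble + [0j] * 10   # noiseless, normalized input
mags = fallthrough_correlate(rx, preamble)
peak_index = max(range(len(mags)), key=mags.__getitem__)
```

With unit-energy chips, the peak magnitude equals the correlation length (64 here), and it occurs on the sample at which the last preamble chip enters the delay line.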
The next distinction is that of phase rotations, with particular focus on center frequency offsets. The static phase rotation may be detected from the phase offset of the correlation peak (referenced to the center of the correlation window) and subsequently corrected prior to despreading. Frequency offset, on the other hand, requires comparison of multiple sub-correlation values throughout the correlator structure, so that the phase rotations may be measured as a function of time and translated, via the known sample rate, to an instantaneous frequency offset that can be applied to the remainder of the pulse. If the frequency offset causes the correlation values to drift more than ≈ π/2 radians over the duration of the preamble, then the integration process underlying the addition of the taps will begin to fail.
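The sub-correlation approach to frequency offset estimation can be sketched as follows. This is a minimal illustration, assuming an already time-aligned window, unit-magnitude chips, no noise, and hypothetical values for the sample rate, the offset, and the number of sub-correlations. The helper `max_tolerable_offset` simply restates the ≈ π/2 drift limit as 2π·f·N/f_s < π/2.

```python
import cmath
import math
import random


def estimate_freq_offset(samples, coeffs, n_sub, sample_rate):
    """Estimate the frequency offset (Hz) from the phase step between
    adjacent sub-correlations of an aligned preamble window."""
    seg = len(coeffs) // n_sub
    subs = [
        sum(coeffs[i] * samples[i].conjugate() for i in range(s * seg, (s + 1) * seg))
        for s in range(n_sub)
    ]
    # Angle of the summed neighbor products is the average phase step per segment.
    acc = sum(b * a.conjugate() for a, b in zip(subs, subs[1:]))
    step = cmath.phase(acc)
    # With z = y * conj(x), a positive offset rotates the taps negatively.
    return -step * sample_rate / (2 * math.pi * seg)


def max_tolerable_offset(sample_rate, n_taps):
    """Offset at which the phase drifts pi/2 across n_taps samples."""
    return sample_rate / (4 * n_taps)


rng = random.Random(7)
coeffs = [cmath.exp(2j * math.pi * rng.randrange(256) / 256) for _ in range(1400)]
fs, f_true = 1.0e6, 100.0  # hypothetical sample rate and offset
rx = [y * cmath.exp(2j * math.pi * f_true * n / fs) for n, y in enumerate(coeffs)]
f_est = estimate_freq_offset(rx, coeffs, n_sub=8, sample_rate=fs)
f_limit = max_tolerable_offset(fs, len(coeffs))
```

The estimate is unambiguous only while the phase step between adjacent sub-correlations stays within ±π, so using more (shorter) sub-correlations widens the measurable range at the cost of noisier individual phase measurements.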
The final distinction, timing uncertainty due to phase noise or oscillator drift, is also not supported by this FIR structure. Practical clocks (<100 ppm) will tend not to drift beyond that which is supported, and the detection of future preambles will have unique starting sample points, making only the short-term stability of individual bursts relevant.
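As a rough check of the drift claim, assume (hypothetically) one sample per chip, so that the 1400-tap preamble of the Section IV prototype spans N = 1400 samples, and take a worst-case relative clock error of δ = 100 ppm. The timing slip accumulated over one preamble is then:

```latex
\Delta n = \delta \cdot N = \left(100 \times 10^{-6}\right) \cdot 1400 = 0.14 \ \text{samples}
```

Under these assumptions the slip stays well below one sample within a single burst, which is consistent with treating only short-term stability as relevant.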
III. FALLTHROUGH CORRELATION TECHNIQUES
The chief focus of this paper is on the computational efficiency improvements that may be made to the fallthrough correlator to achieve reasonably solid performance from a minimum amount of hardware. In particular, the allowable resource and performance trades are considered relative to the hardware baseline of a brute-force design that implements the complex multiplication $z = (y_I + j y_Q)(x_I - j x_Q)$ using the three real-multiplier reduction in (1) and (2), where $z_I + j z_Q$ is a partial sum.
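One common way to form this conjugate product with three real multipliers is sketched below. The specific grouping of terms is an assumption chosen for illustration and may differ from the form used in (1) and (2); the function name is hypothetical.

```python
def conj_multiply_3mult(y_i, y_q, x_i, x_q):
    """Compute z = (y_i + j*y_q) * (x_i - j*x_q) with three real multiplies.

    Returns (z_i, z_q). The grouping below is one standard arrangement,
    trading one multiplier for extra additions.
    """
    k1 = x_i * (y_i + y_q)   # multiply 1
    k2 = y_i * (x_i + x_q)   # multiply 2
    k3 = y_q * (x_i - x_q)   # multiply 3
    z_i = k1 - k3            # = y_i*x_i + y_q*x_q
    z_q = k1 - k2            # = y_q*x_i - y_i*x_q
    return z_i, z_q


# (3 + 4j) * conj(5 + 6j) = (3 + 4j) * (5 - 6j) = 39 + 2j
z_i, z_q = conj_multiply_3mult(3, 4, 5, 6)
```

With three multipliers per tap, the 1400-tap prototype preamble described in Section IV requires 4200 multiply operations per sample, which matches the count quoted there.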
To adapt this model to an IoT-relevant context, the following series of identified improvements may be incorporated.
VOLUME 7, 2019
A. TRUNCATED COEFFICIENTS
The precision of the matched-filter coefficients may be reduced with acceptable detection loss, even for arbitrary-phase waveforms.¹ Such truncation must account for the full processing chain of the transmitted signal, including any interpolation, prior to transmission. Since each arbitrary-phase spreading chip is taken from an allowable set of discretized phase points on the unit circle [16], truncation in both the in-phase (I) component $y_I$ and quadrature (Q) component $y_Q$ will introduce amplitude and phase mismatch loss to the calculation. Using a bit precision of ≥ 6 bits gives almost no performance loss, while truncation to 1-bit coefficients provides the largest computational gains, offering a hardware structure resembling the correlator of DSSS signals.
Choosing the 1-bit truncated coefficients, the correlation logic may be implemented as four negations of $\{x_I, x_Q\}$ based on $\operatorname{sign}\{y_I, y_Q\}$ followed by two additions. However, any quantization effects of truncation should be considered prior to processing the correlation peak for received signal power estimations. For the hardware prototypes, the overall HOPS waveform is constant envelope (i.e., $x_I^2 + x_Q^2 = 1\ \forall x$), and the output response can be scaled by the reciprocal of the expected coherent signal correlation $E[|x_I| + |x_Q|] = 2 \cdot E[|x_I|] = 4/\pi$ to correct for this distortion.
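The 1-bit tap and the 4/π scaling can be checked with a short sketch. It assumes a noiseless, perfectly aligned constant-envelope input and, for illustration, uses all 2^8 phase points rotated by half an LSB (so that no component is exactly zero); the function name and test set are hypothetical and not taken from the prototype HDL.

```python
import cmath
import math


def one_bit_tap(y, x):
    """Tap output (sign(y_I) + j*sign(y_Q)) * conj(x) using only
    conditional negations and two additions (no multipliers)."""
    s_i_pos = y.real >= 0
    s_q_pos = y.imag >= 0
    a = x.real if s_i_pos else -x.real    # sign(y_I) * x_I
    b = x.imag if s_q_pos else -x.imag    # sign(y_Q) * x_Q
    c = x.real if s_q_pos else -x.real    # sign(y_Q) * x_I
    d = -x.imag if s_i_pos else x.imag    # -sign(y_I) * x_Q
    return complex(a + b, c + d)


K = 8
M = 2 ** K
# Every allowed phase once, offset by half an LSB (illustrative choice).
chips = [cmath.exp(2j * math.pi * (m + 0.5) / M) for m in range(M)]

peak = sum(one_bit_tap(y, y) for y in chips)   # aligned, noiseless input
mean_tap = peak / M                            # average contribution per tap
scaled = mean_tap * (math.pi / 4)              # reciprocal of 4/pi
```

Each aligned tap contributes |x_I| + |x_Q| to the real part, so the average per tap approaches 4/π ≈ 1.273 for uniformly distributed phases, and multiplying by π/4 restores a unit-normalized peak.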
B. DYNAMIC PRUNING
Upon definition of the matched-filter coefficients, if either component in I or Q does not contribute a meaningful amount of energy to the correlation, then the coefficient may be collapsed into a single real-valued or imaginary-valued correlation tap. For the spread spectrum waveforms with amplitude-varying chips, this pruning may be pursued consistent with the selective noise cancellation techniques described in [20], while for the constant envelope modulations, a parameter $\lambda$ can be defined to represent the amplitudes of components to be discarded, as shown in Fig. 2.
In any scenario where $\min\{|y_I|, |y_Q|\} < \lambda$, it follows that $\max\{|y_I|, |y_Q|\} > \sqrt{1 - \lambda^2}$, resulting in the detection performance loss shown as a function of $\lambda$ in Fig. 3. The simulated loss of 0.87 dB at the median value $\lambda = \sqrt{2}/2$ allows a simplified pruning process equivalent to selecting the larger of $\{|y_I|, |y_Q|\}$ for correlations with the received signal. In other words, (1) and (2) are each reduced to a form that retains only the larger-magnitude coefficient component.²
¹ The arbitrary-phase nature of these waveforms requires on the order of $2^k$ allowable phase words, with $k \geq 8$.
² By using only correlation taps that do not contain points at $|y_I| = |y_Q|$, which is easily achieved by rotating the entire set of allowed discretized points by the phase of one half LSB, a strict maximum may be achieved. In the case where the two values are equal (within the chosen comparator's precision), the choice of which one to take forward is arbitrary.
With 1-bit truncated $\lambda = \sqrt{2}/2$ pruned coefficients, the correlation logic may be implemented as four negations of $\{x_I, x_Q\}$ based on $\operatorname{sign}\{y_I, y_Q\}$ followed by a selection of the complex-valued output driven by $\max\{|y_I|, |y_Q|\}$. In addition to the further hardware savings, pruning at the median value eliminates the amplitude mismatch by rotating the allowable taps onto the axes and reduces any phase mismatch to within $[-\pi/4, \pi/4]$, giving a degradation reduction of 3 dB over truncated coefficients without pruning. Using a smaller value for $\lambda$ provides only marginal performance increases and requires dynamic placement of the adders; allocating these adders on a burst-to-burst basis is likely to take more logic than simply provisioning all taps with the same two-adder structure.
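The pruning rule and the 1-bit pruned tap can be sketched as follows. The loss function rests on an idealized assumption that is not taken from the paper: an aligned unit-magnitude signal in white noise, for which the SNR loss equals the fraction of coefficient energy retained. Under that assumption the value at λ = √2/2 comes out close to the 0.87 dB quoted above, though the simulation setup behind Fig. 3 may differ. All names are hypothetical.

```python
import cmath
import math


def prune(y, lam):
    """Discard any I or Q component of a unit-magnitude coefficient whose
    magnitude is below lam (intended for 0 <= lam <= sqrt(2)/2)."""
    y_i = y.real if abs(y.real) >= lam else 0.0
    y_q = y.imag if abs(y.imag) >= lam else 0.0
    return complex(y_i, y_q)


def pruning_loss_db(lam, k=8):
    """Idealized SNR loss: retained coefficient energy, averaged over all
    2**k phase points rotated by half an LSB (assumption, see lead-in)."""
    m = 2 ** k
    ys = [cmath.exp(2j * math.pi * (i + 0.5) / m) for i in range(m)]
    retained = sum(abs(prune(y, lam)) ** 2 for y in ys) / m
    return -10 * math.log10(retained)


def one_bit_pruned_tap(y, x):
    """1-bit pruned tap: the coefficient collapses to +1, -1, +j or -j,
    so the output is conj(x) with a negation and/or an I/Q swap."""
    xc = x.conjugate()
    if abs(y.real) > abs(y.imag):
        return xc if y.real > 0 else -xc
    rotated = complex(-xc.imag, xc.real)   # j * conj(x)
    return rotated if y.imag > 0 else -rotated


loss_median = pruning_loss_db(math.sqrt(0.5))
points = [cmath.exp(2j * math.pi * (i + 0.5) / 256) for i in range(256)]
# Residual phase error between each coefficient and its collapsed version.
max_mismatch = max(
    abs(cmath.phase(y * one_bit_pruned_tap(y, 1 + 0j).conjugate())) for y in points
)
```

Because every collapsed coefficient lies on an axis with unit magnitude, the only remaining error is the phase mismatch, which the check above confirms is bounded by π/4 for the rotated phase set.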
C. FOLDED CORRELATION TAPS
The sequence of matched-filter coefficients may be folded by consciously aliasing the correlation taps onto one another, achieving a significant reduction in the digital logic dedicated to the long delay lines of the received signal. While the increase in false positives would be unacceptable for lightly spread signals, the self-interference characteristics of deeply spread signals allow for minimal performance loss. However, shorter preambles do have more difficulty in estimating phase rotations / frequency offsets, placing a practical minimum bound on the order of 2 symbols.
Consider the 4× folded correlator shown in Fig. 4, trading some additional control logic for a delay line that is effectively one-fourth of its original length. For the folded taps hardware prototype, the 1400 correlation taps are divided into four equal-length sets of 350 taps each. That is, $\{a_0, a_1, \ldots, a_{349}\} \equiv \{y_0, y_1, \ldots, y_{349}\}$, $\{b_0, b_1, \ldots, b_{349}\} \equiv \{y_{350}, y_{351}, \ldots, y_{699}\}$, and so on.
The control circuitry can be implemented in any number of ways. Within the context of this paper, the logic operates as follows. Each time a new incoming sample is clocked into the delay line, the control selects one of the four sub-preamble sequences to be used for correlations in that sample clock cycle. The selection is based on the pipeline decision state of previous sub-preamble detections, with reference to when the sample that just exited the delay line was the incoming sample. If a detection was triggered for the exiting sample, then the correlation taps progress to the next sequence (until another new sample is clocked in). Signal timing is acquired after four sub-preamble detections have triggered in succession.
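The control behavior described above can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the prototype logic: the input is noiseless and amplitude-normalized, the detection threshold (0.7 of the ideal sub-correlation peak) is an arbitrary illustration value, and the 128-chip preamble is a seeded pseudo-random stand-in for the chaotic replica.

```python
import cmath
import random
from collections import deque


def folded_acquire(samples, preamble, folds=4, threshold=0.7):
    """Return sample indices at which all `folds` sub-preambles have been
    detected in succession using one shortened delay line."""
    seg = len(preamble) // folds
    subs = [preamble[s * seg:(s + 1) * seg] for s in range(folds)]
    delay = deque([0j] * seg, maxlen=seg)    # shared delay line, 1/folds length
    history = deque([0] * seg, maxlen=seg)   # decision states of the last seg samples
    acquired = []
    for n, x in enumerate(samples):
        delay.append(x)
        # Decision made when the sample that just exited was the incoming one.
        stage = history[0]
        z = sum(y * xk.conjugate() for y, xk in zip(subs[stage], delay))
        detected = 0
        if abs(z) >= threshold * seg:
            detected = stage + 1             # progress to the next sub-preamble
            if detected == folds:
                acquired.append(n)           # timing acquired
                detected = 0
        history.append(detected)
    return acquired


rng = random.Random(3)
preamble = [cmath.exp(2j * cmath.pi * rng.randrange(256) / 256) for _ in range(128)]
rx = [0j] * 20 + preamble + [0j] * 20
acq = folded_acquire(rx, preamble)
```

In this model the only per-sample state beyond the delay line is a small decision counter, which mirrors the trade of additional control logic for a shorter delay line; acquisition is declared on the sample at which the final preamble chip arrives.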
IV. HARDWARE PROTOTYPE VALIDATION
A selection of hardware prototypes was built for reception of the arbitrary-phase HOPS spread spectrum waveform [16] and implemented on an Intel Arria 10 SoC FPGA, including: (1) a brute-force³ matched-filter model, (2) a 1-bit truncated coefficients model, (3) a truncated coefficients model with $\lambda = \sqrt{2}/2$ pruning, and (4) a truncated pruned coefficients model with 4× folded correlation taps. The HOPS signals are constructed in hardware using digital chaos-based spreading codes taken from an arbitrary uniform distribution of $2^k$ equally-spaced phase words. Given that the hardware prototypes employ $k = 8$, similar results can be expected for any digital chaotic sequence-based spread spectrum waveform ($k \geq 8$).
³ The hardware prototype HOPS system employs an 8-symbol preamble and a 175-chip spread ratio. Since the Arria 10 FPGA is limited to 3374 multipliers, the $3 \cdot 8 \cdot 175 = 4200$ multiply operations required by the brute-force design are clocked at a higher rate to fit on the device.
Of primary interest are the hardware reductions achieved by the computational enhancements. The comparative utilization numbers in Table 1 were taken from the relevant Quartus fitter reports and focus on the use of adaptive logic modules (ALMs), combinational adaptive look-up tables (ALUTs), dedicated registers, and digital signal processing (DSP) blocks. The most significant reduction is the elimination of hardware multipliers, a major advantage of truncation to 1-bit coefficients. An application-specific design could likely benefit more from the adder-less $\lambda = \sqrt{2}/2$ pruned correlations, since it is not limited by the static embedded structure of an FPGA, although the 17% ALM reduction from (2) to (3) is notable. Model (4) provides the most dramatic hardware reduction, where the 70% ALM reduction from (3) to (4) is on par with a correlator of one-fourth the size.
Also of interest is the measured preamble detection performance for the hardware prototypes. All of the non-correlator modules were synthesized using the same Verilog hardware description language (HDL) source, including the actual phase / frequency offset estimator circuits. The thresholding scheme does behave slightly differently between variants; to ensure accurate results, the trigger level was set by empirically searching for a threshold that gives the best performance without returning false detections.
The measured probability of detection ($P_D$) is shown with respect to signal-to-noise ratio (SNR) in Fig. 5, with the degradation at $P_D = 0.9$ highlighted in Fig. 6 and Fig. 7. As expected, the computationally reduced models do yield reduced performance, yet in a very controlled manner. Given the transformation of complex multipliers to sign-selected adder trees, the loss of 5.49 dB for (2) is tolerable. Model (3) reduces the degradation by 3.43 dB, demonstrating the inherent noise cancellation properties of the amplitude-selective collapse of truncated coefficients. The 2.10 dB performance loss of (4) is the most promising, offering performance almost identical to (3), while providing an overall 76% ALM reduction, requiring 82% fewer dedicated registers, and using no DSP blocks.
V. CONCLUSION
This paper proposed a variety of candidate improvements to the brute-force fallthrough correlator structure, allowing significant computational efficiency improvements and hardware utilization reductions with minimal degradation to preamble detection performance. The truncation of multi-bit precision matched-filter coefficients to 1 bit offers a consolidation of FPGA resources from the brute-force 4200 embedded multipliers and 79000 ALMs to no multipliers and 72801 ALMs. The amplitude-selective collapse of complex-valued coefficients into a single real or imaginary correlation tap further reduces the hardware logic for an overall detection loss of 2.06 dB. Achieving the most substantial hardware reductions is the 4× folded structure with pipelined detection decisions, using no multipliers and 76% fewer ALMs overall, for the trade of only a 2.10 dB performance loss. Moreover, this approach is completely extensible to the Gaussian-shaped digital chaotic spread spectrum signals.
The processing of outputs from computationally reduced correlators needs to consider the expected correlation peak loss in received signal strength estimations, while phase / frequency offset estimations will also be less accurate. By performing an on-time accumulation of the detected preamble signal, any performance loss can be mitigated at the cost of a single shared full-precision multiply-accumulate circuit and some added processing latency. The correlation of shorter preambles, as required by the folded correlation tap structure that gives the largest hardware logic reduction shown in this paper, will increase estimation error. In that case, storing sub-correlation peak outputs in appropriately sized reference registers is a potential solution, and the optimal sizing of these registers based on the allowable probability of false accept per stage is considered for future work.
