Abstract: A novel architecture for running cross-correlation and convolution using bitstream processing is proposed. The computationally intensive multiplications inherent in cross-correlation and convolution are replaced by simple logic operations (AND, XOR) on a bitstream representation. The reduced complexity enables compact and energy-efficient silicon solutions suitable for small, portable devices such as wearable heartbeat-detecting electronics embedded in the ECG patch itself.
Introduction:
The time-domain operation known as cross-correlation is a computationally intensive algorithm due to the large number of multiplications required.
The index I is the shift or offset parameter between two sampled signals x(n) and y(n).
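For reference, a common textbook form of the discrete cross-correlation that matches the description here can be written as follows. The shift index is written as $l$, the window length $N$ is a symbol introduced for illustration, and the sign convention of the shift is an assumption, since conventions differ between texts.

```latex
r_{xy}(l) = \sum_{n=0}^{N-1} x(n)\, y(n+l)
```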
For each shift in time, all n samples within the correlation window must be multiplied and the products accumulated (summed). In a system computing a running correlation, these n multiplications must be performed in parallel (or multiplexed at a higher clock rate). Since multiplication is a power-hungry (or slow) operation, power-efficient hardware for cross-correlation is challenging to make. However, by changing the representation or coding of the signal, hardware-efficient equivalents of running cross-correlators exist.
Bitstream processing: A popular data conversion technique is Δ−Σ modulation, which uses oversampling to move in-band noise to higher frequencies (noise shaping). As the sampling rate is increased, the precision required of the quantiser is relaxed. Addition of bitstreams has been proposed previously. In [2] and [3] more complex computations are proposed, such as reducing multiplication to simple AND gates. The use of bitstreams for filtering is also found in the literature [4]. In this letter we extend bitstream processing to general cross-correlation and convolution.
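To illustrate how an oversampled one-bit stream can carry a multi-bit value, the following is a minimal sketch of a first-order Δ−Σ modulator. The function names, the zero-order hold between input samples, and the mapping of bit 1 to +1 and bit 0 to −1 are assumptions made for this example, not details taken from the cited designs.

```python
def delta_sigma_first_order(samples, osr):
    """Encode samples in [-1, 1] as a 1-bit stream (1 means +1, 0 means -1).

    Each input sample is held for `osr` output bits (zero-order hold,
    an assumption of this sketch).
    """
    bits = []
    integrator = 0.0
    feedback = 0.0  # previous output level fed back to the input
    for x in samples:
        for _ in range(osr):
            integrator += x - feedback          # accumulate the coding error
            bit = 1 if integrator >= 0.0 else 0  # 1-bit quantiser
            feedback = 1.0 if bit else -1.0
            bits.append(bit)
    return bits


def bitstream_mean(bits):
    """Average of the bipolar (+1/-1) interpretation of the bitstream."""
    return sum(2 * b - 1 for b in bits) / len(bits)


if __name__ == "__main__":
    stream = delta_sigma_first_order([0.3], osr=1000)
    print(len(stream), round(bitstream_mean(stream), 3))
```

The running average of the bitstream tracks the input value, which is why a simple averaging filter can recover the signal after processing.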
In the following we will estimate the figure-of-merit (FOM) for the multiplier based 
Cross-correlation computation:
A feasible hardware implementation of discrete-time, linear cross-correlation using a standard digital approach is shown in the accompanying figure. A sequence of n samples (the history) must be stored for both the incoming signal and the template. A multiply-and-accumulate unit is multiplexed at n times the Nyquist clock rate.
The linear cross-correlation of length n is computed for each shift of the incoming signal by cycling through both the stored signal and the template and accumulating the result in the adder. This assumes a single-cycle multiplier, to avoid requiring an even higher clock frequency. In [5] a power-optimized and area-efficient array multiplier architecture is reported. A rough but optimistic estimate of the transistor count extracted from that paper is shown in Table 1, where m denotes the number of bits of the multiplier.
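The multiply-and-accumulate scheme described above can be sketched in software as follows. This is a behavioural illustration only: the function name and the zero-initialised history buffer are assumptions of the example, and the inner loop stands in for the n multiplexed multiplier cycles per input sample.

```python
from collections import deque


def running_cross_correlation(signal, template):
    """Running cross-correlation with a stored history of len(template) samples.

    For every new input sample the history is cycled through once,
    costing n multiplications and n additions.
    """
    n = len(template)
    history = deque([0] * n, maxlen=n)  # oldest sample first
    out = []
    for sample in signal:
        history.append(sample)           # shift in the newest sample
        acc = 0
        for h, t in zip(history, template):
            acc += h * t                 # one multiply-accumulate step
        out.append(acc)
    return out


if __name__ == "__main__":
    print(running_cross_correlation([1, 2, 3, 4], [1, 0, -1]))
```

The n multiplications per output sample in the inner loop are the cost that the bitstream approach aims to replace with single-gate operations.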
We have to include the transistors used for storing signal history and template profile.
The latches proposed in [6] are the most efficient in widespread use. We adopt a differential 12-transistor static flip-flop based on this topology. The multiplexer could be implemented using a bus structure, but for simplicity we implement a multiplexer tree.
With a typical Δ−Σ modulator [7] at OSR=8 and 14-bit resolution, the transistor count of the cross-correlator is reduced by a factor of 13. For OSR>32, minor or no improvements are expected from bitstream processing.
The continuous cross-correlation computed at the oversampling rate is mixed with significant high-frequency noise. As for all Δ−Σ modulators, decimation is required. A simple first-order low-pass filter (sinc) can be made by averaging the computed results down to the Nyquist rate. Depending on the modulator order, more elaborate decimators must be used. A higher-order decimator increases hardware complexity somewhat, but since moving-average filtering is often used in decimators, no multiplication is required. Simple adders are used, and the number of adders increases with the decimator order. The increased hardware demand is therefore minor.
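A first-order averaging decimator of the kind described above can be sketched as follows. The function name and the choice to discard an incomplete trailing block are assumptions of this example.

```python
def decimate_by_averaging(values, osr):
    """Average non-overlapping blocks of `osr` values (first-order sinc).

    Only additions and one fixed scaling per block are needed; in hardware
    the scaling is a bit shift when `osr` is a power of two.
    """
    usable = len(values) - len(values) % osr  # drop an incomplete last block
    return [sum(values[i:i + osr]) / osr for i in range(0, usable, osr)]


if __name__ == "__main__":
    oversampled = [1, 1, 0, 0, 1, 0, 1, 0, 9]
    print(decimate_by_averaging(oversampled, 4))
```

Higher-order decimators can be approximated by cascading such averaging stages, which keeps the structure free of multipliers.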
To evaluate the signal processing quality on real signals, a complete model of the bitstream cross-correlator was programmed in MATLAB and compared to the MATLAB xcorr() function. ECG measurements from the MIT-BIH database were used as the signal source (10 bits at 250 Hz). A sigma-delta bitstream was created using a simple first-order modulator with linear interpolation and 64 times oversampling. The template was created from the first beat and upconverted to a bitstream. The simple first-order decimation proposed above was used, aiming at reliable heartbeat detection.
The upper trace in the accompanying figure shows the cross-correlation using bitstream processing, while the lower trace is a "true" cross-correlation computed with the xcorr() function in MATLAB. Although the bitstream cross-correlation still contains some noise components, heartbeats are easy to detect using simple thresholding.
It should be noted that convolution can be done using the same hardware by simply reversing the template sequence. Since convolution in the time domain is equivalent to multiplication in the frequency domain, efficient filters can be realised by generating the appropriate template. Another improvement is to substitute the AND gate with an XOR gate, which also correlates the '0' states in the bitstreams. From a signal processing perspective, multiplication can be simplified to addition/subtraction since both signals are scaled.
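The gate-level substitution can be checked with a small sketch. It compares an AND-based correlation of unipolar bits with an XOR-based one that also counts coinciding '0' states. Interpreting bits as +1/−1 and converting the XOR mismatch count into a signed sum are assumptions of this example, as are the function names.

```python
def correlate_and(a_bits, b_bits):
    """Count positions where both bitstreams are 1 (unipolar product).

    Both streams are assumed to have equal length.
    """
    return sum(a & b for a, b in zip(a_bits, b_bits))


def correlate_xor(a_bits, b_bits):
    """Signed correlation from XOR: agreements minus disagreements."""
    mismatches = sum(a ^ b for a, b in zip(a_bits, b_bits))
    return len(a_bits) - 2 * mismatches


def correlate_bipolar_multiply(a_bits, b_bits):
    """Reference: map bits to +1/-1 and multiply explicitly."""
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(a_bits, b_bits))


if __name__ == "__main__":
    a = [1, 0, 1, 1, 0, 0, 1, 0]
    b = [1, 1, 0, 1, 0, 1, 1, 0]
    print(correlate_and(a, b), correlate_xor(a, b),
          correlate_bipolar_multiply(a, b))
```

In this sketch the XOR form reproduces the explicit bipolar multiplication exactly, using only a gate per bit pair and a counter, whereas the AND form registers only coinciding '1' states.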
Conclusion: In this letter we have presented a novel, generic hardware architecture for running cross-correlation/convolution using bitstream processing, suitable for power-efficient implementation in silicon. The advantage of bitstream processing is greatest for low oversampling ratios and high resolution.
