High-Performance GPU and CPU Signal Processing for a Reverse-GPS
  Wildlife Tracking System by Rubinpur, Yaniv & Toledo, Sivan
ar
X
iv
:2
00
5.
10
44
5v
1 
 [c
s.D
C]
  2
1 M
ay
 20
20
High-Performance GPU and CPU
Signal Processing
for a Reverse-GPS Wildlife Tracking System
Yaniv Rubinpur and Sivan Toledo
Blavatnik School of Computer Science, Tel-Aviv University
May 22, 2020
We present robust high-performance implementations of signal-processing tasks
performed by a high-throughput wildlife tracking system called ATLAS. The system
tracks radio transmitters attached to wild animals by estimating the time of arrival of
packets encoding known pseudo-random codes to receivers (base stations). Time-of-
arrival estimation of wideband radio signals is computatoinally expensive, especially
when it is not known when a transmitter transmits. These computation are a
key bottleneck that limits the throughput of the system. The paper reports on
two implementations of ATLAS’s signal-processing algorithms, one for CPUs and
the other for GPUs, and carefully evaluates their performance. The evaluations,
performed on two CPU platforms and on three GPU platforms, show dramatic
improvements relative to our baseline, a high-end desktop CPU that is typical of the
computers in current base stations. The improvements are both in terms of absolute
performance (more than 50X with a high-end GPU and more than 4X with a GPU
platform consumes almost 5 times less power than the CPU platform), in terms of
performance-per-Watt ratios (more than 16X), and in terms of price-performance
ratios.
1 Introduction
ATLAS is a reverse-GPS wildlife tracking system, targeting mostly regional move-
ment patterns (within an area spanning kilometers to tens of kilometers) and small
animals, including small birds and bats [14, 16]. ATLAS is a collaborative research
effort, not a commercial offering, but it is mature; 6 systems have been set up and
are operating in 5 countries and 3 continents. The first system has been operating
for about 6 years almost continuously (24/7).
ATLAS tracks wild animals by attaching to them miniature radio-frequency
(RF) transmitting tags [13, 15]. These transmissions are received by ATLAS base
stations that include a sampling radio receiver and a computer running Linux or
Windows. The computer performs signal processing on the RF samples, detects
transmissions from tags in the sample stream, estimates the time of arrival (ToA)
1
of the transmissions, and reports the reception times to a server via an internet
connection (usually cellular). The server uses multiple ToA reports of the same
transmission by different base stations to estimate the location of the tag at the
time of transmission.
The signal processing that ATLAS base stations performs is computationally
demanding and is one of the main limiting factors of the throughput of the system
(the number of tags that it can track and the number of localizations per second
that it can produce). The signal-processing algorithms were initially developed with
great attention to performance, but they run on the CPU and are mostly single
threaded.
This paper presents a new implementation of the ATLAS signal-processing code
that is tailored to graphical processing units (GPUs). We also evaluate the per-
formance of both the original code (with an improvement that was meant to allow
an FFT library to exploit multiple cores) and of the new GPU code on real data.
The evaluations, performed on two CPU platforms and on three GPU platforms,
show dramatic improvements relative to our baseline, a high-end desktop CPU that
is typical of the computers in current base stations. The improvements are both
in terms of absolute performance (more than 50X with a high-end GPU and more
than 4X with a GPU platform consumes almost 5 times less power than the CPU
platform), in terms of performance-per-Watt ratios (more than 16X), and in terms
of price-performance ratios. We expect base station computers with GPUs to cost
a little more than GPU-less base stations, but the cost increase is only moderate
(at the low-end, the cost increase appears to be caused by the small market for the
GPU platforms; at the mid and high ends, due to the additional cost of the GPU).
The rest of this paper is organized as follows. Section 2 provides background
on ATLAS base station operation, and in particular on how signal processing tasks
are scheduled. Section 3 explain what the signal-processing tasks actually do, with
an emphasis on the computationally-demanding parts. Section 4 describes how the
algorithms are realized on CPUs and GPUs and the techniques that we employ to
achieve high performance. Our experimental evaluation and its results are presented
in Section 5, and our conclusions and recommendations to users in Section 7.
2 Background
ATLAS base stations receive complex RF samples from a sampling radio. The radio
down-converts the amplified and bandpass-filtered signal from the antenna using a
quadrature mixer, producing a complex analog RF baseband signal that is sampled,
digitally filtered, and decimated. Typical physical sampling frequencies are 32 Ms/s
or 100 Ms/s and typical post-decimation sample rates are 8 Ms/s and 8.33 Ms/s.
The computer at the base station receives a continuous stream of complex 16-bit
integer samples (at a rate of 8 or 8.33 Ms/s). Each packet of samples includes an
accurate time stamp, typically produced by a GPS-disciplined clock. The clock is
triggered to set its value to that of the next whole second by a pulse produced by the
GPS receiver every whole second. Considering typical cable delays and the variance
of the timing of the whole-second pulse, the accuracy of the time stamp is within
tens to hundreds of nanoseconds. (ATLAS does not use these time stamps directly
2
for localization, only to associate detections of the same transmission by different
base stations [11, 16].)
A C-language library collects the continuous stream of complex 16-bit integer
samples and places them into a cyclic buffer that normally contains about 25 s
worth of samples (about 800 MB). The buffer’s data structure allows the library to
find samples from a particular range of absolute times.
The signal-processing scheduler in a base station is implemented in Java and
is driven by two work queues: the acquisition (searching) queue and the tracking
queue. The base station is told which tags it needs to track, what they transmit,
and what is their transmit schedule. The specification of transmission includes the
center frequency, the modulation (frequency-shift keying, FSK, or phase-shift keying,
PSK), the deviation (for FSK only), the symbol rate, and the unique pseudo-random
pattern that each tag transmits. Typical symbol (chip) rates are near 1 Mb/s and
typical pseudo-random sequences are 8192-bit long, so transmissions last about 8 ms.
The transmit schedules are usually periodic with tags transmitting every second, 2 s,
4 s, or 8 s. The phase of the transmit cycle is arbitrary and not known. Once a
transmission from a particular tag is detected, the scheduler knows the phase of the
cycle for that tag and can track it by performing the necessary signal processing on
small windows of RF samples that cover the next expected transmission time with
a small margin designed to account mainly for inaccurate transmit times caused by
limitations of the real-time-clock of the tag, and for limitations of its firmware; these
inaccuracies are typically much larger than the propagation delays. We typically set
the margins to 2 ms at both ends of the expected transmission; this is larger than
necessary, but because the transmission itself lasts 8 ms, shortening the margins
would not produce significant computational savings.
Base stations can discover the transmit phase of a particular tag in one of two
ways. One is through searching, processing as much of the incoming stream of RF
samples as possible in an attempt to detect a transmission from the tag. Thus, any
tag that the base station is told to track but whose transmit phase is unknown is put
into the search queue. Whenever the scheduler decides to perform a searching task,
it instructs the signal processing code to correlate a window of RF samples against
the pseudo-random codes of all the tags in the search queue. If a tag is detected in
this window of RF samples, its transmit phase is registered and it is moved to the
tracking queue. These RF windows are typically 100 ms long, for efficiency. When
the scheduler decides to search again, it will attempt to process the next window of
RF samples, with a 10 ms overlap that guarantees that the transmission of a tag
is fully contained in some window, to maximize the magnitude of the correlation.
It may (and does) happen that by the time the scheduler attempts to process a
particular window of RF samples, these samples are no longer in the cyclic buffer
that holds the samples. In this case, the scheduler moves to the freshest window of
samples in the buffer and continues from there.
Base stations can also be informed of the transmit phase of a tag by the server
of the ATLAS system; this happens when another base station detects the tag and
reports this detection to the base station. In that way, the acquisition load is shared
among the base stations.
Once a base station knows the phase of transmissions from a particular tag, it
3
moves the tag to the tracking queue. When the scheduler decides to perform a
tracking task, it extracts the earliest tracking task from the queue and instructs the
DSP code to correlate a small window of RF samples, typically 12 ms, against the
pseudo-random code of the one tag associated with that task. If the required samples
are no longer in the buffer, the scheduler discards the task and tries the task that
is now earliest. The early-first policy minimizes the likelihood of discarding tasks
whose data is no longer in the buffer, but it can increase the processing latency of
transmissions.
To summarize, searching-mode tasks process 100 ms worth of RF samples and
correlate them against the codes of all the tags in the search queue; tracking-mode
tasks typically process 12 ms worth of samples and correlate them against one code.
The scheduler allocates processing time to searching and tracking tasks according
to a fixed ratio, usually one to one. That is, during periods in which both queues are
non-empty, the base station allocates equal time to searching and tracking, alternat-
ing between the two whenever necessary to keep the processing times balanced, but
without preemption. When one of the queues is empty, the base station naturally
devotes all of its resources to the other queue.
The scheduler is sequential and devotes all the computational resources of the
base station to a single task at a time. This has kept the code relatively simple
and very reliable and it has simplified the optimization of the signal processing code
(which does not have to worry about concurrent tasks competing for resources).
Clearly, this strategy may lower performance if a single task cannot utilize all the
computational resources (cores).
3 Detection and Time-of-Arrival Estimation in AT-
LAS
The algorithmic building blocks that ATLAS uses to detect and to estimate the
time-of-arrival of FSK-modulated transmissions are as follows:
1. Conversion of the complex RF samples, represented by a pair of 16-bit integers
each, to a single-precision (int16_t) complex vector x.
2. Optionally, the complex samples are multiplied elementwise (Hadamard) by a
complex input vector l representing a local-oscillator signal, which shifts the
input signal so that its center frequency is zero. That is, we replace x← x⊙ l
(for all i, xi ← xi · li).
3. Next, an optional bandpass FIR (finite impulse response) filter, which we
represent here by HBP, the circulant matrix that represents the filter, so y ←
HBPx. We use filters with 200 coefficients.
4. Two matched filters are applied to y (or to x if no bandpass filter is used),
one that represent a single-bit (chip) period at the frequency representing a
1 symbol and one that represent a single-bit period at the frequency that
represents a 0 symbol. These matched filters H1 and H0 are short, typically 8
coefficients. We denote the output of the filters by f1 = H1y and f0 = H0y.
4
For phase-shift keying (PSK) transmissions that are processed in differential
mode, we use a similar strategy using two-chip-long filters, one matching a
continuous carrier and the other a carrier with a 180-degree phase shift; see [8]
for details.
5. Now the two complex match-filtered vectors f1 = H1y and f0 = H0y are
used to demodulate the transmission in two different ways, normalized and
unnormalized,
d = (|f1| − |f0|)⊘ (|f0|+ |f0|)
u = (|f1| − |f0|)
(elementwise absolute value, elementwise subtraction and addition, and ele-
mentwise division). Both of these signals are real. The normalized demodu-
lated vector ranges between −1 and +1 and is insensitive to the amplitude of
the incoming transmission; this makes it relatively insensitive to short bursts
of strong interference. The vector d, is used to detect transmissions and esti-
mate their arrival time. The unnormalized vector, u, is linear in the amplitude
of the incoming signal (and is hence vulnerable to interference). It is only used
to estimate the power of the incoming signal.
6. The algorithm applies exactly the same steps to a synthetic noise-free zero-
padded signal r(c) that represents an FSK packet with the same modulation
parameters and a (finite) pseudo-random bit sequence c, resulting in a real
normalized demodulated vector d(c). The lengths of d, u and d(c) are identical.
The replica d(c) is precomputed and stored.
7. The real demodulated vector d is cross-correlated with d(c). The cross corre-
lation is cyclic for efficiency, but the next stages only consider shifts that do
not involve a wrap-around (that is, we only use the part of the output that
is identical to the output of zero-padded cross correlation). This is correct in
searching mode because of the overlap of RF windows, and correct in tracking
mode because the window is guaranteed to include the transmission such that
no overlap occurs. The cross correlation vector is also real.
8. Now the algorithm computes the value and location j of the maximum of
the absolute value of the cross correlation vector, j = argmaxi
∣∣xcorr(d, d(c))
∣∣.
This is used to collect the samples of xcorr(d, d(c)) around j, which are sub-
sequently used to estimate the arrival time of the incoming signal, and to
compute quantities that are used to estimate the signal-to-noise ratio (SNR)
and the energy of the signal. More specifically, assuming that the nonzero part
5
of d(c) spans its first n elements, the computed quantities are
wc =
n∑
i=0
d
(c)
i di+j ,
q =
n∑
i=0
d2i+j , and
pc =
n∑
i=0
d
(c)
i ui+j .
The relative signal strength of the incoming packet is estimated to be (|pc|/‖d
(c)‖)/g,
where g is the receiver’s gain setting. ATLAS estimates the SNR of the de-
modulated signal d, not of the baseband signal, because this ratio is what
controls the accuracy of the arrival-time estimate (more precisely, this ratio
controls the Cramer-Rao lower bound on the variance of the arrival-time es-
timate). The estimated normalized signal amplitude is ac = |wc|/‖d
(c)‖, the
estimated signal-plus-noise energy is e = q − (ac)
2, and the estimated SNR
is a2c/e. The signal processing library returns to the Java code the quantities
j, wc, q, pc and the elements of xcorr(d, d
(c)) that span two bit periods around
j; these 16 or so samples include the entire peak of the cross-correlation vec-
tor; they are used by the Java code to estimate the sub-sample time of arrival
using interpolation.
The Java code uses the SNR to decide whether a window of RF samples contains a re-
ceived transmission with code c; the specific decision rule is explained by Leshchenko
and Toledo [8].
4 High-Performance Implementation of the Signal
Processing Algorithms
We now describe both the CPU implementation and the GPU implementation of
the algorithms described in Section 3.
4.1 The CPU Implementation
The CPU implementation follows certain principles, both algorithmic and structural,
to achieve high performance. Most of these principles are well-known principles in
high-performance computing. The algorithmic principles include:
• Using the fast-Fourier transform (FFT) to compute cross-correlation and to
apply FIR filters with many coefficients, so xcorr(d, d(c)) = ifft(fft(d)⊙fft(d(c))).
• Composed FIR filters are combined; we compute the coefficients of HBPH1
and HBPH0 and apply each combined filter to x using one invocation of the
FFT and one invocation of the inverse FFT.
6
• Using the overlap-add method to apply medium-length filters. This method
breaks a single filter invocation or a single cross-correlation computation into
multiple overlapping applications, where the length of the overlap is at least
the length of the filter. This reduces the cost of filtering a signal of length
m by a filter (or cross-correlation pattern) of length n using the FFT from
Θ(m logm) to Θ(m log n). This represents a significant savings when n≪ m.
When n is a small constant, it is faster to apply the filter directly rather than
using the FFT. Thus, when band-pass filters are used (the usual case), ATLAS
uses the FFT-based overlap-add method on the combined HBPH1 and HBPH0,
which have around 200 coefficients. If no band-pass filter is used, we apply H1
and H0 directly. The overlap-add method is not used to compute xcorr(d, d
(c)),
because in this case m/n is too small to yield a performance improvement.
• Padding of input vector and selection of the parameters for the overlap-add
method that guarantee that input sizes for FFT routines are multiples of small
integers only, usually only 2, 3, 5. This ensures small constants in the FFT
algorithm and avoids a large padding (which would have occurred if we always
pad to powers of 2).
The structural principles that ATLAS uses in C code running on a CPU to achieve
high performance are:
• Use of a comprehensive high-performance FFT library, namely FFTW [5]. By
comprehensive we mean a library that supports all the variants of the FFT
that we need, including real-to-complex, complex-to-real, and multiple FFTs
in one call (for the overlap-add method and for batched cross correlations,
described below).
• Aggressive reuse of allocated arrays. The signal-processing code allocates ar-
rays when it needs to, but normally does not release them; instead, it reuses
them. For example, when asked to perform a demodulation with input pa-
rameters that were yet used, the code allocates all the temporary arrays that
are required to perform the demodulation. These arrays are subsequently used
whenever a window of RF samples needs to undergo demodulation with the
same parameters. The relevant parameters include the RF window size, the
local-oscillator frequency, and the coefficients of HBP, H1, and H0. This prin-
ciple is meant to reduce the overhead of memory allocation, to reduce cache
misses, and to reduce the overhead of FFTW’s planning calls. FFTW requires
that FFT calls be pre-planned; the planning phase optimizes subsequent calls.
The planning is specific to a particular size but also to particular array ad-
dresses, to allow FFTW to use alignment-specific optimizations.
• Alignment of array addresses with cache lines (addresses divisible by 64). This
helps FFTW achieve high performance.
• Precomputation of auxiliary vectors, such as fft(d(c)). This vector is computed
when the code uses a code c for the first time on a particular RF window size.
The vector is retained in an array associated with both c and with the window
size.
7
• Fusion of loops over the same arrays, to reduce cache misses. For example, wc
and pc are computed in one sweep over d
(c). Loop fusion also helps eliminate
temporary arrays, as explained next.
• Elimination of temporary arrays, to reduce memory usage and cache misses.
For example, to demodulate a window of RF samples, the algorithm computes
the vector f1 and saves |f1|. Now the code computes f0 (using an FFT, and also
using the same input and output arrays for the FFT), but does not compute
or store all of the vector |f0|. Instead, the algorithm computes elements of |f0|
one at a time and uses them to compute the corresponding elements of d and
u.
• Batching of cross correlation operations. The interface between the Java sched-
uler and the signal-processing code allows the scheduler to batch requests for
cross correlation operations on vectors of a given length. The scheduler passes
two arrays, one containing the indices of demodulated windows of RF, and
the other containing indices of transformed replicas, where all the arrays con-
taining the input vectors have the same length. The signal processing code
uses a two dimensional array to perform multiple inverse FFTs on multiple
inputs in one call to FFTW. To amortize the cost of memory allocation and
FFT planning, the number of FFTs performed in one such operation is fixed
and is equal to the number of physical cores that the code uses. The code
performs these multiple FFTs until all the requested cross correlations have
been computed. If the number of cross correlations is smaller than that used
in the multiple-FFT plan, the cross correlations are performed one by one.
We note that the reuse of allocated arrays constrains parallelism, because opera-
tions that are fundamentally independent, such as correlations of different codes in
different RF windows, use the same temporary arrays. This design decision was
motivated by the sequential structure of the scheduler.
4.2 The GPU Implementation
To explain how we implemented the same algorithms for GPUs, we first provide a
brief overview of the fundamentals of both the NVIDIA GPU architecture and the
CUDA programming environment that is used to program NVIDIA GPUs.
NVIDIA GPUs contains a large number of simple cores (execution units) under
the control of a smaller number of instruction schedulers. In the TX2 GPU, for
example, 256 cores are organized into warps of 32 cores that are controlled by
a single instruction scheduler. The warps are organized into two two streaming
multiprocessors (SM) with 128 cores each. All the cores in a warp perform the
same operation at the same time, so the code must exhibit a high degree of data
parallelism. Larger NVIDIA GPUs use the same basic structure, but the number of
cores and SMs differ among GPU models. NVIDIA GPUs also have data-movement
engines called copy engines whose role is to copy data between GPU memories and
the main memory of the computer.
CUDA programs express data-parallel computations that can be effectively ex-
ecuted on a GPU using an abstraction called a kernel. Consider, for example, the
8
conversion of an integer array into a floating point array. For example, a simple C
implementation might look like
void int16ToFloat(int16_t* in, float* out , int n) {
int i;
for (i=0; i<n; i++) out[i] = (float) in[i];
}
The CUDA implementation consists of a kernel that operates on a small amount of
data, here one input element and one output element, and a simultaneous invocation
of these kernels on entire arrays. The invocation of the kernel specifies the number
of thread-blocks and the size of each block, here 256 threads. Since all the threads
that run the kernel perform the same operations, the code must ensure that threads
that were allocated work outside the actual extent of the arrays do nothing.
__global__ void
cudaInt16ToFloat(int16_t *gin , float *gout , int n) {
int i = blockDim .x * blockIdx .x + threadIdx .x;
if (index < n) gout[i] = (float) gin[index];
}
void myCudaFunction(int16_t *gin , float *gout , int n) {
...
cudaInt16ToFloat <<<n+255/256,256>>>( gin , gout , n);
...
}
The quantity blockDim.x tells the kernel the number of threads in each block, here
256; threadIdx.x tells the code the index of the particular running thread in the
block; blockIdx.x tells the code the index of the thread block that the thread
belongs to. These quantities allow the thread to determine which pieces of data it
needs to operate on.
CUDA code assumes that pointers refer to memory that the GPU can access
directly. On some platforms the GPU cannot access main memory, so C code running
on the CPU must ask the GPU to allocate memory for arrays and to transfer data
back and forth, as follows.
void myCpuFunction(int16_t* cin , float* cout , int n) {
...
cudaMalloc(&gin , n*sizeof(int16_t )); /* allocates GPU memory */
cudaMalloc(&gout , n*sizeof(float) );
cudaMemcpy(gin , cin , n*sizeof(int16_t), cudaMemcpyHostToDevice );
myCudaFunction(gin , gout , n);
cudaMemcpy(cout , gout , n*sizeof(float), cudaMemcpyDeviceToHost );
...
}
We can now list the guiding principles in our GPU implementation and explain
how the GPU implementation differs from the CPU implementation.
• The GPU code also aggressively reuses allocated memory and in general does
not release it. The same principle is applied to memory on both the CPU and
the GPU.
9
• Similarly, the GPU code precomputes and stores vectors that are used repeat-
edly.
• Most importantly, the GPU code minimizes data movement between the mem-
ories of the CPU and the GPU. In early stages of this research, the code moved
data back and forth between memories during the processing of a single win-
dow of RF samples. The code performed poorly. Profiling showed that the
poor performance was caused by excessive data transfers. We subsequently
restructured the code so that every window of RF samples that is processed is
copied to the GPU, processed completely there, and only the final results are
transferred back to main (CPU) memory.
• We use the NVIDIA high-performance FFT library, cuFFT. In principle, its
interface is similar to that of FFTW, including the splitting of the planning
phase from the execution phase, but the actual interface is different. The
cuFFT library also includes an FFTW compatibility layer, but it operates on
arrays in main memory, not in GPU memory (to allow it to be used as a plug-
in replacement for FFTW, requiring no CUDA code). We initially used this
layer, but this caused poor performance due to back-and-forth data transfers,
so we switched to the cuFFT native interface, which expects pointers to refer
to GPU memory.
• We use a library of CUDA collective operations called CUB [9] to compute
inner products, which are part of the summary data structure that the signal
processing code outputs for every cross-correlation operation. Reductions,
like sums and inner products, are nontrivial to implement well in CUDA.
One approach to this difficulty is to rely on a library of reductions, like those
available in NVIDIA’s cuBLAS library. We used this approach initially, but
this prevented us from fusing multiple reductions on the same data, namely
the computation of wc, q, and pc (the demodulation phase also suffered from
the same problem). CUB allows CUDA programs to express such reductions
in a simple way; CUB itself takes care of most of the implementation details.
This allowed us to fuse these reductions and to improve performance.
• We still use cuBLAS, mainly find the maximum element in an array. In our
algorithm, this computation cannot be fused with other operations, so using
CUB for this would not have any advantage over simply calling the appropriate
cuBLAS function.
The CUDA implementation does not batch multiple cross correlation. This is pos-
sible, but the performance is good enough without this optimization.
The CUDA implementation of the code is portable and has been tested on a
number of (very) different GPU models belonging to different NVIDIA architec-
tures. During the development process, performance was mostly assessed on a
GeForce 1050 GPU in a desktop computer.
10
5 Experimental Evaluation
This section presents our experimental evaluation of the effectiveness of GPUs for
our task, in terms of both performance and energy efficiency.
5.1 Methodology (Test Data)
To test the codes, we modified the CPU-based DSP C code so that it stores all its
inputs and outputs in files. We then ran the ATLAS base station code in an ad-hoc
mode (that is, not as part of a localization system) on a computer connected to a
USRP B210 sampling radio and configured the base station to detect a tag that
was present in the room. This produced files that contained the RF samples that
were processed in both searching and tracking mode, inputs that represent filter
coefficients and the signal to correlate with, and the outputs of the signal-processing
algorithms.
Next, we wrote a C program that reads these files, calls the signal processing
routines on the recorded data, measures their running time and optionally the power
consumption of the computer and its components, and stores the results in files. The
program can use the recordings in both single-code single-RF-window mode and in
batch mode that processes many codes in one call. The former is typical of tracking
mode and the latter of searching mode. The program checks that the returned
results are identical, up to numerical rounding errors, to those returned by the full
base station run that detected the tag correctly. This ensures that all the results
that we report represent correct executions of the algorithms. The code then stores
the running times and the power measurements, if made, to log files.
We also tested that the new CUDA-based code works correctly when called from
Java through the JNI interface and detects transmissions from tags and their arrival
times. This test was performed on the Jetson TX2 computer described below and
the same URSP B210 radio.
5.2 Platforms
We evaluated the code on several platforms using both the CPU code and the GPU
code.
Our baseline is a small form-factor desktop computer, representative of those
currently used in ATLAS base stations, with an Intel i7-8700T CPU. This CPU has
6 physical cores running at clock frequencies between 2.4 and 4 GHz and thermal
design power (TDP) of 35W. This CPU was launched in Q2 2018 and is fabricated in
a 14 nm process. The computer ran Linux kernel version 5.3. We compiled the code
using GCC version 7.5. Both our code and FFTW version 3.3.8 were compiled using
the optimization options that are built into FFTW. The code that was produced
ran slightly faster than code compiled with only -O3 -mtune=native.
Our main target is a low-power Jetson TX2 computer [2, 4], which has a 256-
core NVIDIA Pascal GPU, four ARM Cortex-A57 cores and two ARM Denver2
cores, launched in Q2 2017 using a 16 nm process. The Cortex-A57 cores were
designed by ARM and the Denver2 cores were designed by NVIDIA for higher
single-threaded performance; both use the same 64-bit ARMv8 instruction set. It
11
Mode Name Denver2 Cores A57 Cores GPU Frequency
Max-Q — 4× 1.2 GHz 0.85 GHz
Max-P All 4× 1.4 GHz 4× 1.4 GHz 1.12 GHz
Max-P ARM — 4× 2.0 GHz 1.12 GHz
Max-P Denver 2× 2.0 GHz — 1.12 GHz
Max-N 4× 2.0 GHz 4× 2.0 GHz 1.30 GHz
Table 1: Standard power modes on the Jetson TX2.
also has 8 GB of shared memory with 59.7 GB/s memory bandwidth. This computer
ran Linux kernel 4.9.140-tegra. Both our code and FFTW were compiled using nvcc
version 10.0.326 and gcc version 7.4.0. We used the CUDA library version 10.0.130,
CUB version 1.8.0, and FFTW version 3.5.7.
The power-vs-performance profile of the TX2 can be adjusted by turning cores
on or off and by changing their clock frequency. NVIDIA defined several standard
modes, which we use below in our tests. Table 1 describes these modes. The nominal
TDP of the TX2 ranges from 7.5 W for the highest power efficiency mode, to 15 W
for the highest performance modes. Both the TX2 module and the motherboards
include power sensors that we use to measure the power consumption directly in our
tests.
We also ran the GPU code on two additional platforms. One is an NVIDIA
GeForce 1050 GTX GPU. This GPU uses the Pascal architecture, 640 cores running
at 1.455 GHz, and 2 GB of RAM. The TDP is 75 W. It was plugged into a desktop
running Windows 10 with a quad-core Intel i5-6500 CPU. The last GPU platform
that we used is an NVIDIA Titan Xp GPU. This GPU also uses the Pascal archi-
tecture and has 3840 cores running at 1.582 GHz. It has 12 GB of memory and
a high-bandwidth memory interface. The thermal design power is 250 W. It was
plugged into a server with a 10-core Intel Xeon Silver 4114 CPU.
5.3 Results
Figure 1 shows the performance of our C implementation on the baseline platform,
which has an Intel i7-8700T CPU. We present the performance in terms of the ratio
of processing time per pattern relative to the length of the RF window. That is, if
the code takes 1 s to process one 100-ms window of RF samples and to correlated the
demodulated signal with 16 different code patterns, then we report the performance
as (1/16)/0.1 = 0.625. A ratio of 1 implies that the base station can search for one
tag continuously, that searching for 2 tags would drop 50% of the RF samples, and
so on. A ratio of 0.1 implies that the station can search continuously for 10 tags
without dropping any RF sample, and so on. Lower is better.
The results on one core (Figure 1 left) show that the performance per pattern
improves significantly when we process multiple patterns in one window of RF sam-
ples (which is how the experiment was structured, since this is typical given how
ATLAS systems are usually configured). This is mostly due to the amortization of
the cost of demodulation over many patterns. The graph on the right in Figure 1
that using 2 or 3 cores improves performance relative to using only one core, but
12
1  2  4  8  16 32 64 128 256
number of patterns
0
0.1
0.2
0.3
0.4
0.5
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
i7-8700T CPU (1 core)
searching
tracking
1 2 3 4 5 6
number of cores
0
0.05
0.1
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
i7-8700T CPU
searching (16 patterns)
tracking (4 patterns)
Figure 1: The performance of the DSP code on one CPU core (left) and its speedup
on multiple cores. The vertical bars show the miniumum and maximum values over
10 experiments, and the actual data points are median results of the 10 experiments.
The number of RF windows is 10 in the searching experiments and 100 in the
searching experiments.
the improvement is far from dramatic or linear. Using 4 or more cores actually
slows the code down relative to 2 or 3 cores. The parallelization in the CPU code is
only within FFTW and it does not appear to be particularly effective in this code,
perhaps due to the length of the FFTs and to an Amdahl-type bottleneck.
Performance on the TX2 is excellent on the GPU but poor on the CPU, as shown
in Figure 2. Our CUDA code running on the TX is about 4.3 times faster than the
single-core i7 code and about 3 times faster than the i7 multicore runs. However,
even at the highest performance mode, the TX’s CPU cores perform about 4 times
worse than the i7. We also measured the power consumption of the TX2 while it
was running our code. The results, shown in Figure 3, indicate that when running
the GPU code, the GPU is the largest power consumer, but the memory and other
parts of the system-on-chip (most probably the memory interface) consume a lot of
power, about 50% of the total. The CPU and IO interfaces also consume power, but
not much. In the C-code runs, the GPU is essentially off; the CPU, memory, and
system-on-chip are the largest power consumers. The graph on the right in Figure 3
shows that the CUDA code is about 10 times more energy efficient than the C code
running on the CPU, for the same task.
Figure 4 shows that our CUDA code is also very effective on desktop and server
GPUs. A low-end GPU 12.8 times faster than a single x86_64 desktop core that is
2 years newer. A server GPU is 51.4 times faster than the desktop CPU.
6 Related Work
Alawieh et al. [1] and Hendricks et al. [7] compare the performance of several types
of compute nodes, including GPUs, CPUs, and FPGAs, in the context of RF ToA
estimation, with application to a location estimation system called RedFIR. These
papers are, therefore, closely related to our work, in the sense that the workload is
13
1  2  4  8  16 32 64 128 256
number of patterns
0
0.5
1
1.5
2
2.5
3
3.5
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
TX2 CPU (1 core)
Max-Q
Max-P ARM
Max-P Denver
Max-N
1  2  4  8  16 32 64 128
number of patterns
0
0.05
0.1
0.15
0.2
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
TX2 GPU
Max-Q
Max-P ARM
Max-P Denver
Max-N
Figure 2: Searching performance on the Jetson TX2 on both the ARM cores (left)
and the GPU (right) under four standard power configurations.
TX2 Power Consumption
CPU Max-Q
CPU Max-P ARM
CPU Max-P Denver
CPU Max-N
GPU Max-Q
GPU Max-P ARM
GPU Max-P Denver
GPU Max-N
0
5
10
15
pe
ak
 p
ow
er
 (w
)
GPU
CPU
RAM
SOC
IO
CPU Max-Q
CPU Max-P ARM
CPU Max-P Denver
CPU Max-N
GPU Max-Q
GPU Max-P ARM
GPU Max-P Denver
GPU Max-N
10-2
10-1
100
Approximate TX2 Energy Consumption
Figure 3: Power consumption during searching tasks (left), broken down by system
component, and approximate total energy consumption during searching with 16
patterns, normalized to the largest energy expenditure (right). Both graphs show
the data for searching on either the CPU or the GPU of the Jetson TX2 and under
four standard power configurations. The rated accuracy of the power sensors is 2%
for values above 200 mW and 15% for smaller values.
14
1  2  4  8  16 32 64 128
number of patterns
0
0.02
0.04
0.06
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
GeForce GTX 1050 GPU
searching
tracking
1  2  4  8  16 32 64 128
number of patterns
0
0.005
0.01
0.015
0.02
tim
e 
pe
r p
at
te
rn
 re
la
tiv
e 
to
 re
al
 ti
m
e
Titan Xp GPU
searching
tracking
Figure 4: The performance of two desktop GPUs, a low-end (and somewhat old)
GeForce GTX 1050 and a high-end Titan Xp.
very similar. In particular, one of the platforms that Hendricks et al. [7] evaluate
is the Jeston K1, a predecessor of the Jetson TX2 that we evaluate in this paper.
However, there are also significant differences between our work and the work on
high-performance energy-efficient ToA estimation for RedFIR. First, RedFIR per-
forms real-time signal processing, whereas ATLAS buffers samples and performs
DSP on these buffered samples using a priority scheduler. Thus, ATLAS processing
benefits from high throughput and does not require low or fixed latency, making it
easier to achieve high performance. Second, RedFIR signals are phase modulated
and the correlation is performed on complex baseband samples, whereas ATLAS sig-
nals are usually frequency modulated, leading to a fairly different signal-processing
sequence (Leshchenko and Toledo provide more details on the differences [8]). Third,
RedFIR does not rely on the periodic nature of tags’ transmissions, requiring it to
correlate incoming data with all the stored patterns, not only with one pattern that
is associated with a transmission at that particular time; this is similar to how our
code operates in acquisition (searching) mode, not in the more efficient tracking
mode.
Not much else has been published on high-performance and GPU implemen-
tations of ToA estimation. Obviously, much work has been done on evaluations of
GPUs in the broader area of signal processing, including both one-dimensional (time
dependent) RF and audio signal processing and multi-dimensional signal process-
ing on images and antenna or microphone arrays. For example, Belloch et al. [3]
describe the use of a multi-GPU system localization system based on a massive mi-
crophone array. Kim et al. describe a prototype of an acoustic location-estimation
system in which GPUs are used both for complex baseband cross correlation and
for particle-filter location-estimation; they report results with both a Jeston TX1,
another predecessor of the TX2, and with a more powerful Tesla K40c GPU.
Several authors describe high-performance FFT implementations on GPUs [6,
10]. These papers, including the two cited, often claim higher performance than that
of cuFFT, the FFT library developed by NVIDIA. Howver, there is no production-
quality alternative to cuFFT. In addition most of these papers are about a decade
15
old and we assume that some of the innovations described in them have been imple-
mented since in cuFFT as well. Střelák and Filipovič [12] investigate how to best
use cuFFT and provide both informal advice and a software tool that can guide
users to invocations that achieve high performance. Our code follows their informal
advice.
Yang et al. [17] describe the pitfalls of implementing real-time algorithms on
GPUs and suggest techniques to avoid these pitfalls. Our implementation avoids
most of the pitfalls that they describe, essentially because the ATLAS base station
scheduler schedules one DSP computation at a time, avoiding concurrency pitfalls.
This is possible partially thanks to aggressive buffering of the RF samples, which
permits flexibility in scheduling. The price that the ATLAS system pays for this,
namely a delay of up to 30 s in computing localizations, has no significance for
current applications of ATLAS (but may have some implications for applications
that depend on triggering environmental manipulations when an animal crosses a
particular geographic boundary).
7 Discussion and Conclusions
We have shown that by implementing the DSP functionality of an RF time-of-arrival
transmitter localization system in CUDA, we can improve the acquisition (searching)
throughput of the system by a factor of 4 while reducing power consumption by a
factor of 3 or so relative to a baseline single-core C code, even though the C code
has been carefully optimized. Table 5, which summarizes the characteristics of our
test platforms (as well as of a few newer platforms) show that higher-end GPUs can
improve throughput dramatically higher, at the cost of higher power consumption,
and sometimes also higher cost. The throughput of tracking modes also improves
on GPU platforms.
We acknowledge that our baseline code does not exploit effectively multicore
CPU platforms, and that a careful parallel multicore implementation, perhaps in
OpenMP, can probably improve the performance of the CPU code on multicore
CPUs. This implies that our results should not be taken as a general comparison
between the performance and power-performance ratio of GPUs versus CPUs, but as
a comparison that is relevant to signal processing tasks similar to ours. We note that
our CPU implementation does use a (high-quality) parallel multicore FFT library;
this alone does not deliver good parallel speedups, perhaps due to the modest size
of the tasks. We conclude that although multicore parallelization of this code is
possible, it does not appear to be a simple parallelization task.
The CUDA code is not particularly complicated, and our subjective feeling is
that achieving high performance using CUDA programming and a GPU is more
effective for this type of signal-processing tasks than implementing parallel multi-
core algorithms on CPUs.
We from this study draw two operational conclusions that operators of ATLAS
systems should follow. One is that ATLAS base stations that run on grid power
and are not power constrained should be equipped with desktop-type computers
with GPUs that run CUDA. Because NVIDIA GPUs for desktop computers are
invariably add-on cards, it is possible to equip base stations initially with computers
16
Device Launch Cores Fab W USD tput
i7-8700T Q2 2018 6× x86 14 nm 35+ 1000 6
Jetson TX2 Q2 2017
6× ARM
+256
16 nm 7.5–15 1000 26
Jetson Nano Q1 2019
4× ARM
+128
16 nm 5–10 750
Jetson Xavier Q3 2018
8× ARM
+512
12 nm 10–30 —
GeForce GTX 1050 Q2 2016 +384 14 nm 75+ 110+ 77
GeForce GTX 1650 Q3 2019 +896 12 nm 75+ 200+
Titan Xp Q2 2017 +3840 16 nm 250+ 1200+ 315
Figure 5: The main characteristics of the four platforms that we evaluated in this
paper, as well as of several newer options that might be more appropriate for de-
ployment. The cores column shows the number and architecture of CPU cores and
the number of GPU cores (without showing their architecture, which also affects
performance). The 5th column shows the TDP of the platform, either the overall
power consumption or, if marked by a +, of only the device itself. The cost in USD
is only indicative, and again shows either the total system cost or, when marked by
a +, the cost of the device. The rightmost column shows the throughput, defined as
the number of codes (tags) that can be searched for without dropping any RF sam-
ples, assuming batches of 128 and windows of 100 ms each; this also assumes that
only 50% of the time is devoted to searching, the rest to tracking. The performance
of the i7 processor assumes that only one of the six cores are used; see text.
17
without GPUs and to add GPUs as more tags are deployed. The initial purchase
should, however, allow adding a GPU in terms of expansion slots, physical space in
the enclosure, and a large-enough power supply. The relatively low-cost of desktop
GPUs implies that the added cost of the GPU (and of the enclosure and power
supply that can support it) is moderate.
The second operational conclusion is that ATLAS base stations that rely on
wind and solar power harvesting should use Jetson-series platforms with ARM pro-
cessors and built-in GPUs. As Table 5 shows, there are several options to choose
from (at least in principle; see below). This allows operators to choose a model that
best fits their power budget and their throughput requirements. Unfortunately,
commercially-available Jetson computers are about as expensive as high-end desk-
tops with a desktop-class GPU (perhaps because they do not sell in large numbers),
and the latter combination offers higher throughput. This is the reason that base
stations connected to the power grid should use conventional desktops with GPUs:
these offer both higher throughput and better price-performance ratios than Jetson
computers.
From the software-maintenance point of view, it is unfortunate that we cannot
compile the CUDA code into C (or better yet, into C with OpenMP directives),
because we now have to maintain two separate DSP codes for ATLAS.
Acknowledgements We gratefully acknowledge the support of NVIDIA Corpo-
ration with the donation of the Jetson TX2 used for this research. This study was
also supported by grants grants 965/15, 863/15, and 1919/19 from the Israel Science
Foundation.
References
[1] Mohammad Alawieh, Maximilian Kasparek, Norbert Franke, and Jochen
Hupfer. A high performance FPGA-GPU-CPU platform for a real-time locat-
ing system. In Proceedings of the 23rd European Signal Processing Conference
(EUSIPCO), pages 1576–1580, Aug 2015.
[2] Tanya Amert, Nathan Otterness, Ming Yang, James H. Anderson, and
F. Donelson Smith. GPU scheduling on the NVIDIA TX2: Hidden details
revealed. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS),
pages 104–115, 2017.
[3] Jose A. Belloch, Alberto Gonzalez, Antonio M. Vidal, and Maximo Cobos.
On the performance of multi-GPU-based expert systems for acoustic localiza-
tion involving massive microphone arrays. Expert Systems with Applications,
42:5607–5620, 2015.
[4] Dustin Franklin. NVIDIA Jetson TX2 delivers twice the in-
telligence to the edge, March 2017. NVIDIA Developer Blog,
https://devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge.
18
[5] Matteo Frigo and Steven G. Johnson. The design and implementation of
FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Pro-
gram Generation, Optimization, and Platform Adaptation”.
[6] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John
Manferdelli. High performance discrete Fourier transforms on graphics proces-
sors. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC),
pages 1–12, Nov 2008.
[7] Arne Hendricks, Thomas Heller, Andreas Schäfer, Max Kasparek, and Dietmar
Fey. Evaluating performance and energy-efficiency of a parallel signal correla-
tion algorithm on current multi and manycore architectures. Procedia Computer
Science, 80:1566–1576, 2016. Part of a special issue devote to papers from the
2016 International Conference on Computational Science (ICCS).
[8] Andrey Leshchenko and Sivan Toledo. Modulation and signal-processing trade-
offs for reverse-GPS wildlife localization systems. In Proceedings of the European
Navigation Conference (ENC), pages 154–165, 2018.
[9] Duane Merrill. Cub (cuda unbound) library version 1.8.0, 2018.
A library of CUDA collective primitives; Available online at
https://nvlabs.github.io/cub/.
[10] Sayantan Mitra and Ashok Srinivasan. Small discrete Fourier transforms on
GPUs. In Proceedings of the 11th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing, pages 33–42, May 2011.
[11] Saeed Shojaee, Johannes Schmitz, Rudolf Mathar, and Sivan Toledo. On the
accuracy of passive hyperbolic localization in the presence of clock drift. In
Proceedings of the IEEE International Symposium on Personal, Indoor, adn
Mobile Radio Communication (PIMRC), pages 1–6, October 2017.
[12] David Střelák and Jiří Filipovič. Performance analysis and autotuning setup
of the cuFFT library. In Proceedings of the 2nd Workshop on Autotuning and
Adaptivity Approaches for Energy-Efficient HPC Systems (ANDARE). ACM,
2018. 6 pages.
[13] Sivan Toledo, Oren Kishon, Yotam Orchan, Yoav Bartan, Nir Sapir, Yoni Vort-
man, and Ran Nathan. Lightweight low-cost wildlife tracking tags using inte-
grated tranceivers. In Proceeings of the 6th Annual European Embedded Design
in Education and Research Conference (EDERC), pages 287–291, Milano, Italy,
September 2014.
[14] Sivan Toledo, Oren Kishon, Yotam Orchan, Adi Shohat, and Ran Nathan.
Lessons and experiences from the design, implementation, and deployment of
a wildlife tracking system. In Proceeings of the IEEE International Conference
on Software Science, Technology and Engineering (SWSTE), pages 51–60, Beer
Sheva, Israel, June 2016.
19
[15] Sivan Toledo, Yotam Orchan, David Shohami, Motti Charter, and Ran Nathan.
Physical-layer protocols for lightweight wildlife tags with Internet-of-things
transceivers. In Proceedings of the 19th IEEE International Symposium on
a Wolrd of Wireless, Mobile, and Multimedia Networks (WOWMOM), pages
1–4, June 2018. work-in-progress paper, to appear.
[16] Adi Weller-Weiser, Yotam Orchan, Ran Nathan, Motti Charter Anthony J.
Weiss, and Sivan Toledo. Characterizing the accuracy of a self-synchronized
reverse-GPS wildlife localization system. In Proceeings of the 15th ACM/IEEE
International Conference on Information Processing in Sensor Networks
(IPSN), pages 1–12, Vienna, Austria, April 2016.
[17] Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Ander-
son, and F. Donelson Smith. Avoiding pitfalls when using NVIDIA GPUs for
real-time tasks in autonomous systems. In Sebastian Altmeyer, editor, Proceed-
ings of the 30th Euromicro Conference on Real-Time Systems (ECRTS), pages
20:1–20:21, 2018.
20
