A DSP ACCELERATION FRAMEWORK FOR SOFTWARE-DEFINED RADIOS ON X86_64
Georgios Georgis, Alexios Thanos, Marcin Filo, Konstantinos Nikitopoulos
5G Innovation Centre, Institute for Communication Systems, University of Surrey, Guildford, UK
ABSTRACT
This paper presents a DSP acceleration and assessment framework
targeting SDR platforms on x86_64 architectures. Driven by the
potential of rapid prototyping and evaluation of breakthrough
concepts that these platforms provide, our work builds upon the
well-known OpenAirInterface codebase, extending it for advanced,
previously unsupported modes towards large and massive MIMO such
as non-codebook-based multi-user transmissions. We then develop
an acceleration/profiling framework, through which we present
fine-grained execution results for DSP operations. Incorporating the
latest SIMD instructions, our acceleration framework achieves a
unitary speedup of up to 10×. Integrated into OpenAirInterface, it
accelerates computationally expensive MIMO operations by up to 88%
across tested modes. Besides resulting in a useful tool for the
community, this work provides insight on runtime DSP complexity and
the potential of modern x86_64 systems.
Index Terms— SDRs, parallel processing, SIMD, MIMO
1. INTRODUCTION
Software-Defined Radios (SDRs) have been nothing short of a
breakthrough in making wireless systems research accessible.
Replacing an inflexible microchip with a programmable platform
brought versatility and modularity via commercial off-the-shelf
(COTS) devices in the form of 2×2 transceiver slices [1, 2, 3].
Their modular design allows platform components to be added
transparently and with reduced effort. Combined with in-house
software/firmware development, one can then conduct otherwise
prohibitively expensive research into wideband and diverse
transmissions, multiple-input multiple-output (MIMO) configurations,
new antenna designs and DSP optimization.
Even though baseband complexity on SDRs has been studied in the
past, it has so far mostly been addressed on a theoretical basis,
considering application-specific processors and architectures that
are somewhat outdated for future wireless standards [4, 5]. Hybrid
programmable architectures are gaining ground due to their
reconfigurability and shorter development cycle [6]. Especially with
the arrival of the AVX512 instruction set, which enables 512-bit
Single Instruction Multiple Data (SIMD) support in General-Purpose
Processors (GPPs), the latter have become an attractive and powerful
option for rapid prototyping [7]. In this direction, a number of
GPP-based LTE stack solutions have surfaced, each one with a
different roadmap and business model [8, 9, 10, 11]. Among those,
OpenAirInterface (OAI) is the most feature-rich (i.e., open-source,
functional split support, non-realtime emulation, UE implementation,
core network). OAI can be used to deploy a low-cost 3GPP cellular
network using COTS SDRs and standard Linux-based PCs. OAI’s large
developer community and the effort of multiple partners render it
the most popular and advanced publicly available SDR platform.
Despite OpenAirInterface leading the open-source competition,
important functionalities are still incomplete or missing; e.g.,
multilayer and non-codebook-based transmission schemes in the case
of 4G, with 5G lacking even a bidirectional link. Of similar
importance for a GPP-based Radio Access Network (RAN) is supporting
advanced SIMD instructions. OAI takes limited advantage of newer
instruction sets, as these were unsupported by the hardware
available at its original release, and it has only recently begun
considering such features to improve performance [12]. Assessment of
these aspects in GPP SDRs has so far been very limited and
restricted to the inherently supported modes [13, 14, 15]. This has
left “uncharted territory” between plain software execution and
hardware-assisted offloading, i.e., a feature that up to now has
been considered inevitable for all computationally expensive
operations. Evaluating GPP performance in the context of SDR
acceleration is thus crucial, as a means to explore this territory.
This work was supported through UK’s DCMS Testbeds and Trials
projects “AutoAir” and “AutoAir Phase 2”. The Authors would also like
to thank the members of the University of Surrey’s 5G Innovation Centre
(http://www.surrey.ac.uk/5GIC) for their support.
Aiming to address these shortcomings, in this work we present:
a) an SIMD acceleration toolbox for DSP operations, supporting
AVX2 and AVX512 instructions, b) a software-accelerated linear
precoder/detector supporting large MIMO transmissions, c) an
acceleration/assessment framework for Orthogonal Frequency Division
Multiplexing (OFDM) operations and d) a fine-grained profiling
framework. To the best of our knowledge, it is the first time
in the open literature that an SDR acceleration framework has been
presented and profiled to such an extent. We also note our
contributions enabling advanced, previously unsupported features
within the stack of OpenAirInterface: dynamic beamforming and
multi-user MIMO (MU-MIMO) for time-division duplexing (TDD). Section 2
describes our acceleration/profiling framework, Section 3 provides
experimental results and Section 4 concludes.
2. DSP ACCELERATION FRAMEWORK
Our acceleration and profiling framework is transparent to the un-
derlying 3GPP release, the transmission mode and the system’s
dimensionality. To highlight the impact of our contributions, we
target MIMO operations, extending OAI with features previously
unsupported. We added all necessary signaling for multi-layer
non-codebook based precoding (changes spanning up to the Radio
Resource Control (RRC) protocol). We introduce a new sched-
uler supporting up to four users over the same resources in both
uplink (UL) and downlink (DL). Finally, we introduce support for
the necessary baseband operations: a) channel estimation based on
demodulation (UE-specific) reference signals (DMRS), b) modula-
tion/demodulation, c) rate (de)matching and d) precoding/detection.
Figure 1 illustrates the physical layer (L1) flow, highlighting
optimized/extended features. Sub-blocks indicate the granularity of
our profiling framework and grey blocks depict modules benefiting
by our acceleration framework. Targeting reusability, we currently
avoid focus on standard-specific error correction.
[Figure 1: block diagram of the L1 flow. Cell Uplink (UL), from
radios to L2: OFDM Demodulation, Channel Estimation, MIMO Detection,
QAM Demodulation, Descrambling, Rate De-matching, Deinterleaving,
Channel Decoder. Cell Downlink (DL), from L2 to radios: Channel
Encoder, Interleaving, Rate Matching, Scrambling, QAM Modulation,
Beam Precoding, OFDM Modulation. Additional blocks: Reciprocity
Calibration, Weight Calculation. Legend: Extended; Optimized +
Extended; MIMO Precoding Operations.]
Fig. 1: Extended and optimized procedures in OAI’s L1 flow.
2.1. SIMD Acceleration
OAI relies on fixed-point (i.e., integer) arithmetic throughout its L1
procedures [16], coded via low-level intrinsics for 128-bit SIMD
(SSE) instructions (apart from Fourier transforms which use AVX2).
Complemented by 16 bits per fixed-point I/Q sample, this allows
processing up to 4 complex values within one 128-bit register. We
introduce an SIMD acceleration framework which: a) supports SSE,
AVX2 and AVX512 datapaths, b) focuses on MIMO operations, c)
targets reusable DSP functions and d) supports series lengths N not
evenly divisible by the SIMD vector size vs (i.e., vs ∤ N).
OpenAirInterface’s physical layer DSP functions operate on
complex time series between vectors and scalars. Our framework
maintains the use of intrinsics for finer control over automatic com-
piler vectorization [17]. Regarding scalar/vector-vector addition and
scalar-vector multiplication, we extend the vector width and the loop
stride to enable the wider register instruction sets. In these cases,
most of the SSE intrinsics have an AVX2/512 counterpart with a
similar set of micro instructions.
OAI’s complex multiplication and rotation though revolve
around the _mm_madd_epi16 intrinsic (i.e., the pmaddwd instruction)
which horizontally multiplies and adds two 128-bit vectors across
16-bit boundaries. Just to use this instruction, OAI’s complex
rotation performs non-contiguous input memory accesses, creating a
significant penalty which is aggravated in wider SIMD instructions.
Since OAI’s version creates 32-bit intermediate results, two 128-bit
vectors are required to hold 4 complex numbers. We target
elimination of horizontal operations, taking into account OAI’s
16-bit representation and shifts. By initially replicating real and
imaginary elements, we merge the pmaddwd, psrad, punpckldw and
packssdw operations into the pshufb, pmulhrsw and paddsw
instructions which maintain a 16-bit output. This generates fewer
and lower latency instructions [18], having significant impact for a
loop that is executed (Nofdm / vs) · (Ncellant)^2 · Nsfsyms times
per subframe (e.g., the multadd_cpx_vector and apply/remove_7_5_kHz
functions written similarly [19]). Nofdm denotes the number of
samples in an OFDM symbol, vs the SIMD vector size in samples,
Ncellant the number of cell antennas and Nsfsyms the subframe’s
length in OFDM symbols.
Due to a non-fixed arithmetic shift and the requirement for 32-bit
accumulation, the above optimizations are not applicable to OAI’s
complex dot product. In that case and for the final AVX2 summation
we employ the permute2x128_si256 intrinsic, extracting the
output via extract_epi32. In the case of AVX512, we employ
reduce_add_epi32 and packs_epi32, manually composing the 32-bit
output. We also eliminate any leftover MMX 64-bit instructions,
which incur an additional penalty because the MMX state must be
emptied afterwards.
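The 16-bit complex multiplication described above can be modelled in scalar C. This is only a sketch of the arithmetic, not the framework's code: mulhrs16 and adds16 imitate one lane of pmulhrsw and paddsw respectively, and the type cpx16 and all function names are hypothetical. The exact shuffle pattern (pshufb) and the scaling conventions used in OAI are not given in the text, so the Q15 scaling below is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t re, im; } cpx16;   /* one Q15 I/Q sample (hypothetical type) */

/* Scalar model of one pmulhrsw lane: rounded Q15 product with a 16-bit result.
 * Assumes >> on negative values is an arithmetic shift (true for GCC/Clang). */
static int16_t mulhrs16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)(((p >> 14) + 1) >> 1);
}

/* Scalar model of one paddsw lane: saturating 16-bit addition. */
static int16_t adds16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Saturating negation, so that -(-32768) does not overflow. */
static int16_t neg16(int16_t a)
{
    return a == INT16_MIN ? INT16_MAX : (int16_t)-a;
}

/* z = x * y with 16-bit intermediates only. x.re and x.im are each used for
 * both output lanes (the "replicated" operands), while y is used once as
 * (re, im) and once as (-im, re). No horizontal add is needed. */
static cpx16 cmul_q15(cpx16 x, cpx16 y)
{
    cpx16 z;
    z.re = adds16(mulhrs16(x.re, y.re), mulhrs16(x.im, neg16(y.im)));
    z.im = adds16(mulhrs16(x.re, y.im), mulhrs16(x.im, y.re));
    return z;
}

/* Element-wise product of two complex series of length n. */
static void cmul_q15_vector(const cpx16 *x, const cpx16 *y, cpx16 *z, size_t n)
{
    assert(x && y && z);
    for (size_t i = 0; i < n; i++)
        z[i] = cmul_q15(x[i], y[i]);
}
```

Because each partial product is rounded to 16 bits before the addition, results may differ in the least significant bit from a version that accumulates in 32 bits and shifts afterwards.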
2.1.1. Accelerated Zero Forcing precoding/detection
A well-known linear method in large and massive MIMO research
[20, 21], Zero Forcing (ZF) can achieve high spectral efficiency
gains when NBSant ≫ Nlayers (Nlayers being the number of layers).
While considered simple, at least algorithmically, the complex
matrix multiplications and inversions involved give it non-trivial
computational and storage complexity, increasing polynomially
with Ncellant and Nlayers. Our software ZF precoding/detection
module supports AVX2/512 intrinsics, operating in parallel on a
two-dimensional grid of antennas and resource elements. Our module
is highly configurable in terms of NRE (i.e., active resource
elements), supporting up to Ncellant = Nlayers = 16 in this instance.
Both detection/precoding assume distinct channel matrices per
subcarrier. Although Nsfsyms is configurable, channel inversions are
assumed to be performed once per subcarrier and group of OFDM
symbols within a scheduling unit (e.g., 14/28 symbols for 15/30kHz
spacing [22]). Our precoder/detector operates on 16-bit fixed-point
input/output performing single-precision calculations internally.
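For a single subcarrier, ZF detection amounts to x_hat = (H^H H)^(-1) H^H y. The sketch below illustrates this for two layers in plain single-precision C, using a closed-form 2×2 inverse of the Gram matrix. It is an assumption-laden illustration only: the paper does not state which inversion method its module uses, and the type cf32 and all function names are hypothetical. The fixed-point input/output conversion and the SIMD parallelism over antennas and resource elements are omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cf32;   /* single-precision complex (hypothetical type) */

static cf32 cf_add(cf32 a, cf32 b) { return (cf32){a.re + b.re, a.im + b.im}; }

/* a * b */
static cf32 cf_mul(cf32 a, cf32 b)
{
    return (cf32){a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

/* conj(a) * b */
static cf32 cf_conj_mul(cf32 a, cf32 b)
{
    return (cf32){a.re * b.re + a.im * b.im, a.re * b.im - a.im * b.re};
}

/* ZF detection for one subcarrier and two layers.
 * H is n_ant x 2, row-major; y holds n_ant received samples.
 * Returns 0 on success, -1 if the Gram matrix is singular. */
static int zf_detect_2layer(size_t n_ant, const cf32 *H, const cf32 *y, cf32 x_hat[2])
{
    assert(H && y && x_hat);
    float g00 = 0.0f, g11 = 0.0f;            /* real diagonal of G = H^H H */
    cf32 g01 = {0.0f, 0.0f};                 /* off-diagonal, G[1][0] = conj(g01) */
    cf32 z0 = {0.0f, 0.0f}, z1 = {0.0f, 0.0f};   /* matched filter z = H^H y */

    for (size_t a = 0; a < n_ant; a++) {
        cf32 h0 = H[2 * a], h1 = H[2 * a + 1];
        g00 += h0.re * h0.re + h0.im * h0.im;
        g11 += h1.re * h1.re + h1.im * h1.im;
        g01 = cf_add(g01, cf_conj_mul(h0, h1));
        z0 = cf_add(z0, cf_conj_mul(h0, y[a]));
        z1 = cf_add(z1, cf_conj_mul(h1, y[a]));
    }

    /* det(G) is real because G is Hermitian. */
    float det = g00 * g11 - (g01.re * g01.re + g01.im * g01.im);
    if (det <= 0.0f)
        return -1;

    /* G^-1 = (1/det) * [[g11, -g01], [-conj(g01), g00]] */
    cf32 t = cf_mul(g01, z1);        /* g01 * z1 */
    cf32 u = cf_conj_mul(g01, z0);   /* conj(g01) * z0 */
    x_hat[0] = (cf32){(g11 * z0.re - t.re) / det, (g11 * z0.im - t.im) / det};
    x_hat[1] = (cf32){(g00 * z1.re - u.re) / det, (g00 * z1.im - u.im) / det};
    return 0;
}
```

In a setting like the one described, the inverse would be computed once per subcarrier and scheduling unit and then reused for every OFDM symbol, so only the matched filter and the final 2×2 product run per symbol.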
2.1.2. OFDM subframe-based processing
OFDM operations consist of discrete Fourier transforms (dft), an
optional shift of the zero-frequency component to the center of the
spectrum (4G uplink), magnitude normalization and cyclic prefix
removal/addition [22]. Particularly in large MIMO systems, these
operations have a fixed yet significant complexity [23]. We developed
a flexible subframe processing framework with modular dft kernels
chosen at compile time. While dft acceleration libraries do exist,
namely a) FFTW [24], b) Intel’s MKL [25] and c) Intel’s IPP [26],
there has, to the best of our knowledge, been no aggregate
evaluation of them in a 3GPP context for SDRs [12]. Our OFDM
framework allows the runtime definition of physical resource blocks
NRB, slots to schedule, Ncellant, µ for subcarrier spacing [22] and
repetitions for averaging.
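Two of the non-dft OFDM steps listed above, the zero-frequency shift and cyclic prefix addition/removal, are simple buffer manipulations. The following is a minimal sketch of them on 16-bit I/Q samples; the type cpx16 and the function names are hypothetical and do not correspond to OAI's interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { int16_t re, im; } cpx16;   /* one 16-bit I/Q sample (hypothetical type) */

/* Swap the two halves of a length-n spectrum (n even), moving the
 * zero-frequency bin from index 0 to index n/2. */
static void shift_dc_to_center(cpx16 *x, size_t n)
{
    assert(x && n % 2 == 0);
    for (size_t k = 0; k < n / 2; k++) {
        cpx16 t = x[k];
        x[k] = x[k + n / 2];
        x[k + n / 2] = t;
    }
}

/* Cyclic prefix addition: copy the last cp samples of the time-domain
 * symbol in front of it. out must hold n + cp samples. */
static void add_cyclic_prefix(const cpx16 *sym, size_t n, size_t cp, cpx16 *out)
{
    assert(sym && out && cp <= n);
    memcpy(out, sym + (n - cp), cp * sizeof *sym);
    memcpy(out + cp, sym, n * sizeof *sym);
}

/* Cyclic prefix removal: the n useful samples start cp samples into the
 * received symbol, so no copy is required. */
static const cpx16 *remove_cyclic_prefix(const cpx16 *rx, size_t cp)
{
    assert(rx);
    return rx + cp;
}
```

Per subframe, these steps would run once per OFDM symbol and antenna, which is why their cost is fixed for a given configuration yet grows with the number of antennas.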
2.1.3. OAI integration
All enhancements were added via preprocessor definitions allowing
compile-time SIMD choice per function on all x86_64 platforms.
We rewrote OAI’s allocation functions to employ posix_memalign
instead of the deprecated memalign, to align the allocated buffers
according to vector size boundary requirements [17]. Up to now,
OAI only considered data series lengths N to be evenly divisible by
the SIMD vector size (vs | N). While this works in the vast majority
of the SSE cases, it will lead to segmentation faults when vs ∤ N
(e.g., for wider vs). To make our framework more robust, we include
per-function and per-vs length checking. A main loop processes data
up to the largest length that is evenly divisible by vs. Remaining
samples are then copied into an aligned static array whose length
equals vs. The vector pointers are then reinitialized to point into
the array. This guarantees execution of the same intrinsics as those
in the main function loop on an input with aligned boundaries. Our
precoding/detection is integrated via a custom header/wrapper and a
statically linked library. Finally, we note that since all accelerated
dft libraries perform floating-point arithmetic, we developed SIMD
conversion functions (including length checking). We keep a similar
function interface to that of OAI’s dfts and employ preprocessor
definitions within the full L1 stack for compile-time dft library
choice. The latter is also statically linked to the OpenAirInterface
binaries.
[Figure 2: bar chart of speedup (y-axis, 0 to 10) per function:
add_vector16 and shift, sub_cpx_vector16, dot product, complex
conjugate, mult/add cpx (conj) vector, rotate cpx vector. Series: OAI
baseline; SSE, AVX2 and AVX512, each for vs | N and vs ∤ N.]
Fig. 2: Proposed DSP toolset speedup v. OAI’s SSE baseline (red
line, Intel Core i9-7980XE, b | a denotes a divisible by b).
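The tail handling described in Section 2.1.3 can be sketched in portable C as follows. A scalar function stands in for the SIMD kernel so that the control flow is visible without intrinsics. VS, the function names, the zero padding of the scratch vectors and the copy-back of only the valid lanes are assumptions made for illustration, not details taken from the framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VS = 32 };   /* 16-bit lanes per vector, e.g. one 512-bit register */

/* Stand-in for a SIMD kernel: consumes exactly one full vector of VS
 * lanes and performs a saturating 16-bit addition on each lane. */
static void add_sat_block(const int16_t *x, const int16_t *y, int16_t *z)
{
    for (int i = 0; i < VS; i++) {
        int32_t s = (int32_t)x[i] + (int32_t)y[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        z[i] = (int16_t)s;
    }
}

/* z[i] = sat(x[i] + y[i]) for any n, including n not divisible by VS. */
static void add_sat_any_length(const int16_t *x, const int16_t *y,
                               int16_t *z, size_t n)
{
    assert(x && y && z);
    size_t main_len = n - (n % VS);          /* largest multiple of VS <= n */

    for (size_t i = 0; i < main_len; i += VS)
        add_sat_block(x + i, y + i, z + i);  /* full vectors only */

    size_t rem = n - main_len;
    if (rem != 0) {
        /* Aligned static scratch vectors of length VS (not thread-safe). */
        static _Alignas(64) int16_t tx[VS], ty[VS], tz[VS];
        memset(tx, 0, sizeof tx);            /* zero padding: an assumption */
        memset(ty, 0, sizeof ty);
        memcpy(tx, x + main_len, rem * sizeof *x);
        memcpy(ty, y + main_len, rem * sizeof *y);
        add_sat_block(tx, ty, tz);           /* same kernel as the main loop */
        memcpy(z + main_len, tz, rem * sizeof *z);   /* valid lanes only */
    }
}
```

The kernel never reads or writes past the end of the caller's buffers, which is the property that removes the segmentation faults mentioned above. The cost is a few extra copies for the last partial vector.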
2.2. Profiling Framework
Based on OAI’s time-stamping mechanism (Time Stamp Counter,
TSC) for Intel GPPs, we introduce a dedicated performance counter
per L1 DSP function. Since most of these functions are executed
several times within a radio frame, our performance counter also
keeps track of the total number of function calls and subsequently
calculates mean and minimum/maximum execution times. Our
modifications allow multiple levels of granularity, from a single
physical layer operation to aggregate, per-subframe assessment.
Hence, we can, in a streamlined/automated manner, get an accurate
assessment of L1 procedures and explore the effect of different
transmission modes and MIMO dimensions on execution time.
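A per-function counter of the kind described above could look like the following sketch. The structure and function names are hypothetical and are not OAI's. Timestamps are passed in as plain tick values; on x86 they could be read with the __rdtsc() intrinsic from <x86intrin.h>, which is left out here to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-function performance counter, in timestamp ticks. */
typedef struct {
    uint64_t calls;   /* number of recorded executions */
    uint64_t total;   /* sum of all durations */
    uint64_t min;     /* shortest duration seen */
    uint64_t max;     /* longest duration seen */
} perf_counter;

static void perf_reset(perf_counter *c)
{
    c->calls = 0;
    c->total = 0;
    c->min = UINT64_MAX;
    c->max = 0;
}

/* Record one execution delimited by two timestamp readings. */
static void perf_record(perf_counter *c, uint64_t start, uint64_t stop)
{
    assert(c && stop >= start);
    uint64_t d = stop - start;
    c->calls++;
    c->total += d;
    if (d < c->min) c->min = d;
    if (d > c->max) c->max = d;
}

/* Mean duration in ticks, or 0 if nothing has been recorded. */
static double perf_mean(const perf_counter *c)
{
    return c->calls ? (double)c->total / (double)c->calls : 0.0;
}
```

Keeping one such counter per DSP function and summing the totals gives the aggregate view, while the individual counters give the per-operation view.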
3. ASSESSMENT METHODOLOGY AND RESULTS
We note that generalizing functionality (e.g., bandwidths, 3GPP
releases) within a full stack requires significant development
effort; as such, for proof of concept we present full-stack (i.e.,
within the context of OAI) measurements for a bandwidth of 5MHz
(i.e., NRE = 300). Nevertheless, tests and results presented outside
OAI’s context are not bandwidth-limited. We profile execution on
an Intel Core i9-7980XE, i.e., a representative of modern GPPs
supporting AVX512 instructions. The processor operates at 2.6 GHz,
running CentOS 7.6 with Linux kernel 5.3 and 64GB of DDR4 RAM
at 2666 MHz. Intel’s hyperthreading, turbo and low-power
states were all switched off. We employ versions 3.3.3 for FFTW, 19
update 4 for Intel’s libraries and 8.3.1 for the gcc/g++ compilers
(see footnote 1).
3.1. Tests outside OAI’s context
We created unitary C testbenches with runtime-configurable
fixed-point input range, data size and iterations. All functions were
executed for N = 4096 randomized complex samples (i.e., a+jb)
with a, b uniformly distributed in [−1, 1) and execution time
averaged over 10^6 iterations. For our precoder/detector we compare
AVX2/512-optimized execution with that of a plain C model. Our
testing scenarios assume Ncellant ∈ {2, 4, 8, 16} and Nlayers ∈
{2, 4}. We measure subframe-based (i.e., Nsfsyms = 14),
single-threaded execution times, corresponding to NRE = 1272 (i.e.,
20MHz bandwidth [27]). Regarding OFDM assessment, initialization is
performed once at the start of each execution. The term
Tx/Rx subframe respectively corresponds to downlink/uplink only
subframes. We showcase averaged results for Nofdm ∈ {512,
1024, 2048, 4096}, encompassing NRE ∈ {300, 624, 1272,
3240} and NRE ∈ {288, 612, 1272, 3276} at {5, 10, 20, 50}
and {10, 20, 40, 100}MHz bandwidth for 15 and 30 kHz subcarrier
spacing respectively [27]. We note that all enhancements maintain
or surpass the original quantization performance.
1 Compilation flags used: -O3 -march=skylake-avx512
-mtune=skylake-avx512 -mprefer-vector-width=512
-fvect-cost-model=unlimited.
[Figure 3: bar chart of execution time (us, 0 to 500) for Tx/Rx
subframes using the FFTW, MKL, IPP and OAI dfts, in two panels:
15kHz-AVX512 (AVX2 OAI) and 30kHz-AVX512 (AVX2 OAI). Series:
Nofdm = 512, 1024, 2048, 4096.]
Fig. 3: OFDM Rx/Tx subframe execution (NBSant=1).
[Figure 4: execution time (us, logarithmic axis, 10^1 to 10^4) v.
NBSant × Nlayers for 2x2, 4x2, 4x4, 8x2, 8x4, 16x2 and 16x4
(NRE=1272). Series: Detect and Precode, each for plain C, AVX2 and
AVX512.]
Fig. 4: Proposed Detector/Precoder: Plain C v. AVX2/512.
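A testbench of the kind described in Section 3.1 could be structured as in the sketch below: fill a buffer with pseudo-random Q15 complex samples covering [−1, 1), then average the execution time of a kernel over many iterations. Everything here is hypothetical (the generator, the function names and the example conjugate kernel), and clock() is used instead of the TSC for portability, so the resolution is coarser than what the paper's measurements would require.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef struct { int16_t re, im; } cpx16;   /* one Q15 I/Q sample (hypothetical type) */

/* Small deterministic generator (xorshift32); the state must be non-zero. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Q15 code in [-32768, 32767], i.e. a value in [-1, 1). */
static int16_t rand_q15(uint32_t *s)
{
    return (int16_t)((int32_t)(xorshift32(s) >> 16) - 32768);
}

static double q15_to_double(int16_t q) { return (double)q / 32768.0; }

static void fill_random(cpx16 *x, size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) {
        x[i].re = rand_q15(&s);
        x[i].im = rand_q15(&s);
    }
}

typedef void (*kernel_fn)(const cpx16 *in, cpx16 *out, size_t n);

/* Example kernel under test: complex conjugate with saturating negation. */
static void conj_kernel(const cpx16 *in, cpx16 *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].re = in[i].re;
        out[i].im = in[i].im == INT16_MIN ? INT16_MAX : (int16_t)-in[i].im;
    }
}

/* Mean CPU time per call, in microseconds, over iters repetitions. */
static double time_kernel_us(kernel_fn f, const cpx16 *in, cpx16 *out,
                             size_t n, unsigned iters)
{
    assert(f && iters > 0);
    clock_t t0 = clock();
    for (unsigned i = 0; i < iters; i++)
        f(in, out, n);
    clock_t t1 = clock();
    return 1e6 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)iters;
}
```

A fixed seed makes runs repeatable, which helps when comparing two datapaths on identical input.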
Figure 2 shows unitary speedup (and execution penalty in the
cases where vs ∤ N, with N = 4095) for the most characteristic
functions/function groups. Achieved gains are within the sublinear
region as expected. In particular, we observe average AVX2 speedups
of 1.9× and average AVX512 speedups of 2.9−3.9× for functions which
haven’t been significantly redesigned (e.g., add_vector16 and shift,
sub_cpx_vector16). In the case of scalar/vector-vector
multiplication, our optimisations accelerate OAI’s out-of-the-box
execution by 1.7−3.4×, rising up to 4.5−10.3× (using SSE and AVX512
respectively). Penalty for vs ∤ N v. vs | N is generally negligible
apart from complex rotation with AVX2 instructions (6.2× v. 6.8×)
and vector subtraction with AVX512 instructions (1.0× v. 3.9×), the
latter potentially having a small impact only on the user-equipment
side.
Single-antenna results in Fig. 3 show that, as expected, OFDM
subframe execution time nearly doubles with subcarrier spacing.
Intel’s MKL surpasses the performance of all other dft libraries, the
gain being more pronounced in higher bandwidths (e.g., 15% faster
v. IPP, 28% faster v. OpenAirInterface and 50% faster v. FFTW
for Nofdm = 4096 and Tx subframes at 30 kHz spacing). Its
performance is followed by the AVX512-enhanced IPP and OAI’s
libraries, the latter only supporting AVX2 (though still compiled
and tested with the flags listed in this section). Regarding Zero
Forcing MIMO detection/precoding, AVX512 surpasses AVX2 by
1.89× / 1.59× respectively, attaining an order of magnitude benefit
when compared to non-vectorized C code (Fig. 4). Execution time
stays below 600µs even for the challenging 16×4 MU-MIMO case.
[Figure 5: stacked horizontal bars of OAI cell subframe execution
time (us) for 2x2, 4x2, 4x4, 8x2, 8x4, 16x2 and 16x4 MU-MIMO, each
without and with (“opt.”) the proposed optimizations. Left panel, Rx
(0 to 4,500 us): OFDM Demod., SRS Chan. Est., DMRS Chan. Est.,
Detection, Freq. Equalization, LTE iDFT, Demapping, Unscrambling,
Rate Dematching, Turbo Decoding, Deinterleaving, Remaining. Right
panel, Tx (0 to 2,500 us): OFDM Mod., Recipr. Calibration, Weight
Calculation, Turbo Encoding, Interleaving, Rate Matching, Scrambling,
Mapping, Beam Precoding, Remaining.]
Fig. 5: Integrated L1 profiling: multi-user MIMO Rx/Tx subframe
procedures pre/post-proposed optimizations (NRE=300).
3.2. Integrated OAI tests
We measure cell execution (i.e., the most computationally demand-
ing) via OpenAirInterface’s emulation mode. The latter simulates
only the radio equipment and the channel (handled by separate pro-
cesses), the remaining protocol stack running as it would on a re-
altime system. We present single-threaded physical layer results
with any extra workers disabled in the configuration files. We note
that our tests on OpenAirInterface’s current threading architecture
showed marginal gains on Frequency Division Duplexing (FDD)
schemes only. The cell was set to operate at 3.5GHz using Time
Division Duplexing (with 50% uplink/downlink slot allocation) in
monolithic mode (i.e., frontend/baseband processing taking place on
the cell). For single host execution, we configure the cell and user-
equipments’ packet-based interfaces in internal loopback. Following
the initialization of all processes within the stack and the establish-
ment of cell to user equipment connection (via the Random Access
Channel-RACH procedure), we generate bidirectional (i.e., uplink
and downlink) data traffic via iPerf [28]. We gather results over the
course of 60 minutes for the aforementioned MIMO configurations.
Figure 5 presents results integrating our profiling contribu-
tions and our AVX512-accelerated framework (denoted with “opt”,
bars shaded with dashed lines depicting affected operations). We
compare against OAI’s default SIMD datapath, except for Zero
Forcing precoding/detection which employs non-vectorized code
as a baseline (i.e., there was no pre-existing code for this op-
eration). Listed procedures involve uplink (i.e., Rx, Fig. 5-left)
and downlink shared channels (i.e., Tx, Fig. 5-right), for physical
layer operations scheduled on a subframe basis (i.e., Nsfsyms = 14
[22]). “(De)mapping” includes layer (de)mapping and QAM
(de)modulation, while “Remaining” refers to outstanding physi-
cal channels and control/software overheads.
Regarding runtime execution of Rx subframes, MIMO detection
benefits by more than 75% across all tested cases. SRS/DMRS
channel estimation respectively benefit by 47/30%, the latter
affecting total subframe execution significantly more than the
former (up to 1/5 of the total Rx execution time in 16×4
non-optimized MU-MIMO, Fig. 5-left). Remaining operations also
benefit by 80% on average. Total Rx subframe execution speedup
ranges from 1.17× (2×2 MU-MIMO) to 1.95× for 16×4 MU-MIMO,
effectively reducing execution time from 4.4 to approximately 2.2ms
(Fig. 5-left). Tx precoding can take up to half of the total
execution time in high order MU-MIMO (Fig. 5-right). On average, our
optimisations accelerate Weight Calculation by 69% and Beam
Precoding (i.e., weight application) by 48%. Integrating MKL’s dft
libraries and our SIMD conversion functions yields a 29% average
OFDM modulation speedup. Average Tx subframe acceleration is up to
1.6× (NBSant=16).
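As a quick consistency check on the 16×4 Rx figures, and assuming speedup is defined as baseline time over optimized time (the definition is not stated explicitly in the text), the quoted numbers relate as follows:

```latex
\[
S = \frac{t_{\mathrm{base}}}{t_{\mathrm{opt}}}
\quad\Rightarrow\quad
t_{\mathrm{opt}} = \frac{4.4\,\mathrm{ms}}{1.95} \approx 2.26\,\mathrm{ms},
\qquad
1 - \frac{1}{S} = 1 - \frac{1}{1.95} \approx 0.49 .
\]
```

Under that assumption, a 1.95× speedup corresponds to roughly a 49% reduction in execution time, in line with the approximately 2.2ms read from Fig. 5.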
While exceeding the 1ms barrier breaks realtime execution,
staying below it does not by itself guarantee realtime operation.
Generalizing and quantifying all dependencies (front-end, radio and
over-the-air latency, threading architecture) is well beyond the
scope of this work. Overall, we observe channel decoding to be the
most expensive Rx operation, 9× more demanding compared to LTE’s
Turbo encoder (executed on an SSE datapath for OAI). DMRS channel
estimation can be as demanding in 16×4 MIMO, followed by OFDM
procedures and detection/precoding. Results indicate that more
pronounced gains should be expected in higher bandwidths.
4. CONCLUSIONS
This work presented an SIMD acceleration framework for SDRs.
Harnessing AVX512, our contributions speed up unitary OAI SSE
operations by up to 10× and more involved procedures by up to an
order of magnitude compared to non-vectorized code. Following
significant functional extensions to support advanced transmission
schemes, our integrated contributions enhance OAI subframe
execution by up to 1.95×. Our regression testing/profiling framework
facilitates assessment and highlights the potential/limits of modern
x86_64 systems. Future work will focus on exploiting
hardware-assisted offloading and multicore execution on top of our
SIMD framework, as well as on channel coding.
5. REFERENCES
[1] Ettus Research, “Ettus Research, a National Instruments
Company,” https://www.ettus.com/site/about, 2019.
[2] Fairwaves, “XTRX, The first truly embedded SDR,”
https://www.crowdsupply.com/fairwaves/xtrx, 2019.
[3] Skylark Wireless, “Skylark Wireless, Broadband for the next
billion,” http://www.skylarkwireless.com/, 2019.
[4] Woh M., Lin Y., Seo S., Mudge T., and Mahlke S., “Analyzing
the scalability of simd for the next generation software defined
radio,” in 2008 IEEE Int. Conf. on Acoust., Speech and Signal
Process., March 2008, pp. 5388–5391.
[5] P. Westermann and H. Schröder, “On the scalability of simd
processing for software defined radio algorithms,” in 2010 Int.
Conf. on Embedded Comput. Syst.: Architectures, Model. and
Simul., July 2010, pp. 309–317.
[6] G. Sklivanitis, A. Gannon, S. N. Batalama, and D. A. Pa-
dos, “Addressing next-generation wireless challenges with
commercial software-defined radio platforms,” IEEE Com-
mun. Mag., vol. 54, no. 1, pp. 59–67, January 2016.
[7] C. S. Anderson, J. Zhang, and M. Cornea, “Enhanced vector
math support on the Intel AVX-512 architecture,” in IEEE 25th
Symp. on Comput. Arithmetic (ARITH), Jun. 2018, pp. 120–
124.
[8] Amarisoft, “LTE 100, a full software LTE solution,”
https://www.amarisoft.com/, 2019.
[9] I. Gomez-Miguelez, A. Garcia-Saavedra, P. D. Sutton, P. Ser-
rano, C. Cano, and D. J. Leith, “srsLTE: An open-source plat-
form for LTE evolution and experimentation,” in Proc. 10th
ACM Int. Workshop Wireless Netw. Testbeds, Exp. Eval. Char-
acterization. 2016, WiNTECH ’16, pp. 25–32, ACM.
[10] N. Nikaein, R. Knopp, F. Kaltenberger, L. Gauthier, C. Bonnet,
D. Nussbaum, and R. Ghaddab, “Demo: OpenAirInterface:
An open LTE network in a PC,” in Proc. 20th Annu. Int. Conf.
Mobile Comput. and Netw. 2014, MobiCom ’14, pp. 305–308,
ACM.
[11] openLTE, “OpenLTE; an open source implementation of
the 3GPP LTE specifications,”
https://sourceforge.net/projects/openlte/, 2019.
[12] Shah S., “Proposed features for OpenAirInterface,”
https://trello.com/c/cbhbyc46/94-performance-improvements, 2019.
[13] C. Y. Yeoh, M. H. Mokhtar, A. A. A. Rahman, and A. K.
Samingan, “Performance study of lte experimental testbed us-
ing openairinterface,” in 2016 18th Int. Conf. on Adv. Commun.
Technol. (ICACT), Jan 2016, pp. 617–622.
[14] Z. Geng, X. Wei, H. Liu, R. Xu, and K. Zheng, “Performance
analysis and comparison of gpp-based sdr systems,” in 2017
7th IEEE Int. Symp. on Microw., Antenna, Propag., and EMC
Technol. (MAPE), Oct 2017, pp. 124–129.
[15] Gringoli F., Patras P., Donato C., Serrano P., and Grunenberger
Y., “Performance assessment of open software platforms for
5g prototyping,” IEEE Wireless Commun., vol. 25, no. 5, pp.
10–15, October 2018.
[16] Labrosse J. J., Embedded Systems Building Blocks, Complete
and Ready-to-Use Modules in C, R&D Technical Books, 1995.
[17] Voss M., “Topics in loop vectorization,”
https://www.cs.utexas.edu/~pingali/CS377P/2019sp/lectures/vectorization-voss.pdf,
2018.
[18] Intel, “Intel Intrinsics Guide,”
https://software.intel.com/sites/landingpage/IntrinsicsGuide/, 2019.
[19] Eurecom, “openairinterface5G git repository,”
https://gitlab.eurecom.fr/oai/openairinterface5g, 2019.
[20] C. Shepard, H. Yu, N. Anand, E. Li, T. Marzetta, R. Yang, and
L. Zhong, “Argos: Practical many-antenna base stations,” in
Proc. 18th Annu. Int. Conf. Mobile Comput. and Netw. ACM,
2012, pp. 53–64.
[21] S. Malkowsky, J. Vieira, L. Liu, P. Harris, K. Nieman,
N. Kundargi, I. C. Wong, F. Tufvesson, V. Öwall, and O. Edfors, “The
world’s first real-time testbed for massive MIMO: Design, im-
plementation, and validation,” IEEE Access, vol. 5, pp. 9073–
9088, 2017.
[22] 3GPP, “Physical Channels and Modulation,” Technical Report
(TR) 36.211, 3rd Generation Partnership Project (3GPP), Dec.
2018, Version 15.4.0.
[23] H. S. Stone, “Parallel processing with the perfect shuffle,”
IEEE Trans. Comput., vol. C-20, no. 2, pp. 153–161, Feb.
1971.
[24] Frigo M. and Johnson S. G., “FFTW,” http://www.fftw.org/,
2018.
[25] Intel, “Intel® Math Kernel Library,”
https://software.intel.com/en-us/mkl, 2018.
[26] Intel, “Intel® Integrated Performance Primitives,”
https://software.intel.com/en-us/intel-ipp, 2018.
[27] 3GPP, “Base Station (BS) radio transmission and reception,”
Technical Report (TR) 36.104, 3rd Generation Partnership
Project (3GPP), Dec. 2018, Version 15.4.0.
[28] “iPerf - The TCP, UDP and SCTP network bandwidth mea-
surement tool,” https://iperf.fr/, 2019.
