The vision for future wireless communication systems embraces diverse qualities, targeting significant enhancements from the spectrum to the user experience. Newly defined air-interface features, such as a large number of base station antennas and computationally complex physical layer approaches, come with a non-trivial development effort, especially when scalability and flexibility need to be factored in. In addition, testing those features without commercial off-the-shelf equipment has a high deployment, operational and maintenance cost. On the one hand, industry-hardened solutions are inaccessible to the research community due to restrictive legal and financial licensing. On the other hand, research-grade real-time solutions either lack versatility, modularity and a complete protocol stack or, for those that are full-stack and modular, offer only the most elementary transmission modes (e.g., a very low number of base station antennas). Aiming to address these shortcomings towards an ideal research platform, this paper presents SWORD, a SoftWare Open Radio Design that is flexible, open for research, low-cost, scalable and software-driven, able to support advanced large and massive Multiple-Input Multiple-Output (MIMO) approaches. Starting with just a single-input single-output air-interface and commercial off-the-shelf equipment, we create a software-intensive baseband platform that, together with an acceleration/profiling framework, can serve as a research-grade base station for exploring advancements towards future wireless systems and beyond.
I. INTRODUCTION
The race to future generation wireless communication systems has brought a competition among operators, vendors and research institutes in the quest to be labelled 'world-first'. Targeted developments are multi-faceted, aiming to improve latency, reliability, data rate, connectivity and network efficiency by at least an order of magnitude. Yet, supporting all these aspects would first require an evolutionary approach to assess them on a system-wide level. There is therefore a need for a platform that allows: a) validation of diverse and/or computationally heavy approaches, e.g., cases with a large number of antennas or user equipment (UE) devices, or non-linear (NL) techniques, without the need of cost-prohibitive and time-consuming development effort required for real-time (RT) operation; b) accelerating digital signal processing (DSP) techniques (e.g., multiple-input multiple-output (MIMO) detection) and profiling execution time to highlight and indicate real-time barriers; and finally, c) testing MIMO approaches in a RT environment (e.g., mobility with actual signaling) for select configurations.
The associate editor coordinating the review of this manuscript and approving it for publication was Fang Yang.
Large enterprises have access to a huge pool of resources enabling in-house system-on-chip (SoC) designs for base and mobile station development. Such designs can serve as a flexible reference but are clearly outside the reach of the research and development community, including academia.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Hence, an open, research-grade testbed, flexible enough to support the diverse deployment scenarios, but still affordable, would be critical for boosting innovation and accelerating development towards the revolutionary approach that next generation mobile communications really require. Towards this goal, a large market for off-the-shelf software-defined radios (SDRs) exists that, accompanied with in-house development, can lead to a proprietary, next-generation-ready platform [1]-[3]. An ideal research platform could be characterized by four key attributes:
• Accessibility: To be a viable tool for a large part of the research and development community the platform has to be of low cost and needs to be built from commercial off-the-shelf (COTS) equipment that is widely available.
• Completeness: A full-stack implementation allows the effectiveness and impact of new approaches/solutions to be evaluated on a system-wide level. Here, we explicitly refer to performance metrics such as throughput as well as the overall processing load and latency. Such a thorough evaluation can review ideas at an early stage and therefore maximize their impact.
• Versatility: To maximize its research merit, the platform needs to support wide-reaching experiments; e.g., investigating the impact of diverse transmission environments, different MIMO configurations, new antenna designs, advanced signal processing and software/hardware optimization methods.
• Modularity: The ability to add or modify components with reduced effort and minimal repercussions or modification to other parts of the platform. This concept applies to hardware (e.g., antenna arrays, processing units) or software-defined components (e.g., precoding algorithms, channel estimation procedures).
The previous discussion clearly illustrates the need for cost-effective development, which research-grade radios and software-centric architectures running on general purpose processors (GPPs) can provide. While such solutions exist, they are either closed-source [4] or offer just the plainest transmission modes (TMs) [5]-[7]. Among those, OpenAirInterface (OAI) is perhaps the most advanced and the only open-source platform actively following 3rd Generation Partnership Project (3GPP) standardization. Despite the large community and joint development effort of multiple OpenAirInterface Software Alliance (OSA) partners, a number of features in OAI such as large MIMO and beamforming are still experimental, incomplete, or missing.
Existing research-grade testbeds target deployments on a massive scale that would be open to everyone for experimentation. One prominent example is COSMOS [8], a city-scale testbed for next generation wireless technologies aimed to be deployed in the city of New York. It is an ambitious ongoing work that is currently deployed campus-wide, with a roadmap that finishes by the end of 2020. COSMOS has an impressive backend, but mostly targets application layer research such as full-duplex narrowband, exploring optical x-haul, edge computing and smart city deployments. Its focus is not on the physical layer, apart from some mmWave channel measurements made possible with the support of IBM [9], [10]. A similar project is POWDER, which harnesses the expertise of the RENEW team [11], [12]. While both projects are impressive and backed by a US$ 100M grant, they seem to primarily focus on application layer research. POWDER aims to provide an open foundation platform using baseline OAI and srsLTE deployments, leaving research and experimentation to the platform's users. POWDER supports up to four channels with USRPs and a massive MIMO path with proprietary radios that, as we discuss in Section III-A, have their limitations in terms of interfacing and bandwidth. Other testbeds such as Lund University's LuMaMi [13] explore MIMO on a massive scale. The aforementioned platform employs Universal Software Radio Peripheral (USRP) devices as its frontend, with the rest of the system being tied to National Instruments' proprietary, expensive and inflexible MIMO framework, both hardware- and software-wise [14]. Thus, its reach is limited to lower physical layer (PHY) functions (i.e., it lacks forward error correction) and to linear algorithmic approaches. MASS-Start [15], [16] is another massive MIMO project that harnesses the potential of open source but is limited to hybrid beamforming and to no more than two layers.
This work presents SoftWare Open Radio Design (SWORD), a platform that aims to provide a software-driven, open for research collaboration, flexible, programmable, extendable and radio-agnostic research and development testbed for large MIMO systems. We acknowledge that a software-driven solution is the most promising way of meeting all the key attributes of accessibility, completeness, versatility and modularity. In this direction, SWORD takes OAI as a baseline and extends it to a full multi-user (MU)-MIMO platform. In particular and to the best of the authors' knowledge, SWORD is the first platform that combines all the following advanced features:
I) A novel non real-time (NRT) over-the-air (OTA) mode, which comprises an architecture/software co-design approach that enables rapid OTA testing of advanced algorithms for large-MIMO systems, potentially across the whole stack, focusing on non codebook-based, single- and multi-user MIMO scenarios. This enables rapid prototyping and OTA evaluation of novel approaches for MIMO systems.
II) A signal processing acceleration framework based on the single-instruction multiple-data (SIMD) architectural concept, which harnesses the newest Advanced Vector Extensions (AVX2) and AVX512 instructions of GPPs.
III) Support for built-in time-division duplexing (TDD) reciprocity calibration.
IV) A detailed profiling framework to accurately measure execution time. Through our profiling and acceleration framework we quantify processing bottlenecks and provide insights into the limits of software-based baseband processing.
V) RT single-layer dynamic beamforming support.
To the best of our knowledge, this is the first time in the open literature that an SDR platform has been profiled to such a fine-grained extent and also the first time that an acceleration framework has been presented [17] - [19] .
Similarly, although OAI has already been used to demonstrate single layer beamforming with a COTS UE [16] , to the best of our knowledge this is the first time an SDR-based platform is used to demonstrate 1) a single-layer dynamic beamforming with OAI running on both ends, and 2) multiuser MIMO scenarios. Furthermore, our novel NRT OTA mode provides our platform with a unique capability which allows OTA testing of advanced, computationally intensive approaches within the context of 3GPP-compliant MIMO systems. To the best of our knowledge, none of the existing platforms support such a mode of operation.
SWORD is an extendable platform with which we currently display results for up to 16 base station antennas. We employ SWORD to show that via our acceleration framework, we accelerate the execution of 16×4 MIMO operations by up to 91% compared to non-vectorized code and up to 61% compared to OAI's SIMD acceleration. In addition, we exploit our profiling framework to also indicate the limits of modern GPPs and SIMD, both for AVX2 and AVX512 systems. Hence, we provide evidence for the performance one would expect to attain through this open SDR platform.
While this work mainly focuses on the platform's enhancements, we nevertheless present initial indicative OTA measurements as an application of SWORD's modes. We consider two sets of experiments, mainly used for validation purposes. The first set relies on SWORD's RT mode, to test channel ageing at pedestrian speeds for single-layer beamforming. The second set harnesses SWORD's NRT-OTA mode to validate recently proposed advanced non-linear algorithms that would otherwise require significant investment in software/hardware optimization.
The rest of the paper is organized as follows: Section II intends to motivate the reader to understand the need for such a platform; we do this by providing evidence on actual limitations imposed by modern hardware. We continue in Section III with the challenges faced towards this testbed by presenting an analysis of commercially-available research-grade radios and SDR platforms in which we assess trade-offs and limitations. We then present in Section IV the architecture of our system, where, besides the fundamental modes, we also introduce the concept of NRT-OTA. We then continue by describing our feature enhancements to support MU-MIMO as well as our profiling and acceleration framework. This section also describes how SWORD's modularity can accelerate MIMO-related research by considering advanced non-linear detection for MIMO systems [20]-[23] as an application. The results of the acceleration and profiling enhancements are presented in Section V, along with the results of the OTA measurements. We proceed with the discussion of the results in Section VI, and finally, conclude the paper in Section VII.
II. WHY DO WE NEED YET ANOTHER MIMO PLATFORM?
In this section we highlight actual engineering hurdles that can hinder resource-constrained research-oriented development. These illustrate the need for our open, flexible and extendable platform that will be presented next.
A. SMALL RESEARCH TEAMS V. HUGE ENGINEERING PROBLEMS
It should be more or less evident that SoCs such as Balong 5G01 [24], Exynos Modem 5100 [25] or experimental platforms, e.g., Intel's Mobile Trial Platform [26], feature a prohibitive research and development (R&D) cost. A large part of the high R&D cost stems from the need to design an end-to-end radio access network (RAN) within a restrained power envelope. Besides, reference SoC designs for these platforms are, and will probably always be, unavailable to academia as well as to small and medium-sized enterprises (SMEs). Furthermore, due to restrictive license agreements, any kind of availability in the form of a closed-source product would be of little merit to the scientific community.
In addition to the above, the huge increase in data rates aggravates the situation when MIMO and wideband operation become the targets. Figure 1 shows the indicative front-haul data rate for fifth-generation (5G) New Radio (NR) sub-6 GHz use cases at 15 and 30 kHz subcarrier spacing with respect to the number of radio frequency (RF) channels and the channel bandwidth. Data is displayed considering 16 bits per in-phase and quadrature (I/Q) part for two functional splits (i.e., time- and frequency-domain, where in the latter only the active tones are passed between the remote radio head (RRH) and the baseband unit (BBU)). Essentially, the vertical axis shows the number of 10 Gigabit Ethernet (10GbE) links required in each case for passing uncompressed I/Q samples between the RRH and the BBU. It is evident that the rate becomes prohibitive for high channel counts and bandwidth, easily exceeding 40 Gbps. With these kinds of functional splits, signal processing complexity then gets transferred to the BBU, alongside much stricter synchronisation requirements and the fibre cost needed to handle such a high data rate [27]. The alternative is to ''transfer'' the complexity, including development effort, to the RRH by implementing a user-plane data split [27].
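To make the scaling behind such front-haul figures concrete, the following minimal sketch computes the uncompressed I/Q rate for the two functional splits. It is an illustration under stated assumptions rather than the computation behind Figure 1: the function names are hypothetical, the 273 resource blocks correspond to a 100 MHz NR carrier at 30 kHz spacing, the 122.88 MS/s rate is the one quoted for 30 kHz operation, and link-level protocol overhead is ignored.

```python
import math

BITS_PER_SAMPLE = 2 * 16  # 16 bits for I plus 16 bits for Q
LINK_RATE_BPS = 10e9      # nominal 10GbE line rate (overhead ignored, assumption)


def time_domain_rate(sample_rate_hz, n_channels):
    """Bit rate when raw time-domain I/Q samples cross the front-haul."""
    return sample_rate_hz * BITS_PER_SAMPLE * n_channels


def freq_domain_rate(n_rb, scs_khz, n_channels, symbols_per_slot=14):
    """Bit rate when only the active tones cross the front-haul."""
    active_tones = 12 * n_rb                   # 12 subcarriers per resource block
    slots_per_second = 1000 * (scs_khz // 15)  # a 1 ms subframe holds scs/15 slots
    symbols_per_second = symbols_per_slot * slots_per_second
    return active_tones * symbols_per_second * BITS_PER_SAMPLE * n_channels


def links_needed(rate_bps):
    """Number of 10GbE links needed to carry the given rate."""
    return math.ceil(rate_bps / LINK_RATE_BPS)


if __name__ == "__main__":
    for n_ch in (1, 4, 16):
        td = time_domain_rate(122.88e6, n_ch)
        fd = freq_domain_rate(273, 30, n_ch)
        print(f"{n_ch:2d} ch: time-domain {td / 1e9:6.2f} Gbps "
              f"({links_needed(td)} links), "
              f"frequency-domain {fd / 1e9:6.2f} Gbps ({links_needed(fd)} links)")
```

Under these assumptions both splits grow linearly with the channel count, and the frequency-domain split saves bandwidth only in proportion to the share of inactive tones and cyclic-prefix samples.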
To accelerate research into future wireless communications, Beecube, a Santa Clara-based company now acquired by National Instruments (NI), introduced a very powerful proprietary platform. Using Analog Devices' AD9361 frontend [28] and Xilinx SoCs [29], Beecube presented a viable alternative to application-specific integrated circuit (ASIC) prototyping, provided of course that one would be ready to invest approximately US$ 6000 per RF channel (limited to 56 MHz) and many person-hours to develop the physical layer modules from scratch on a semi-proprietary framework harnessing Simulink and Xilinx's System Generator [30]. Apart from Beecube, Quebec-based Nutaq Innovations still offers a similar pathway [31]. Following Beecube's acquisition, National Instruments brought Beecube's technology into its N310 Universal Software Radio Peripherals (USRPs) [32] and LabVIEW (i.e., NI's own development framework) into Beecube's hardware. Still, as we mention in Section III-A, the X300/X310 platform is the suggested option where scalable phase coherence is needed. The X300/X310 provides a very capable frontend and is a reasonable choice for future-proof research, given that it halves the cost per RF channel and essentially doubles the bandwidth compared to Beecube's alternative. Choosing National Instruments' platform for end-to-end MIMO development, though, is also expensive and limiting. The reference MIMO framework relies on NI's Peripheral Component Interconnect (PCI) eXtensions for Instrumentation (PXI) modules and the LabVIEW suite, while it has no integrated forward error correction (FEC) solution [14].
There is therefore a benefit in revisiting the software-defined concept of solutions such as OpenAirInterface and srsLTE. But what about their real-time performance? Figure 2 shows our coarse-grained profiling results for physical layer computation time on one of the fastest commercially-available processors, the Intel Core i9-7980XE. Presented results distinguish between operations in the frontend and in baseband, for single-input single-output (SISO) as well as MU-MIMO scenarios. We note here that OAI supports up to TM 1 at 10 MHz for TDD; the remaining modes were added by us as presented in subsequent sections. Based on OAI's threading architecture [33], real-time operation dictates that the sum of Rx frontend and baseband processing time needs to be below 1 ms. Our measurements in Figure 2 show that even on an Intel Core i9-7980XE, TM 1 at 20 MHz looks to be infeasible, as do most of the higher order MIMO modes.
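The real-time criterion above (Rx frontend plus baseband below 1 ms) can be checked with a very small timing harness. The sketch below is only an assumption of how such coarse-grained profiling could look; the class `StageProfiler`, the stage names and the placeholder workload are hypothetical and are not taken from OAI or SWORD.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

SUBFRAME_BUDGET_S = 1e-3  # Rx frontend + baseband must finish within 1 ms


class StageProfiler:
    """Accumulates wall-clock time per named processing stage."""

    def __init__(self):
        self.elapsed = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.elapsed[name] += time.perf_counter() - start

    def meets_deadline(self, stages=("rx_frontend", "baseband")):
        """True if the summed time of the given stages fits the budget."""
        return sum(self.elapsed[s] for s in stages) < SUBFRAME_BUDGET_S


def dummy_work(n):
    # Placeholder load standing in for real signal processing.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    prof = StageProfiler()
    with prof.stage("rx_frontend"):
        dummy_work(1_000)
    with prof.stage("baseband"):
        dummy_work(5_000)
    total_us = 1e6 * sum(prof.elapsed.values())
    print(f"total {total_us:.1f} us, real-time feasible: {prof.meets_deadline()}")
```

Wall-clock timers of this kind only give a coarse picture; cache effects and thread scheduling can shift individual measurements considerably.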
Hardware offloading onto a field-programmable gate array (FPGA) accelerator is also an option to be considered. Although options towards making this process more accessible do exist [34], [35], given the complexity involved in conjunction with the strict physical layer latency requirements, it is safe to assume that in this case there cannot be a software-like generalization. Additionally, as our measurements in Figure 3 show, transfers can be costly and system design therefore non-trivial, even on a 16-lane PCIe 3.0 link, i.e., the fastest available on today's COTS programmable platforms. Our results measure the time to transfer data to (host-to-device, H2D) and from (device-to-host, D2H) the FPGA device in the case of orthogonal frequency-division multiplexing (OFDM) subframe procedures for an increasing number of resource blocks ($N_{RB}$). For a single port, overheads dominate the transfer for all resource blocks (RBs) tested, as is the case for 2 ports up to $N_{RB} = 216$, 4 ports up to $N_{RB} = 106$ and 8 ports up to $N_{RB} = 52$. We note that the host-to-device and device-to-host transfers will normally be present for OFDM modulation and demodulation respectively. These, though, depend on the radio platform's interface and the implemented functional split [27]. For a radio-agnostic solution with the OFDM employed purely as an accelerator, the H2D/D2H transfer times should be summed.
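A simple way to reason about when overheads dominate is to model each transfer as a fixed setup cost plus the payload divided by the link bandwidth. The sketch below follows that model under clearly hypothetical numbers: the fixed overhead of 10 µs, the 4-byte sample size and the payload layout are assumptions for illustration and are not the measured values of Figure 3. Only the raw PCIe 3.0 rate (8 GT/s per lane with 128b/130b encoding) is a known link property.

```python
PCIE3_X16_BYTES_PER_S = 16 * 8e9 * (128 / 130) / 8  # about 15.75 GB/s raw
FIXED_OVERHEAD_S = 10e-6   # hypothetical per-transfer setup cost
BYTES_PER_SAMPLE = 4       # 16-bit I plus 16-bit Q
SYMBOLS_PER_SUBFRAME = 14  # assumption: normal cyclic prefix


def payload_bytes(n_rb, n_ports):
    """Frequency-domain payload of one subframe for all ports."""
    return 12 * n_rb * SYMBOLS_PER_SUBFRAME * BYTES_PER_SAMPLE * n_ports


def transfer_time(n_rb, n_ports):
    """One-way transfer time: fixed overhead plus time on the wire."""
    return FIXED_OVERHEAD_S + payload_bytes(n_rb, n_ports) / PCIE3_X16_BYTES_PER_S


def overhead_dominates(n_rb, n_ports):
    """True while the fixed cost exceeds the time spent moving data."""
    wire_time = payload_bytes(n_rb, n_ports) / PCIE3_X16_BYTES_PER_S
    return FIXED_OVERHEAD_S > wire_time


def accelerator_round_trip(n_rb, n_ports):
    """H2D plus D2H, as needed when the block is used purely as an accelerator."""
    return 2 * transfer_time(n_rb, n_ports)


if __name__ == "__main__":
    for ports in (1, 2, 4, 8):
        for n_rb in (25, 100, 250):
            t_us = 1e6 * accelerator_round_trip(n_rb, ports)
            print(f"ports={ports} N_RB={n_rb:3d}: round trip {t_us:6.1f} us, "
                  f"overhead dominates: {overhead_dominates(n_rb, ports)}")
```

Because the payload in this model scales with the product of ports and resource blocks, the break-even $N_{RB}$ halves each time the port count doubles, which is the same trend as the thresholds quoted above (216, 106 and 52).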
Reflecting on the above, it seems that given the lack of a reference SoC and the cost of human resources, combining research-grade needs such as MU-MIMO support, flexibility and programmability with the unforgiving real-time deadlines of 3GPP is infeasible. What if we could leverage SDR programmability to maintain a full protocol stack without the need to meet strict RT constraints? In the following sections we present a platform that can actually achieve this.
III. CHOOSING THE BUILDING BLOCKS
This section outlines the most significant challenges one can meet when building a MIMO platform, from the choice of radio hardware to the SDR framework.
A. RADIO CHOICES AND CHALLENGES
In this subsection we present the status of COTS SDR platforms that can serve as the front-end of a research-grade baseband and also discuss their potential as part of a 5G-ready MIMO TDD system. While on one hand industry-grade units are rigorously tested, ruggedized and deployment-ready, on the other hand they offer limited flexibility, scalability, and dimensionality. They are closed-source in principle and require Common Public Radio Interface (CPRI) to connect to baseband [36] - [38] . Thus, they do not fulfill the needs of an experimental research platform. Still, there is an ample selection of research-grade radios in the market that can be cost-effective, programmable and MIMO-ready.
1) SDR HARDWARE OVERVIEW
Radio providers today are either small start-up enterprises (e.g., Lime Microsystems, Skylark Wireless, and Fairwaves), or larger enterprises such as National Instruments. Table 1 provides an indicative list of COTS SDR platforms, alongside their most important features: bandwidth, number of channels, MIMO capability and cost per channel. The vast majority of SDR platforms come in the form of radio ''slices'', i.e., a basic 2 × 2 transceiver that can serve as the basis for a larger MIMO configuration, subject to a multi-slice synchronization mechanism. With the exception of the N210, the X3X0 series and the IRIS-030, all SDRs feature an inbuilt front-end with a wide tuneable range that supports band n78, i.e., one of the first 5G New Radio bands in FR1 (sub-6 GHz) targeted for deployment and testing. The N210 and X3X0 also support band n78 through plugin daughterboards (as does the IRIS-030), which translates to a higher cost per channel. Excluding the N210, all SDRs support 3GPP-compliant sampling rates via their internal master clock and embedded up/down conversion filtering. It is worth noting that most solutions are limited to 50 MHz of channel bandwidth due to their frontends and digital-to-analog/analog-to-digital converters. The USRP X series and the N310 have the potential to surpass this limitation. Note, though, that while the N310 inherently supports the 122.88 MS/s rate needed for 30 kHz subcarrier spacing, the X series have a 200/184.32 MHz master clock and thus need to be augmented by an external resampling mechanism, be it software- or hardware-based.
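The resampling requirement can be illustrated with a short calculation. The sketch below assumes a 4096-point FFT (a common choice that yields 122.88 MS/s at 30 kHz spacing) and assumes the radio streams at its full master clock rate; the function names are hypothetical.

```python
from fractions import Fraction


def nr_sample_rate(fft_size, scs_hz):
    """Sampling rate implied by an FFT size and a subcarrier spacing."""
    return fft_size * scs_hz


def resampling_ratio(target_hz, master_clock_hz):
    """Rational factor (output rate / input rate) a resampler has to realize."""
    return Fraction(target_hz, master_clock_hz)


if __name__ == "__main__":
    target = nr_sample_rate(4096, 30_000)  # 122_880_000 samples per second
    for clock in (184_320_000, 200_000_000):
        ratio = resampling_ratio(target, clock)
        print(f"master clock {clock / 1e6:.2f} MHz -> "
              f"resample by {ratio.numerator}/{ratio.denominator}")
```

Under these assumptions the 184.32 MHz clock needs a simple 2/3 rational resampler, whereas the 200 MHz clock leads to the far less convenient ratio 384/625.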
2) ON-BOARD PROCESSING CAPABILITIES
All SDRs have an embedded FPGA device which provides the digital frontend and interfacing functionality. These FPGAs also provide the potential for offloading extra functionality such as the time-to-frequency conversion, which can reduce front-haul requirements and free up resources for other tasks on the SDR host. The most powerful FPGAs reside in the X series USRPs (Kintex 7) and the N310 (Zynq 7), followed by a) the very interesting IRIS-030 (also Zynq 7), which seems to have found the best tradeoff between performance, flexibility and cost, and b) the spec-wise very similar XTRX Pro and Sidekiq (Xilinx Artix 7). Zynq-based FPGAs on the N310 and the IRIS-030 include an embedded ARM processor that greatly enhances their programmability, while the X series include a soft processor based on FPGA slices. Interface-wise, Ethernet provides the highest flexibility and COTS connectivity; PCIe has the lowest latency and resource utilisation but also requires proprietary drivers; USB3, while also widespread, has a higher utilisation and limits achievable rates. The X series, the N310 and the XTRX Pro are the preferred options in that respect. While the IRIS-030 does provide the option of a high-speed serial interface on a simple protocol (Xilinx's Aurora), it needs termination onto an FPGA accompanied with proprietary development and is thus the least attractive option in that sense.
3) MIMO AND SYNCHRONIZATION CAPABILITIES
Achieving MIMO support beyond that of the inherent transceiver slice requires a time reference, a clock and pulse distribution network and coherence among the RF oscillators. Hence it is dependent on both the analog and digital frontends. Almost all SDRs support external clock and pulse inputs, while the N310 and XTRX Pro also have an embedded Global Positioning System (GPS) for time reference. We should note though that the N310's AD9371 front-end is not suggested for phase coherent applications beyond those supported (i.e., 4 × 4) [32] . Thus, the most versatile options seem to be the X series as well as the XTRX Pro/Sidekiq.
4) PROGRAMMABILITY AND COST
Most SDRs provide their own open-source application programming interface (API), including low-level drivers alongside the FPGA firmware/bitstreams. There also exist abstraction layers such as SoapySDR [39] which act as an ''arbitrator'' between the radios and platforms such as OAI, srsLTE and GNU Radio [5], [6], [40]. We note that Ettus/NI USRPs also provide the option of high-level design tools [41] which speed up development at a high licensing/support premium; these usually require additional proprietary hardware, hence increasing the cost per channel. Suffice it to say, the latter is also a decisive factor when choosing an SDR platform. Given our above analysis, potential 5G feature support and the cost as highlighted in Table 1, the USRP X series and the XTRX Pro are the most attractive solutions.
B. EXISTING SDR SOLUTIONS AND OPENAIRINTERFACE
We investigated several solutions which allow for cost-effective wireless systems development. Their common feature is the research-grade radio hardware and a software-based stack running on general-purpose processors. Our findings are summarized in Table 2. Aiming at the realistic validation and testing of advanced solutions and given the closed-source nature of Amarisoft's LTE and 5G-NR network software suites (see https://www.amarisoft.com/products/custom-projects/), OAI [5] emerged as the most promising solution to form the basis for our testbed.
In general, OAI is an open-source SDR experimentation platform which strives to provide a software implementation of a complete protocol stack for fourth-generation (4G) and 5G mobile cellular systems, compliant with the 3GPP standards. It provides functionalities of UE, RAN, as well as a core network (CN) and can be used to deploy a low-cost, 3GPP-compliant, cellular network running real-time on COTS SDRs and standard Linux-based PCs. The existence of a large developer community and the combined effort of multiple OSA partners made OAI the most popular and advanced publicly available SDR platform which aims to provide a flexible solution for conducting 5G research.
Despite OAI being in a clear lead compared to its open-source competitors (see Table 2 ), during our extensive investigations of its codebase, we found a number of important functionalities in 4G LTE and 5G NR to be still incomplete or missing. This is particularly noticeable in the case of 5G NR, as the OSA members have yet to achieve bidirectional SISO connectivity. In the case of 4G Long-Term Evolution (LTE) the missing (or incomplete) functionality relevant to the development of our testbed involves multilayer, non codebook-based precoding transmission schemes. Support of the latter was vital for SWORD's capability to test advanced MU-MIMO techniques. Additionally, we identified several issues related to OAI's emulation mechanism, as well as to its multi-threading, channel estimation, and uplink power control operations.
Our evaluation (Section V-A) shows that of similar importance to SWORD is the support of advanced SIMD instructions on the central processing unit (CPU). AVX2 and particularly its successor, AVX512, were designed to significantly improve the execution time of math-heavy subroutines such as physical layer signal processing. OAI's codebase does not provide comprehensive support for the Advanced Vector Extensions (AVX) technology, as the use of these instructions was originally precluded due to the lack of CPU support. The only OAI routines currently supporting AVX2 are the discrete Fourier transform (DFT), inverse DFT, log-likelihood ratio (LLR) estimation, low-density parity-check (LDPC) encoder/decoder and partly the turbo encoder/decoder. Other OAI routines such as channel estimation-related code, which as we will show significantly benefit from advanced SIMD use, have never been updated to support it. To the best of our knowledge and at the time of writing this paper, none of the OAI subroutines support the more recent AVX512 instructions. As our results in Section V-A will show, AVX512 support will be crucial for OAI to handle large MIMO configurations.
IV. PLATFORM OVERVIEW AND FEATURES
SWORD is a flexible, scalable and open-source platform based on COTS components. SWORD takes OAI as a baseline and extends it to enable investigation of large MU-MIMO approaches. Our platform operates in TDD mode and limits the channel state information (CSI) acquisition overhead by exploiting channel reciprocity. It supports up to 16 base station antennas ($N_{BS}^{ant}$), with a potential for further increase. SWORD inherits all of OAI's features, including support for 3GPP LTE, as well as for various COTS SDRs. It is worth noting here that due to the inherent complexity of the 3GPP air interface and OAI's structure, modifying the latter's codebase is far from being trivial and can often break existing functionality. Thus, development of any kind of functional OAI extensions can be a challenging task in itself.
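To illustrate why reciprocity limits the CSI acquisition overhead, the sketch below reuses an uplink channel estimate directly as downlink knowledge for single-layer beamforming. Maximum ratio transmission is used here only as a textbook example and is not claimed to be SWORD's precoder; the channel values are made up, and perfectly calibrated reciprocity (downlink channel equal to the uplink estimate) is assumed.

```python
import math


def mrt_weights(h_ul):
    """Unit-norm maximum ratio transmission weights from an uplink estimate."""
    norm = math.sqrt(sum(abs(h) ** 2 for h in h_ul))
    return [h.conjugate() / norm for h in h_ul]


def effective_gain(h_dl, weights):
    """Magnitude of the beamformed downlink channel seen by the UE."""
    return abs(sum(h * w for h, w in zip(h_dl, weights)))


if __name__ == "__main__":
    # Made-up uplink estimate for four base station antennas.
    h_ul = [1 + 1j, 2 - 1j, -0.5j, 3 + 0j]
    # Assumption: after calibration the downlink channel equals the uplink one.
    h_dl = list(h_ul)
    gain = effective_gain(h_dl, mrt_weights(h_ul))
    print(f"beamformed gain {gain:.3f} vs best single antenna "
          f"{max(abs(h) for h in h_dl):.3f}")
```

With this choice of weights the beamformed gain equals the Euclidean norm of the channel vector, so no downlink pilots or UE feedback are needed to form the beam.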
The following subsections describe SWORD in more detail, beginning with an overview of the supported modes of operation and its baseline architecture. We introduce SWORD's NRT-OTA mode, an architectural/software co-design concept which enables us to rapidly evaluate research advancements. We then present enhancements to support MU-MIMO as well as our profiling and acceleration framework, which we will then employ to explore real-time performance. Finally, as an application that highlights SWORD's modularity, we present the steps for migrating an advanced nonlinear detection algorithm, originally written in Matlab.
A. MODES OF OPERATION
SWORD supports three different modes of operation, each of which can be employed for various applications: a) real-time, over-the-air (RT-OTA), b) non-real-time emulation (NRT-EMU) and c) non-real-time, over-the-air (NRT-OTA). The first two modes were originally supported by OAI in some form, with the last mode being introduced for the first time in SWORD. All modes assume single-cell operation.
The first mode allows for RT-OTA operation and supports communication with COTS UEs as well as with software UEs. Due to its strict timing requirements, this mode does not permit rapid evaluation of research advancements. The RT operation cannot be easily achieved as it requires highly optimized code. Even under these restrictions, the RT operation may be further limited to low bandwidths, or may require expensive (in terms of implementation effort) hardware offloading. The intended uses for this mode include very specific tasks such as experiments which investigate the impact of realistic channel aging on system performance. SWORD's setup for RT-OTA operation consists of a single PC which hosts an entire eNB protocol stack and a number of PCs which are used to host UE protocol stacks. The Host PC which acts as an eNB is connected to a number of SDR modules. These are being synchronized using an external clock distribution module to maintain phase coherence. This is not necessary on the UE side, as each Host PC in our setup is connected to a single SDR slice (and most SDR slices have two internally synchronized independent channels).
SWORD's second mode allows for NRT emulation and is intended for validation and testing. This mode is crucial for rapidly evaluating new features as it allows conducting tests in a fully-controlled environment, without potential hardware-related issues. It should be noted that additional effort has been devoted to extending OAI's emulation mode in order to support testing and validation of systems with a large $N_{BS}^{ant}$. Furthermore, issues related to simultaneous emulation of multiple UEs had to be addressed, including race conditions between threads responsible for per-UE baseband processing. A number of issues involving transmission over a virtual channel with sounding reference signal (SRS) were also identified and resolved. An environment capable of supporting a large $N_{BS}^{ant}$ and the simultaneous emulation of multiple UEs is vital for validating and implementing new algorithms. This environment aims to reduce debugging effort during the OTA tests. In contrast to the RT-OTA mode, this setup does not employ any real radios. Virtual radios are emulated instead, with transmission taking place over a virtual channel. We note that OAI's emulation inherently supports a number of channel models (e.g., Spatial Channel Model [43], Extended Typical Urban [44]). The use of virtual radios and channel allows for all UEs and the eNB to be hosted on the same PC, the latter's internal clock being employed as a common timing reference.
We now introduce the NRT-OTA mode, i.e., SWORD's third mode of operation. To the best of our knowledge, this full-stack mode has not been previously supported in any form by OAI or by any other existing research platform. This mode was designed to allow advanced signal processing and scheduling techniques to be validated under real channel conditions without requiring extensive code optimization or hardware-assisted offloading. Moreover, NRT-OTA aims to enable realistic testing (including, if necessary, full end-to-end connectivity over the CN) for setups with a large $N^{\mathrm{BS}}_{\mathrm{ant}}$ without the real-time issues associated with scalability. Although similar to NRT-EMU, NRT-OTA transmits over a real channel instead. NRT-OTA's radio transceivers are connected to a single PC which hosts both the eNB side and the UE side. In order to handle the timing reference for transmission, reception and processing of the received samples, we reuse NRT-EMU's inter-thread signaling. Still, further synchronization between SDR modules is also necessary so that transmission and reception can be initiated on both sides at exactly the same time. This is achieved via an external clock distribution module, which is additionally employed to maintain phase coherence. The need for an external clock distribution module for synchronizing transmission and reception instances is expected to be removed in the next iteration of the NRT-OTA mode, with synchronization taking place over the air. It is worth noting that, in contrast to NRT-EMU, NRT-OTA also needs to properly handle various aspects related to transmission over a real channel (e.g., propagation delay). For this purpose, we modified a subset of the RT-OTA-dedicated routines employed to manage such aspects. Furthermore, to investigate different channel conditions, we explored antenna placement via long, low-attenuation cables which we used to interconnect antennas with SDR modules on the UE side.
A simplified diagram depicting the operation of NRT-OTA for a single eNB and up to N layers = K UEs is shown in Figure 4 . In general, NRT-OTA's operation can be distinguished as a series of Tx/Rx periods with pause periods in between. During the Tx/Rx periods the SDR modules on one side are programmed to transmit, whilst the SDR modules on the other side are programmed to receive. For example, an uplink subframe (UL SF) constitutes a Tx period for the UEs and an Rx period for the eNB (Fig. 4) . The received samples are then transferred to the host PC and processed during the subsequent pause period. In contrast to the Tx/Rx periods which are always equal to the transmission time interval (TTI), pause periods can have a varying length. SWORD's design allows these periods to adapt to the amount of time necessary to process the received signal and to generate a new signal for transmission ( Fig. 4) . It is thus possible to bridge the gap between NRT-OTA and RT-OTA, given a high-end host PC and through suitable optimization of individual components and/or hardware-assisted offloading.
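The alternation of fixed-length Tx/Rx periods and variable-length pause periods can be illustrated with a small timeline model. The following is a minimal sketch only, assuming a 1 ms TTI (as in LTE) and hypothetical per-subframe processing times; the function names are illustrative and are not part of SWORD or OAI.

```python
TTI_MS = 1.0  # assumed LTE transmission time interval; each Tx/Rx period lasts one TTI


def nrt_ota_timeline(processing_ms):
    """Return (start, end, kind) tuples for alternating Tx/Rx and pause periods.

    processing_ms holds the (hypothetical) host processing time needed after
    each Tx/Rx period; the pause simply stretches to cover it.
    """
    timeline, t = [], 0.0
    for proc in processing_ms:
        timeline.append((t, t + TTI_MS, "txrx"))
        t += TTI_MS
        timeline.append((t, t + proc, "pause"))  # pause adapts to processing time
        t += proc
    return timeline


def slowdown_factor(processing_ms):
    """Wall-clock time divided by on-air time (1.0 would mean no pauses at all)."""
    on_air = TTI_MS * len(processing_ms)
    return (on_air + sum(processing_ms)) / on_air


timeline = nrt_ota_timeline([3.0, 0.5, 7.5])
factor = slowdown_factor([3.0, 0.5, 7.5])
```

In this toy model, shrinking the processing times (e.g., through optimization or offloading) moves the slowdown factor towards 1, which is the sense in which the gap between NRT-OTA and RT-OTA can be narrowed.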
B. HARDWARE COMPONENTS
The main component of our platform is a x86_64 workstation whose CPU cores support the newest SIMD extensions [45] for vectorized baseband signal processing. Two such examples are the Xeon Gold 6154 and the Core i9-7980XE GPPs. For SWORD to be scalable and extendable, a high number of PCIe lanes are also necessary; these can potentially host multiple, multi-gigabit network interface cards (NICs) and programmable FPGA boards as a future enhancement. The former could be employed to interface with the radio transceivers (or to terminate traffic from the RRH) with the latter potentially employed as standalone accelerators. In a monolithic setup, the described workstation also hosts the radio transceivers. We note that for future proofing, a split architecture setup could instead be chosen. In that case, separate RRH and baseband nodes would exist, both based on x86_64 workstations. The aggregated and digitized RF data would then be exchanged between the RRH and baseband nodes as e.g., Radio over Ethernet traffic [46] , [47] . Multiple benefits stem from the inherent modularity of such architecture, allowing different computationally intensive physical layer operations to be split between the workstations. As the monolithic architecture is sufficient for proving the concept of our platform, a split architecture setup is not further investigated in the scope of this paper. For RT-OTA operation, a number of N layers = K x86_64 workstations are employed. These should have similar processing capabilities as the system used on the eNB side.
SWORD's radio transceivers employ a high speed Ethernet-based interface for interconnecting with the workstation(s). We chose the USRP X series with UBX as our radio frontend, because, in contrast to the N310 USRP [32] , this setup does not exhibit issues with phase coherence, the latter being necessary for MIMO operation. For providing a highly-accurate external clock reference and pulse distribution module, we exploit the Ettus Research Octoclock-G CDA-2990 [48] .
The multi-channel radio unit formed by the synchronized USRPs on the eNB side is connected to a 3.4-3.8 GHz multi-element antenna consisting of a uniform linear array (ULA), which in turn is composed of half-wavelength-spaced elements in a dual-polarized configuration.
C. INTRODUCING MULTI USER-MIMO SUPPORT
1) SUPPORT FOR ADVANCED TRANSMISSION MODES
As highlighted in Section III-B, one significant feature that is missing from OAI is support for multi-layer, non codebook-based transmission schemes.
To support such modes, we applied enhancements for single-layer, non codebook-based precoding, a mode known in LTE as TM 7. Although this mode was shown to be operational, it was demonstrated only in a setup with a COTS UE [16] . To enable operation in a setup with a software-based UE, a number of issues related to channel estimation, dynamic beamforming and handling of SRS (employed for calculation of beamforming weights) had to be resolved. Furthermore, we extended OAI by enabling signaling which is necessary for MU-MIMO transmissions with non codebook-based precoding (i.e., modes known as TM 8 and TM 9 in LTE). This included extending Radio Resource Control (RRC)-related subroutines and adding procedures for generation and handling of Downlink Control Information (DCI) (which indicates resource assignment in LTE). We provided missing baseband procedures (e.g. channel estimation, modulation and demodulation) for handling multiuser, multi-layer transmissions on both the RAN and the UE sides. This also included the implementation of software precoders and detectors, including maximum ratio transmission (MRT), zero forcing (ZF), and NL (see Section IV-E for more details). Additionally, we introduced a new MU-MIMO dedicated scheduler. The latter can simultaneously schedule multiple UEs over the same resources for both uplink (UL) and downlink (DL) transmissions.
Our implementation of advanced MU-MIMO TMs has been validated using SWORD's extended NRT-EMU mode (OAI's original NRT emulation mode did not support scenarios where $N^{\mathrm{BS}}_{\mathrm{ant}} > 2$ and did not work properly when SRS was used). We then tested it using our newly introduced NRT-OTA mode (described in more detail in Section IV-A). The developed functionalities were also tested in RT-OTA mode, for $N_{\mathrm{layers}} = 1$.
2) BUILT-IN TDD CALIBRATION
Availability of channel state information at the transmitter side is crucial for proper operation of TMs with non codebook-based precoding. In TDD systems, channel state information can be obtained at the transmitter by exploiting channel reciprocity. In order to achieve channel reciprocity on a real system, a calibration procedure is necessary. The role of calibration is to remove/compensate for imperfections between hardware used on both sides of a link. Although OAI has been already used to demonstrate TM 7 in a setup with large number of antennas [16] , the calibration code for such a procedure has never been released to the public. To that end, SWORD includes a built-in TDD calibration procedure that has been introduced as part of this work. This procedure is executed on system startup and can be periodically triggered during runtime.
A number of calibration techniques have been proposed in the literature (e.g. [49] - [52] ). The technique implemented in our platform exploits relative calibration which does not require external reference sources [49] . More specifically, we implemented the internal base station (BS) calibration procedure originally presented in [50] . This procedure involves bi-directional transmission of a calibration signal (the SRS signal in our case) between a BS reference antenna and all antenna elements in a BS array. Our platform performs calibration during its initialization phase and then exploits the obtained coefficients to adjust instantaneous UL channel estimates during runtime. To compensate for any phase drift (e.g., due to temperature fluctuation or caused by other characteristics), the procedure is periodically repeated.
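The principle behind relative calibration can be illustrated with a toy model. The sketch below is not SWORD's calibration code and does not reproduce the exact procedure of [50]; it assumes each RF chain is a single complex gain, that the over-the-air paths are perfectly reciprocal, and that measurements are noise-free. All names and values are hypothetical.

```python
import cmath
import random


def relative_calibration(h_ref_to_elem, h_elem_to_ref):
    """Per-antenna coefficients k_i = h(element i -> ref) / h(ref -> element i)."""
    return [b / a for a, b in zip(h_ref_to_elem, h_elem_to_ref)]


def calibrate_ul_estimates(h_ul, coeffs):
    """Scale instantaneous UL estimates so that they can be used for DL precoding."""
    return [h * k for h, k in zip(h_ul, coeffs)]


# Toy model: hypothetical complex Tx/Rx front-end gains for every chain.
rng = random.Random(0)


def rand_gain():
    return cmath.rect(rng.uniform(0.5, 1.5), rng.uniform(-cmath.pi, cmath.pi))


n_ant = 4
t_ref, r_ref = rand_gain(), rand_gain()        # reference antenna Tx/Rx chains
t = [rand_gain() for _ in range(n_ant)]        # BS array Tx chains
r = [rand_gain() for _ in range(n_ant)]        # BS array Rx chains
c = [rand_gain() for _ in range(n_ant)]        # reciprocal path: reference <-> element
g = [rand_gain() for _ in range(n_ant)]        # reciprocal path: BS element <-> UE
t_ue, r_ue = rand_gain(), rand_gain()          # UE Tx/Rx chains

coeffs = relative_calibration(
    [t_ref * c[i] * r[i] for i in range(n_ant)],   # reference transmits, element receives
    [t[i] * c[i] * r_ref for i in range(n_ant)],   # element transmits, reference receives
)
h_ul = [t_ue * g[i] * r[i] for i in range(n_ant)]  # what the BS can estimate
h_dl = [t[i] * g[i] * r_ue for i in range(n_ant)]  # what the precoder actually needs
h_dl_est = calibrate_ul_estimates(h_ul, coeffs)

# After calibration, h_dl / h_dl_est is one common complex scalar for all antennas.
ratios = [h_dl[i] / h_dl_est[i] for i in range(n_ant)]
```

The calibrated estimates match the true downlink channel only up to a scalar shared by all antennas, which is what "relative" calibration refers to; a common scalar does not alter the direction of the resulting beamforming weights.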
D. PROFILING FRAMEWORK
As part of our flexible platform, we introduce a profiling framework that can accurately measure the execution time of signal processing operations. This allows us to assess PHY layer procedures within the scope of MIMO systems and to explore how different configurations and transmission modes affect their execution time.
The framework is based on OAI's time-stamping mechanism which relies on the Time Stamp Counter (TSC) integrated in Intel GPPs. This way, the latency of each function can be measured in actual clock cycles which are then translated to an absolute time reference after they have been averaged. Note that the TSC increments at a constant rate, meaning that it is driven by a clock with a steady frequency. We introduced a dedicated performance counter for every DSP function in OAI's PHY layer. In this manner, we can get time-stamps before the call of a function and after its execution, then calculate the difference and store it in a performance counter. Since most of the DSP functions are executed several times in the context of a radio frame, our performance counter also keeps track of the total number of function calls and subsequently calculates the mean, minimum and maximum value of a specific function's latency.
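The bookkeeping performed by such a performance counter can be sketched as follows. This is an illustration only: it uses Python's `time.perf_counter_ns()` as a stand-in for reading the TSC, and the class and function names are hypothetical rather than OAI's.

```python
import time


class PerfCounter:
    """Accumulates the call count and min/max/total latency of one function."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.total_ns = 0
        self.min_ns = None
        self.max_ns = 0
        self._t0 = None

    def start(self):
        self._t0 = time.perf_counter_ns()   # stand-in for a TSC read before the call

    def stop(self):
        delta = time.perf_counter_ns() - self._t0   # time-stamp difference
        self.calls += 1
        self.total_ns += delta
        self.min_ns = delta if self.min_ns is None else min(self.min_ns, delta)
        self.max_ns = max(self.max_ns, delta)

    def mean_ns(self):
        return self.total_ns / self.calls if self.calls else 0.0


def profiled(counter, func, *args):
    """Run func(*args) between a start/stop pair of the given counter."""
    counter.start()
    result = func(*args)
    counter.stop()
    return result


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


counter = PerfCounter("dot_product")
for _ in range(100):
    profiled(counter, dot, range(64), range(64))
```

Keeping only the running total, count, minimum and maximum means that the counter itself adds a small, constant overhead per call, regardless of how many times a function executes within a radio frame.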
Since we aim to fully explore physical layer execution time, besides a finer-grain evaluation of individual DSP blocks like signal detection, channel estimation, etc., our framework also offers insights into aggregate operations. For example, we can have per subframe monitoring of the scrambling procedures for a specific user up to aggregate uplink PHY execution time.
E. SIMD ACCELERATION FRAMEWORK
From the era of embedded processors that lacked a floating-point unit, fixed-point arithmetic has been employed to allow for higher quantisation and data throughput [53]. Fixed-point representation relies on integer arithmetic throughout all computations; note, though, that the radix point (i.e., the counterpart of the decimal point in decimal arithmetic) is implied rather than explicitly defined. This means that the developer has to keep track of it when modelling fixed-point operations, as there is no provision for its storage.
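To make the notion of an implied radix point concrete, the sketch below models 16-bit fixed-point values with 15 fractional bits. The Q15 format, the rounding rule and all function names are assumptions chosen for illustration; the text above does not state which scaling OAI uses.

```python
Q = 15                                        # assumed number of fractional bits (Q15)
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate(x):
    """Clamp to the 16-bit signed range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, x))


def to_fixed(value):
    """Quantise a real value in [-1, 1) to a 16-bit integer."""
    return saturate(round(value * (1 << Q)))


def to_float(fixed):
    return fixed / (1 << Q)


def fixed_add(a, b):
    # Operands share the same radix point, so the sum keeps it unchanged.
    return saturate(a + b)


def fixed_mul(a, b):
    # The raw product carries 2*Q fractional bits; the developer must shift it
    # back by Q (here with rounding) to restore the implied radix point.
    return saturate((a * b + (1 << (Q - 1))) >> Q)


half = to_fixed(0.5)
quarter = fixed_mul(half, half)
clipped = fixed_add(to_fixed(0.75), to_fixed(0.75))
```

Nothing in the stored integers records where the radix point lies: forgetting the shift in `fixed_mul` would silently produce a value scaled by $2^{15}$.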
1) ACCELERATED PHY DSP TOOLSET
To accelerate computations, OAI mainly employs "hand-optimized" (i.e., written with low-level intrinsics) 128-bit SIMD operations, with the exception of Fourier transforms, which use 256-bit SIMD. This is complemented by fixed-point arithmetic using 16 bits per I and Q sample, which are interleaved within storage elements. This effectively allows four complex values to be processed in parallel within a single 128-bit register. As an enhancement to OAI's SIMD processing, we introduce an SIMD acceleration framework with the following features: a) it supports AVX2 and AVX512 datapaths, b) it focuses on MIMO operations, c) it is transparent, targeting reusable DSP functions so that all its benefits are applicable regardless of the OAI branch and d) it provides series length checking for robustness. Through the use of this framework, we also show in Section V-A how far software execution can be taken within the limits of current technology.
OAI includes a toolset of DSP functions that are being employed throughout the physical layer. These involve operations on complex time series between vectors as well as between scalars and vectors. Additionally, there is a Fourier transform library written in AVX2 that is mainly employed in the physical uplink shared channel (PUSCH), physical downlink shared channel (PDSCH) and physical random access channel (PRACH) channels.
We develop our accelerated toolset by maintaining the use of intrinsics, as this provides finer control than automatic compiler vectorization [54]. Regarding simple functions (e.g., scalar addition, scalar multiplication), we extended the vector width and the loop stride to support the wider register instruction sets. In these cases, most of the Streaming SIMD Extensions (SSE) intrinsics have an AVX2 and an AVX512 counterpart with a similar set of micro instructions. Additionally, complex multiplication, rotation and dot product functions were rewritten from scratch. OAI's versions revolve around the _mm_madd_epi16 intrinsic (i.e., the pmaddwd instruction) which horizontally multiplies and adds two 128-bit vectors across 16-bit boundaries. Figure 5 illustrates our optimisation in the case of complex vector multiplication via 128-bit SIMD operations, by listing both the pseudocode and the assembler output (g++ 8.3.1 compiler). We note that OAI's version creates 32-bit intermediate results. Thus, two 128-bit vectors are required to hold 4 complex numbers. Our optimised version targets the elimination of horizontal operations. Moreover, it takes into account OAI's fixed-point representation to merge the pmaddwd, psrad, punpckldw and packssdw operations into the pshufb, pmulhrsw and paddsw instructions, which maintain a 16-bit output. As Fig. 5 shows, this generates fewer instructions that also have lower latency [55]. The impact may be small for a single loop execution but, as we will show in Section V-A, it can be significant for a loop that is executed $\frac{N_{\mathrm{FFT}}}{vs} \cdot (N^{\mathrm{BS}}_{\mathrm{ant}})^2 \cdot N^{\mathrm{subframe}}_{\mathrm{syms}}$ times per downlink subframe, as is the case for the multadd_cpx_vector function, which is written in the same manner [56]. Here, $N_{\mathrm{FFT}}$ denotes the number of samples in an OFDM symbol, $vs$ the number of samples in the SIMD vector, $N^{\mathrm{BS}}_{\mathrm{ant}}$ the number of base station antennas and $N^{\mathrm{subframe}}_{\mathrm{syms}}$ the number of OFDM symbols in a subframe. We followed a similar strategy for optimising other functions.
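To give a feel for the magnitude of this loop count, the sketch below evaluates the expression for three register widths. It reads the expression as $(N_{\mathrm{FFT}}/vs) \cdot (N^{\mathrm{BS}}_{\mathrm{ant}})^2 \cdot N^{\mathrm{subframe}}_{\mathrm{syms}}$ and assumes that $vs$ counts complex samples with interleaved 16-bit I and Q components; the configuration (2048-point FFT, 16 antennas, 14 symbols) is a hypothetical example rather than a measured setup.

```python
def complex_lanes(register_bits, bits_per_component=16):
    """Complex samples per SIMD register with interleaved I/Q components."""
    return register_bits // (2 * bits_per_component)


def loop_iterations(n_fft, vs, n_bs_ant, n_syms):
    """(N_FFT / vs) * (N_BS_ant)^2 * N_syms, as read from the text."""
    if n_fft % vs != 0:
        raise ValueError("this sketch assumes that vs divides N_FFT")
    return (n_fft // vs) * n_bs_ant ** 2 * n_syms


# Hypothetical configuration: 2048-point FFT, 16 BS antennas, 14 OFDM symbols.
iterations = {
    name: loop_iterations(2048, complex_lanes(bits), 16, 14)
    for name, bits in (("SSE", 128), ("AVX2", 256), ("AVX512", 512))
}
```

Even with 512-bit registers the loop body runs several hundred thousand times per subframe in this example, which is why removing a few instructions per iteration can add up to a noticeable difference.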
Section V-A presents speedup results in a unitary manner, as well as within the scope of end-to-end PHY operations.
Integrating the toolset into OAI: All extensions are added using preprocessor definitions so that the same function can be executed on all platforms supporting SIMD. To align the allocated buffers according to the vector size (vs) boundary requirements, OAI's allocation functions were rewritten to employ posix_memalign instead of the deprecated memalign function. We note that while SSE requires 16-byte alignment, AVX512 needs buffers to be aligned to 64-byte boundaries [54] .
It should be noted that OAI only provides support when the data series length $N$ is evenly divisible by the SIMD vector width $vs$ (i.e., $vs \mid N$). There is no boundary checking beyond this. While this might be sufficient in the vast majority of cases with SSE, it will lead to segmentation faults for wider vector sizes or, in general, when $vs \nmid N$. To make our framework more robust, we include length checking per $vs$ and per function. Data, up to the point where its length is evenly divisible by $vs$, is processed through a main loop. For the remaining samples, an aligned static array of length $vs$ is initialized. These samples are then copied from the signal input into this array and the vector pointers are reinitialized to point into the array. This guarantees that the same intrinsics as those in the main loop will be executed on a properly aligned buffer.
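The main-loop-plus-remainder strategy can be sketched as follows. The example emulates a SIMD instruction with a function that only accepts exactly $vs$ elements; the names are hypothetical, and memory alignment, which matters for the real intrinsics, is not modelled.

```python
def vector_add(a, b, vs):
    """Stand-in for one SIMD instruction: operates on exactly vs elements."""
    if len(a) != vs or len(b) != vs:
        raise ValueError("a full-width vector is required")
    return [x + y for x, y in zip(a, b)]


def add_series(x, y, vs):
    """Add two equal-length series of any length using only full-width vector ops."""
    n = len(x)
    main = n - (n % vs)                       # largest multiple of vs not exceeding n
    out = []
    for i in range(0, main, vs):              # main loop over full vectors
        out.extend(vector_add(x[i:i + vs], y[i:i + vs], vs))
    rem = n - main
    if rem:
        # Copy the remaining samples into zero-initialised buffers of length vs,
        # then run the very same vector operation on those buffers.
        tail_x = x[main:] + [0] * (vs - rem)
        tail_y = y[main:] + [0] * (vs - rem)
        out.extend(vector_add(tail_x, tail_y, vs)[:rem])   # keep the valid lanes only
    return out


x = list(range(4095))
y = list(range(4095, 0, -1))
result = add_series(x, y, 16)
```

Because the remainder is processed through a padded buffer, no vector operation ever reads or writes past the end of the caller's data, at the cost of one extra copy when $vs \nmid N$.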
2) ACCELERATED ZERO FORCING PRECODING/DETECTION
Zero forcing (ZF) is a well-known method that is widely employed in large and massive MIMO research platforms [13], [50], achieving high spectral efficiency gains when $N^{\mathrm{BS}}_{\mathrm{ant}} \gg N_{\mathrm{layers}}$. As part of our MIMO enhancements, we developed a fully software-based precoding/detection subsystem that can be employed autonomously or as part of OAI's PHY layer. The precoding/detection subsystem consists of a ZF precoder and a ZF detector.
Although ZF is considered, at least algorithmically, to be a simple linear method, its computational and storage complexity are non-trivial and increase polynomially with both $N^{\mathrm{BS}}_{\mathrm{ant}}$ and $N_{\mathrm{layers}}$. This is because ZF mainly consists of complex matrix multiplications and inversions. We developed a ZF precoder/detector supporting both AVX2 and AVX512 intrinsics. The module is highly configurable, as listed in Table 4, where $N_{\mathrm{RE}}$ is the number of active resource elements and $N_{\mathrm{th}}$ the number of worker threads, as described below. In terms of functionality, detection and precoding are performed on a subcarrier basis, meaning that distinct channel matrices are assumed for every subcarrier. Although $N^{\mathrm{subframe}}_{\mathrm{syms}}$ is configurable, we assume the inversion of the channel matrix to be performed per subcarrier, but only once per group of OFDM symbols within a TTI. Input and output data of the precoder/detector are represented using 16-bit fixed-point arithmetic, while internal calculations are performed in single-precision floating-point arithmetic. Just like our OFDM framework presented next, our soft precoder/detector additionally features multi-core capabilities based on the OpenMP library [57]. To completely avoid race conditions, each thread is assigned a workload corresponding to a separate group of subcarriers and is mapped to a specific socket/core. Section V-A presents performance results based on our ZF precoder/detector.
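The structure described above (one inversion per subcarrier, reused across the OFDM symbols of a TTI, with subcarriers split into disjoint groups per thread) can be sketched in plain Python. This is an illustrative model using the textbook ZF detector $W = (H^H H)^{-1} H^H$ in double precision; it is not the AVX2/AVX512 implementation, it omits the fixed-point input/output conversion, and all names and channel values are hypothetical.

```python
def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))] for c in range(len(a[0]))]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def invert(m):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [list(row) + [1 + 0j if i == j else 0j for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def zf_weights(h):
    """W = (H^H H)^-1 H^H for an N_ant x N_layers channel matrix H."""
    hh = hermitian(h)
    return matmul(invert(matmul(hh, h)), hh)


def zf_detect(h_per_subcarrier, y_per_subcarrier):
    """One inversion per subcarrier, reused for every OFDM symbol of that subcarrier."""
    out = []
    for h, symbols in zip(h_per_subcarrier, y_per_subcarrier):
        w = zf_weights(h)
        out.append([[row[0] for row in matmul(w, [[v] for v in y])] for y in symbols])
    return out


def split_subcarriers(n_subcarriers, n_threads):
    """Contiguous, non-overlapping subcarrier ranges, one per worker thread."""
    base, extra = divmod(n_subcarriers, n_threads)
    bounds, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


# 4 x 2 toy channel (hypothetical values) and one noise-free received vector.
H = [[1 + 1j, 0.5j], [0.2, 1 - 1j], [-1j, 0.3 + 0.1j], [0.7, -0.4 + 0.2j]]
x = [1 + 1j, -1 + 1j]
y = [sum(H[i][k] * x[k] for k in range(2)) for i in range(4)]
x_hat = zf_detect([H], [[y]])[0][0]
groups = split_subcarriers(1200, 4)
```

Since every thread owns a disjoint range of subcarriers and each subcarrier's result depends only on its own channel matrix, no two threads write to the same output, which is how race conditions are avoided without locking.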
3) OFDM SUBFRAME PROCEDURES: SIMD MULTI-CORE FRAMEWORK AND PROGRAMMABLE FPGA ACCELERATOR
OFDM processing is one of the most suitable candidates for further optimization and offloading onto special-purpose units, due to its fixed complexity and the increased RF channel count in large MIMO systems. In the UL direction, OFDM processing consists of a DFT, a shift of the zero-frequency component to the center of the spectrum, magnitude normalization and cyclic prefix (CP) removal. The DL direction requires an inverse zero-frequency shift, an inverse DFT, magnitude normalization and CP addition [58]. Due to the removal of the guard band and the cyclic prefix, OFDM subframe procedures also constitute an effective method for reducing the fronthaul raw data rate [59]. The DFT itself has a complexity of $O(N_{\mathrm{DFT}}^2)$, which the well-known fast Fourier transform (FFT) can decrease to $O(N_{\mathrm{FFT}} \cdot \log(N_{\mathrm{FFT}}))$ [60].
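The UL steps listed above can be sketched for a single OFDM symbol as follows. The example uses a direct $O(N^2)$ DFT for clarity, applies the steps in the order CP removal, DFT, zero-frequency shift and normalization, and assumes a $1/\sqrt{N}$ scaling; the actual scaling and ordering used in OAI or SWORD may differ, and all names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT; an FFT computes the same result in O(N log N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fftshift(spectrum):
    """Move the zero-frequency bin to the centre of the spectrum."""
    half = len(spectrum) // 2
    return spectrum[half:] + spectrum[:half]


def ul_ofdm_symbol(samples, n_fft, cp_len):
    """CP removal -> DFT -> zero-frequency shift -> 1/sqrt(N) normalisation."""
    body = samples[cp_len:cp_len + n_fft]
    scale = 1.0 / math.sqrt(n_fft)            # assumed normalisation factor
    return [v * scale for v in fftshift(dft(body))]


def operation_counts(n):
    """Order-of-magnitude operation counts: direct DFT versus radix-2 FFT."""
    return n * n, n * int(math.log2(n))


# Toy symbol: a single tone on bin 1 of an 8-point grid with a 2-sample CP.
N_FFT, CP = 8, 2
tone = [cmath.exp(2j * cmath.pi * t / N_FFT) for t in range(N_FFT)]
rx = tone[-CP:] + tone                        # prepend the cyclic prefix
freq = ul_ofdm_symbol(rx, N_FFT, CP)
dft_ops, fft_ops = operation_counts(4096)
```

For a 4096-point transform, the two counts differ by more than two orders of magnitude, and this saving is multiplied by the number of antenna ports and OFDM symbols processed per subframe.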
a: OFDM SOFTWARE ACCELERATION/ ASSESSMENT FRAMEWORK
While libraries that accelerate DFT processing do exist: a) Fastest Fourier Transform in the West (FFTW) [61] , b) Intel's Math Kernel Library (MKL) [62] and c) Intel's Integrated Performance Primitives (IPP) [63] , to the best of our knowledge, there has been no aggregate evaluation of them in a 3GPP context for SDRs. Moreover, the OAI alliance has only just recently started to investigate the matter [64] . Therefore, our OFDM framework fulfills the need to assess optimized performance for a large number of N BS ant in a holistic manner.
Our multi-core OFDM framework allows the runtime definition of the number of physical resource blocks N RB , slots to schedule, N BS ant , N th , the value of µ (for subcarrier spacing [58] ) as well as the number of repetitions for averaging. Multi-threading/multi-core capabilities are implemented via the OpenMP library. For evaluation purposes, any of the above three libraries and OAI's AVX2 optimized functions can be chosen at compile-time. We present aggregated evaluation results of the libraries and our multi-core framework in Section V-A.
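The runtime parameters above are tied together by the standard NR numerology relations: a subcarrier spacing of $15 \cdot 2^{\mu}$ kHz, $2^{\mu}$ slots per 1 ms subframe and 12 subcarriers per physical resource block. The helper functions below are a hypothetical illustration of these relations and are not part of the framework's interface.

```python
def subcarrier_spacing_hz(mu):
    """NR numerology: 15 kHz * 2^mu."""
    return 15_000 * (1 << mu)


def slots_per_subframe(mu):
    """A 1 ms subframe holds 2^mu slots."""
    return 1 << mu


def sample_rate_hz(n_fft, mu):
    """Baseband sample rate implied by the FFT size and the subcarrier spacing."""
    return n_fft * subcarrier_spacing_hz(mu)


def occupied_bandwidth_hz(n_rb, mu):
    """Bandwidth spanned by n_rb resource blocks of 12 subcarriers each."""
    return 12 * n_rb * subcarrier_spacing_hz(mu)


rate_20mhz_15khz = sample_rate_hz(2048, 0)      # 30.72 MHz
rate_100mhz_30khz = sample_rate_hz(4096, 1)     # 122.88 MHz
occupied_273rb = occupied_bandwidth_hz(273, 1)  # 98.28 MHz
```

For instance, a 4096-point FFT at 30 kHz spacing implies a 122.88 MHz sample rate per antenna port, which indicates the amount of data the multi-core framework has to process for each of the $N^{\mathrm{BS}}_{\mathrm{ant}}$ ports.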
b: PROGRAMMABLE FPGA ACCELERATOR FOR OFDM SUBFRAME PROCEDURES
As a case study, a 16-channel OFDM offloading architecture was developed and integrated with the PCIe subsystem onto a Bittware XUPPL4 accelerator. The role of the OFDM FPGA architecture is to offload relevant functionality from the x86_64 GPP cores of the baseband unit. The architecture is based on Xilinx's Radix-4 DFT core, features two distinct 16 channel paths for UL and DL (designed as a 2 × 8-channel architecture) and can thus support both TDD and frequency-division duplexing (FDD). It employs fixed-point arithmetic, implements the same functionality as our OFDM subframe procedures framework, achieves a maximum operating frequency of 491.52 MHz on the XUPPL4 and can thus support up to 100 MHz of bandwidth at 30 kHz spacing. The 16-channel architecture requires 50381 lookup tables (LUTs), 78337 registers, 204 Block RAMs, 288 DSP slices and 10081 configurable logic blocks (CLBs). The module is flexible and parameterizable so that it can be also employed as part of the FPGA in a 2 × 2 SDR transceiver slice by reducing the channel count. While the architecture cannot directly support FFT sizes of 1536 samples, to the best of our knowledge, it is the sole flexible non-commercial solution that can support all features of our software framework. In Section V we compare offloading results against our multi-core OFDM processing framework.
4) CHANNEL DECODER
As we show in Section V-A following our extensive profiling measurements, the channel decoding function is the main computational bottleneck on the base station. OAI's LTE-derived air-interface includes a turbo decoding engine optimized around 128-bit SIMD instructions, which employs 8-bit LLRs internally. While OAI's develop-nr branch does include an experimental AVX2-optimized LDPC decoder [65] , the rest of the stack is far from complete to serve as the foundation for full stack framework such as ours. Despite this being one of the main reasons that the LTE-derived stack is the basis of our platform, we chose not to optimize the turbo decoder as the complexity involved would provide marginal benefits for an end-to-end platform with future mobile standards in mind. Still, in Section V-A we profile OAI's LDPC decoder in the context of computational complexity, presenting results for supported 3GPP modes involving up to 4 layers and an upper bound of 8 iterations (i.e., same as in OAI's Turbo Decoder). Additionally, we introduce AVX512 support in OAI's LDPC decoder by expanding on the implemented AVX2 strategy for bit node and check node processing. We also replace expensive permutation functions with packed bit extraction and insertion. Section V-A discusses some preliminary profiling results from our exploratory AVX512 support.
F. SWORD'S MODULARITY IN PRACTICE: INCORPORATING ADVANCED PHY ALGORITHMS FOR LARGE MIMO SYSTEMS
The combination of a modular software framework and SWORD's unique NRT mode allows the gains of novel signal processing algorithms to be rapidly quantified. Since SWORD relaxes 3GPP's strict real-time deadlines, it allows even computationally demanding MIMO algorithms to be evaluated without requiring HW/SW optimization expertise.
One of the most interesting signal processing approaches is the recently proposed massively-parallelizable framework for non-linear detection in large MIMO systems [20] - [22] . The potential gains in terms of throughput and connectivity of non-linear detectors, e.g., sphere decoder (SD) [66] , are well-documented in the literature. Yet, the latency and complexity requirements of such approaches prevent them from being adopted by practical multi-antenna deployments. Instead, current MIMO testbeds exclusively opt for simple but suboptimal linear detection techniques-e.g., matched filter (MF), ZF [11] , [13] , [16] , [67] -which can leave significant unexploited MIMO capacity. The massively-parallelizable approach of [22] can for the first time potentially bring practical near-optimal MIMO detection into reach. Still, to merit the high expenditure required for developing such a solution in a real-time system, an accurate assessment of its gains in a standard-compliant environment needs first to be conducted. To that end, the detection process of [20] was integrated into the SWORD platform as an external function originally written in Matlab. In the rest of this section we describe the integration procedure of the aforementioned non-linear detector serving as an exemplar for the modular research methodologies that SWORD facilitates. The first step, common when targeting the integration of any signal processing approach, is to specify an interface with SWORD's PHY layer. For this purpose we define an input structure that will be handed over to the external function and an output structure that is returned from SWORD's PHY. 
The input structure consists of a) a two-dimensional 16-bit integer (int16) array containing the channel estimates for all base station antennas and occupied subcarriers ($N^{\mathrm{BS}}_{\mathrm{ant}} \cdot 12 \cdot N_{\mathrm{RE}}$), b) a two-dimensional int16 array containing all received samples within a subframe ($N^{\mathrm{BS}}_{\mathrm{ant}} \cdot 12 \cdot N_{\mathrm{RE}} \cdot N^{\mathrm{subframe}}_{\mathrm{syms}}$) and finally, c) a structure containing general information on the MIMO configuration (i.e., $N_{\mathrm{layers}}$, $N^{\mathrm{BS}}_{\mathrm{ant}}$, $N^{\mathrm{subframe}}_{\mathrm{syms}}$ and the QAM modulation order). These inputs should be common to MIMO detectors, which simplifies adapting the interface to other algorithms. Specifically for the massively parallelizable detector of [20], the input structure contains two additional variables: one that indicates the number of evaluated vector solutions, and one that indicates an LLR threshold. The output structure consists only of a one-dimensional 8-bit integer (int8) array containing the calculated LLRs ($N^{\mathrm{BS}}_{\mathrm{ant}} \cdot 12 \cdot N_{\mathrm{RE}} \cdot N^{\mathrm{subframe}}_{\mathrm{syms}} \cdot Q_m \cdot 1$), where $Q_m$ refers to the bits per symbol.
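The buffer sizes implied by this interface can be written down as a small data structure. The sketch below simply evaluates the dimension expressions quoted in the text; the class and field names are hypothetical, the modulation is stored directly as bits per symbol ($Q_m$), and the configuration values are an arbitrary example.

```python
from dataclasses import dataclass


@dataclass
class MimoConfig:
    """General MIMO configuration handed to an external detector (illustrative)."""
    n_layers: int
    n_bs_ant: int
    n_syms: int      # OFDM symbols per subframe
    q_m: int         # bits per symbol, e.g. 4 for 16-QAM
    n_re: int        # N_RE as used in the dimension expressions of the text


def channel_estimate_len(cfg):
    """Number of int16 channel-estimate entries: N_BS_ant * 12 * N_RE."""
    return cfg.n_bs_ant * 12 * cfg.n_re


def rx_sample_len(cfg):
    """Number of int16 received-sample entries per subframe."""
    return channel_estimate_len(cfg) * cfg.n_syms


def llr_len(cfg):
    """Number of int8 LLR entries in the output array, as stated in the text."""
    return rx_sample_len(cfg) * cfg.q_m


cfg = MimoConfig(n_layers=4, n_bs_ant=16, n_syms=14, q_m=4, n_re=25)
sizes = (channel_estimate_len(cfg), rx_sample_len(cfg), llr_len(cfg))
```

Fixing the sizes up front in this way lets a wrapper allocate the input and output arrays once and check them before each call into externally generated code.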
To generate the code required to implement the non-linear detector in SWORD, we utilized Matlab's coder toolbox for porting the detector into C source code. The coder translates the top level function of the detector to a MEX file (compiled code designed to run in Matlab). Simple wrapper functions were employed to accommodate for any data type conversions between SWORD and those specifically required by the Matlab coder. Detailed information on the capabilities of Matlab coder can be found in [68] .
V. EXPERIMENTAL RESULTS
This section is divided into two major subsections. First, we present our profiling results with and without our acceleration framework. Results are presented first in the form of unitary functions and subsequently, as measured when integrated into our platform. The second major subsection presents indicative initial OTA measurements that showcase the versatility of SWORD's RT-OTA and NRT-OTA modes.
A. PROFILING AND OFFLOADING RESULTS
In this section, we present results from our extensive profiling campaign, first for unitary functions and then within the scope of the OAI stack. To maintain and provide a broad perspective of performance on modern x86_64 systems, we profile execution on GPPs that support either AVX2 only, or AVX512 and AVX2. For representing the state-of-the-art, we choose an Intel Core i9-7980XE with 64GB RAM based on Intel's Skylake architecture, supporting AVX512 with dual fused multiply-add units per core. We employ a Xeon E5-1620 v3 with 64GB RAM (i.e., based on Intel's Haswell architecture) to profile code optimised only for AVX2. Both systems are running CentOS 7 with GNU gcc/g++ compiler version 8.3.1 and linux kernel 5.3. We chose the most recent kernel for better hardware support, instead of opting for a real-time kernel. Still, for the purpose of achieving close to deterministic performance, we configure all CPU cores to operate at their base frequency as follows: a) The latency induced by switching between idle and boosted performance states was minimized by disabling all relevant functionality in the Basic Input Output Systems (BIOSes), b) Logical cores were switched off and c) The operating system governor was set to performance mode [69] via the tuned-adm tool [70] .
1) PROFILING METHODOLOGY
a: DSP TOOLSET
In order to profile our toolset optimizations, we created unitary C testbenches for which the fixed-point input range, data size $N$ and number of iterations are configurable at runtime. This allows both functional correctness and performance to be tested within the same run. All functions were executed for series of $N = 4096$ randomized complex samples (i.e., $a + jb$) with $a$, $b$ uniformly distributed in $[-1, 1)$, and execution time was averaged over $10^6$ iterations. To assess the overhead of our length checking routines, we also considered tests with 4095 samples.
b: ZERO FORCING PRECODING/DETECTION
To assess the performance of our optimized ZF precoder/detector, we compared our AVX2- and AVX512-optimized execution with a non-vectorized C model. We [71] . We then evaluate multi-core execution performance of the fastest datapath (i.e., AVX512) for the most challenging (i.e., 16 × 4) MIMO configuration.
c: OFDM SUBFRAME-BASED PROCEDURES
To assess AVX2- and AVX512-optimized OFDM processing performance, we profiled execution of all libraries on the E5-1620 v3 and the i9-7980XE, both set to operate at 2.6 GHz. OFDM initialization is performed once at the start of each execution and is thus excluded from benchmarking. The term Tx / Rx subframe is employed to denote that all symbols in the corresponding subframe are respectively subjected to DL / UL only OFDM signal processing. We showcase results for $N_{\mathrm{FFT}} \in \{512, 1024, 2048, 4096\}$. These correspond to $N_{\mathrm{RE}} \in \{300, 624, 1272, 3240\}$ at 5, 10, 20 and 50 MHz bandwidth for 15 kHz SCS, and to $N_{\mathrm{RE}} \in \{288, 612, 1272, 3276\}$ at 10, 20, 40 and 100 MHz bandwidth for 30 kHz SCS [71]. To quantify multi-core performance, we evaluate the FFT library that proved to be the fastest in the above tests, applying a fixed mapping of threads to cores to antenna ports. We set the latter to 16.

FIGURE 6. Accelerated DSP toolset speedup v. OAI's SSE baseline on the E5-1620 v3 (Haswell) and the i9-7980XE (Skylake).
d: LDPC DECODER
OAI already includes a testing framework for its AVX2-optimized LDPC encoding and decoding functions.² Both Base Graphs (i.e., 1 and 2) and all lifting sizes from the latest 3GPP standard are supported, with a maximum codeword size of 8448 bits [65]. While the testbed itself is single-core only, transport block sizes can be split into a maximum of 8 codeblock segments and potentially run on separate cores. We evaluate performance for $N_{\mathrm{layers}} \in \{1, 2, 4\}$, using modulation and coding schemes 9, 10 and 27 [58] (corresponding to QPSK, 16-QAM and 64-QAM modulation, respectively). Presented results are indicative only, owing to the limited number of code rates supported [72]. We focus on the decoder, profiling performance for 106, 52 and 25 resource blocks. We note that our profiling focuses only on the runtime complexity of the decoder. The latter executes a maximum of 2, 4, 6 and 8 iterations ($N^{\mathrm{max}}_{\mathrm{iters}}$).
e: OAI PHYSICAL LAYER STACK
Physical layer profiling within the OAI stack was designed to explore performance and to evaluate runtime scalability under different MIMO configurations. Another target was to assess the runtime impact of our SIMD acceleration framework in the context of a full SDR-based base station protocol stack. To that end, besides our DSP toolset, we also integrate our optimized ZF precoder/detector and the FFT routines of Intel's MKL (i.e., the library achieving the highest performance as shown in Section V-A2.b). We will now describe our PHY profiling methodology. Experiments focus on the base station side (i.e., as it involves the most computationally demanding operations) using OAI's emulation mode. This facilitates profiling as it allows execution on a single host. It should be noted that while this mode emulates the radio equipment and the channel, the rest of the protocol stack runs as it would on a real-time system, without any shortcuts taken (i.e., normal OTA mode IV). Furthermore, the emulated portion of the system, such as the radio equipment and the channel, is handled by separate, independent processes. The multi-core nature of modern GPPs such as those employed in our platform ensures that these tasks are sufficiently isolated and hence that our extracted results are accurate. Presented physical layer profiling results involve single-threaded execution, with any extra workers disabled in the configuration files. While other configurations were also tested, the impact of multiple workers was marginal, as we discuss in Section VI. The base station was set to operate using TDD LTE subframe configuration 1, at frequency band 38 and in monolithic mode. This means that both frontend and baseband processing take place exclusively on the base station.

² We test on code cloned from the develop-nr branch as of 16/10/2019.
All experiments follow the procedure described below. We initially configure the packet-based interfaces of the base station and the UEs to form internal loopbacks, so that everything executes on a single host. We then spawn separate processes/modules for the base station and all UEs (depending on the MIMO mode). After all processes have been initialised and the attachment between the base station and the UE(s) has been established, we generate bidirectional (i.e., uplink and downlink) data traffic via the iPerf tool [73]. Once the traffic has stabilized, we gather profiling results over the course of 60 minutes for each configuration.
2) PROFILING RESULTS
a: DSP TOOLSET
Profiling results for unitary functions are depicted in Fig. 6, where the vector size is N = 4096. Optimizations on functions operating over 32 and 64-bit boundaries (e.g., add_cpx_vector32) present a moderate average speedup of 1.64× across all SIMD generations. Optimizations on functions operating on 16-bit boundaries achieve the expected performance on AVX2 and AVX512. On the Skylake architecture, speedup can reach superlinear values in some cases (e.g., add_cpx_vector) due to the increased amount of cache present on these cores (i.e., 1 MB per core v. 256 KB per core on the E5-1620 v3). As expected, the Haswell-only architecture achieves slightly lower speedups compared to code compiled with AVX2 flags but running on a Skylake GPP. The dot product achieves a sublinear 1.8× speedup on average for AVX2 and 2.82× on AVX512, due to the additional reduction operations it requires. Vector multiplication functions that were rewritten from scratch according to Section IV-E were on average accelerated by 1.7× on SSE, 2.9× on AVX2 and 4.5× on AVX512. In particular, the complex rotation function, which was rewritten to operate on contiguous memory, displays an average speedup of 3.3× on SSE, 6.5× on AVX2 and 11.92× on AVX512, compared to OAI's initial SSE implementation. The penalty for the cases where N = 4095 is negligible, apart from complex rotation with AVX2 instructions (6.2× v. 6.8×) and vector addition/subtraction with AVX512 instructions (1.0× v. 3.9×). The latter has a limited impact on the UE side only.
b: ZF PRECODER/DETECTOR: SIMD SINGLE- AND MULTI-CORE PERFORMANCE
Figures 7a and 7b present the single-core performance of the AVX2- and AVX512-optimised ZF detector/precoder for 15 and 30 kHz subcarrier spacing, respectively. For the purpose of comparison, the performance results of a non-vectorized C version (Plain C) have also been included. Both figures highlight the significant potential of AVX512 for DSP in large MIMO systems.
Compared to non-vectorized C code, AVX512-optimized results indicate more than an order of magnitude performance increase in all tested MIMO configurations and bandwidth modes. Our measurements indicate an average detection and precoding speedup of 12.44× and 10.56×, corresponding to an average runtime reduction of 91.6% and 89.7%, respectively. Compared to AVX2, AVX512 achieves a detection speedup of 1.84× and a precoding speedup of 1.45×, translating to 45.7% and 30.9% processing time gains. When compared against non-vectorized code, AVX2 speeds up detection and precoding by 6.81× and 7.21× on average across all cases.
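The ZF results quote both speedup factors and percentage runtime reductions. For a single measurement the two are tied by reduction = 1 - 1/speedup, which the minimal Python sketch below illustrates. The function names are hypothetical and not part of SWORD or OAI. Note that averages taken separately over many configurations need not satisfy the identity exactly, which may explain small differences from the averaged figures reported in the text.

```python
def runtime_reduction(speedup: float) -> float:
    """Fraction of runtime removed by a speedup factor (t_new = t_old / speedup)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Inverse mapping: speedup implied by a fractional runtime reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # AVX512 v. AVX2 average speedups quoted in the text.
    for label, s in [("detection", 1.84), ("precoding", 1.45)]:
        print(f"{label}: {s:.2f}x -> {100 * runtime_reduction(s):.1f}% less runtime")
```

For example, the 1.84× detection speedup of AVX512 over AVX2 maps to roughly a 45.7% reduction in processing time.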
It should be highlighted that by utilizing AVX512, we can reduce execution time below the duration of a single subframe (i.e., 1 ms) for both precoding and detection in almost every 15 kHz SCS case, except for 16 × 4, 16 × 2 and 16 × 1 MIMO at 50 MHz (N_RB = 270). In the more demanding set of 30 kHz SCS measurements, the detection and precoding computation time remains well below 1 ms under all bandwidth modes for up to 8 × 1 MIMO. When considering N_RB ∈ {24, 51}, we can also reach detection/precoding execution times below 1 ms in larger MIMO configurations. For N_RB = 273, single-core execution well exceeds the 1 ms duration under all MIMO configurations for which N_BS_ant = 16. Figures 7c and 7d showcase how AVX512-optimized code combined with multi-core execution can further accelerate ZF procedures. As depicted in Figure 7c, for 15 kHz SCS and N_RB = 270, single-core execution exceeds the 1 ms deadline; at least two cores are needed in order to keep runtime below 1 ms under all bandwidth modes. Similarly, for the case of 30 kHz SCS (Figure 7d) at least 4 cores are necessary. Among all bandwidth modes and considering both 15 and 30 kHz SCS, the 2-, 4- and 8-core executions present average respective speedups of 1.93×, 3.59× and 6.54×. Finally, it should also be noted that for N_RB > 25 multi-core execution exhibits a close to linear speedup. For example, in the case of N_RB = 273 speedup respectively reaches 1.97×, 4.02× and 7.94× for 2, 4 and 8 cores. This can be attributed to the fact that each core is assigned completely independent detection/precoding workloads. Thus, our measurements clearly illustrate the significance and benefit of AVX512 and multi-core execution for software-based detection/precoding.
c: OFDM SUBFRAME-BASED PROCEDURES: SIMD SINGLE CORE PERFORMANCE
Figure 8 shows OFDM computation time considering both UL/DL slots for N_BS_ant = 1.
Results show that for all libraries, 30 kHz SCS approximately doubles processing time for every DFT size, regardless of the instruction set architecture (ISA). As expected, AVX512 outperforms OAI's AVX2-only DFT rendition. When considering only AVX2 instructions, we notice that OAI is more efficient than FFTW in all transmission bandwidth modes and negligibly better than IPP in all DFT sizes except for the case of N_FFT = 2048. Finally, Intel's MKL achieves the best performance in all framework testing scenarios and configurations. We note that for N_FFT = 4096, i.e., the most computationally demanding case, the AVX2 version of FFTW surpasses the duration of the 30 kHz SCS subframe (i.e., 500 µs) by up to roughly 110 µs. Our evaluation clearly shows that libraries utilizing SIMD instructions running on GPPs can be considered a promising candidate for accelerating OFDM subframe execution for 15 and 30 kHz spacing. Furthermore, Intel's AVX512 MKL provides the highest performance.
d: OFDM SUBFRAME PROCEDURES: SIMD MULTI-CORE AND PROGRAMMABLE ACCELERATOR PERFORMANCE
Fig. 9 shows the results for multi-core execution on the i9-7980XE (averaged over UL and DL), exhibiting a sub-linear behavior in all cases with a maximum speedup of 4.4× (for N_RB = 273 with 16 cores). Core synchronisation overhead is significant and hence OFDM multi-core execution with OpenMP is beneficial only for more than 3 cores and more than 100 resource blocks. FPGA speedup is more modest, up to a maximum of 2.5× (for N_RB = 273). Transfer overheads account for more than 95% of the total time, especially for N_RB ≤ 52. Thus, the AVX512-optimized multi-core framework can potentially surpass the performance of the FPGA-offloaded module; to achieve this, though, at least 4 dedicated i9-7980XE cores are necessary, which is a significant amount.
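One way to put a number on the sub-linear scaling reported above is to compute the parallel efficiency and the Karp-Flatt metric from a measured speedup. This is a standard analysis tool brought in here for illustration only; it is not the method used in this work, and the function names are hypothetical. The sketch uses the 4.4× speedup on 16 cores quoted for N_RB = 273.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Speedup achieved per core; 1.0 corresponds to perfectly linear scaling."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


def karp_flatt(speedup: float, cores: int) -> float:
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    if cores < 2:
        raise ValueError("the metric is defined for two or more cores")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    s, p = 4.4, 16  # maximum OFDM multi-core speedup quoted in the text
    print(f"efficiency: {parallel_efficiency(s, p):.3f}")
    print(f"Karp-Flatt serial fraction: {karp_flatt(s, p):.3f}")
```

Under these assumptions the efficiency is about 0.28 and the implied non-parallel share (including synchronisation overhead) is about 0.18, which is consistent with the observation that core synchronisation overhead is significant.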
Still, we note that core utilization corresponds to averaged execution in a single direction (i.e., either UL or DL). Bi-directional execution as in FDD would therefore require doubling the cores, whereas the FPGA already has both UL/DL modules instantiated (Section IV-E). To compare the performance of our multi-core OFDM framework with our FPGA-offloaded OFDM architecture we took into account the FPGA→host and host→FPGA transfer times considering blocks of 14 symbols, as well as the time to process the whole subframe on our Radix 4-based multi-channel architecture. This is indicative of the worst-case performance expected of the FPGA-offloaded module. We note that both cases (i.e., multi-core and FPGA) would require the x86_64 subframe to be transferred to/from the radio front-end. Presented multi-core results correspond to the most optimistic scenario (no transfers to/from radios are being considered) while FPGA-offloaded measurements assume bidirectional transfers. OFDM subframe procedures are normally tightly coupled to the radios and thus only one-way transfers would be applicable in the FPGA case. We also note that the presence of distinct modules on the FPGA allows for overlapping between transfer and computation and can thus shorten the total time. The exact scheduling and interrupt/polling scheme on an integrated stack depends on a multitude of system-wide parameters and supported features. This is an SDR-specific overhead that is left for future work.
e: LDPC DECODER PROFILING RESULTS
Figure 10 shows the average, single-segment execution time of OAI's AVX2-optimized LDPC decoder against N_max_iters. We denote results via N_layers_N_RB_MCS. Decoding latency exhibits a sublinear increase with N_max_iters. The most demanding case involves 4 layers at MCS 10, with 52 resource blocks (or 2 layers at MCS 10, with 106 resource blocks). In these cases, two iterations require approximately 171 µs and execution can exceed 580 µs for N_max_iters = 8.
We note that these results refer to one of the six required segments. Next is the case of 1 layer with 52 RBs (and 2 layers with 25 RBs) at MCS 10, with 1/3 code rate (for one of the two segments). Their decoding latency ranges from 149 to 485 µs and from 145 to 479 µs respectively (at N_max_iters = 2 and 8). Decoding a single segment for a single layer with MCS 9 and 106 RBs requires approximately the same latency as two layers with 52 RBs, i.e., approximately 47 µs per iteration. Next is the 4-layer case with 25 RBs using MCS 9, requiring 91 to 175 µs (N_max_iters = 2 and 8 respectively). These are closely followed by the cases with N_layers = 2, N_RB = 25 and N_layers = 1, N_RB = 52 at MCS 9, requiring between 80 and 249 µs (one out of two segments displayed). The remaining scenarios are those requiring the lowest latency, between 57 µs and 168 µs. Preliminary AVX512 optimization shows a moderate reduction in latency that does not exceed 35 µs in the case of 2 layers with N_RB = 106 at MCS 10 and N_max_iters = 8. This indicates that OAI's LDPC may need to be redesigned from the bottom up to further benefit from AVX512. Moreover, the high overall decoding latency hints that FPGA-assisted offloading may be beneficial if not mandatory, especially for 8 iterations and 16-QAM or denser modulation schemes.
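To see why the decoding latency above points towards offloading, the per-segment figures can be turned into a rough budget. The Python sketch below takes the most demanding case (about 171 µs at 2 iterations, about 580 µs at 8 iterations, six segments). It assumes, purely for illustration, that all segments take the same time and that the whole 1 ms subframe is available to the decoder, which is optimistic. The helper names are hypothetical.

```python
import math


def per_iteration_cost_us(t_low_us, iters_low, t_high_us, iters_high):
    """Average marginal latency per extra decoder iteration between two points."""
    return (t_high_us - t_low_us) / (iters_high - iters_low)


def cores_for_deadline(segment_latency_us, num_segments, deadline_us):
    """Cores needed when whole segments are decoded in parallel within a deadline.

    Returns None if even a single segment does not fit within the deadline.
    """
    segments_per_core = int(deadline_us // segment_latency_us)
    if segments_per_core == 0:
        return None
    return math.ceil(num_segments / segments_per_core)


if __name__ == "__main__":
    print("marginal cost per iteration (us):", per_iteration_cost_us(171, 2, 580, 8))
    print("sequential decode of 6 segments at 8 iterations (us):", 6 * 580)
    print("cores needed within 1 ms at 8 iterations:", cores_for_deadline(580, 6, 1000))
    print("cores needed within 1 ms at 2 iterations:", cores_for_deadline(171, 6, 1000))
```

Even under these generous assumptions, decoding six segments sequentially at 8 iterations takes about 3.5 ms, several times the subframe duration.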
3) OAI PHY INTEGRATED RESULTS
Profiling results presented in Figure 11 refer to the total Rx and Tx baseband execution on the Xeon E5-1620 v3 (AVX2) and the i9-7980XE (AVX512). These procedures correspond to the PHY operations in the uplink and downlink direction and are scheduled on a subframe basis. Figure 11 presents distinct PUSCH- and PDSCH-related DSP operations to assess their execution time compared to the Rx and Tx total. We also note that "Mapping" refers to the sum of the layer mapping plus the QAM modulation procedures. Figures 11a and 11b show physical layer profiling results on the Xeon E5-1620 v3, i.e., a system that supports up to AVX2 instructions. Optimised measurements are denoted via the "opt." keyword. Reference measurements used for comparison correspond to the same setup when OAI's default 128-bit SIMD DSP toolset (master branch v.1.0.3) is employed instead (still with our multi-user MIMO enhancements). We also note that all measurements in Fig. 11a and 11b (i.e., both optimized and reference) involve our ZF precoder/detector (as OAI included no advanced MIMO modes out-of-the-box). In this case, the reference results integrate a non-vectorized C version of our precoder/detector.
As expected, Rx procedures are significantly more complex than Tx procedures. This is mainly attributed to the channel estimation, MIMO detection and channel decoding operations, the latter being approximately 6× more complex compared to channel encoding. The functions that exploit our AVX2-optimized toolset the most are the DMRS channel estimation and the beam-weights application. In the case of 16 × 4 MIMO, these exhibit a runtime reduction of 24.7% and 49.9%, corresponding to an absolute reduction of 294 and 311 µs respectively. Furthermore, our optimisations visibly affect the remaining physical channel operations of the Rx subframe procedures, decreasing their execution time by 68.8% up to 78.2% across all MIMO modes. This translates to up to 502 µs savings in the most demanding MIMO configuration. Despite the merit of our optimized code, notice that even in the 4 × 4 MIMO case execution time exceeds the 1 ms deadline by 657 µs and 48 µs (Rx and Tx processing, respectively). It is also worth noting that the only scenarios for which both the uplink and downlink PHY runtime stays below 1 ms are the 2 × 2 and 4 × 2 MIMO configurations.
Similarly to the AVX2 results, the DMRS channel estimation, the beam-weights application and the "Remaining" Rx procedures exhibit the highest acceleration. Our AVX512-optimised code reduces DMRS channel estimation runtime by 25% on average for N_layers = 4, while for N_layers = 2 runtime reduction ranges between 28.01% and 42.59%. In absolute figures, this reduces execution time by 186 µs for 16 × 4 MIMO. The beam-weights application function is most prominently accelerated in the computationally complex cases of 8 × 2 and higher-order MIMO. Corresponding speedup factors range from 2.27× up to 2.70×, leading to reductions of 264 µs and 271 µs in the cases of 16 × 4 and 16 × 2 respectively. The AVX512-based toolset contributes to lowering the total respective Rx and Tx execution times by 2160 µs and 805 µs for 16 × 4 MIMO.
Figure 11d illustrates that apart from the 16 × 4 MIMO scenario, all Tx procedures for all configurations tested exhibit below 1 ms of total execution time. Regarding the Rx direction, our AVX512 optimizations allow execution time below 1 ms for all cases for which N_layers = 2, excluding 16 × 2 MIMO. We note here that while exceeding the 1 ms barrier guarantees that execution will not be real-time, the opposite is not always true, i.e., baseband execution time below 1 ms does not guarantee real-time operation. Many dependencies exist, e.g., on front-end operations, radio and over-the-air latency, as well as the top-level threading architecture, all of which are affected by numerous parameters; a generalization and quantification of all those is well beyond the scope of this work.
B. OVER THE AIR (OTA) TEST RESULTS
This section presents the results and insights obtained by our initial indicative OTA measurements, as an application of SWORD's modes. The test platform was set in TDD mode at an operating frequency of 3.5 GHz and 5 MHz of bandwidth; the BS antenna array consisted of a ULA composed of half wavelength-spaced single-polarized elements. For each test, the modulation and coding scheme (MCS) was adjusted so that the throughput was maximized. Each test was conducted in six randomly chosen indoor locations, the latter not being necessarily the same for all tests. Three OTA tests were conducted: the first evaluation consisted of RT mobility tests with a single UE using single-user beamforming. Secondly, a test was conducted that compared the RT vs. the NRT-OTA modes of SWORD using the same MIMO scenario, with the intention of validating the new NRT-OTA mode. Finally, the recently proposed NL algorithms were tested and compared against linear detectors in an uplink 4 × 4 NRT-OTA setting.
1) SINGLE-USER BEAMFORMING RT MOBILITY TESTS
The intention of this trial was to show that SWORD can be employed to determine the loss of DL throughput that a mobile UE moving at pedestrian speed experiences compared to a static UE. Alternatively, this test can be thought of as a measure of how accurately a beam follows a moving UE. The tests comprised a BS with a ULA consisting of 8 single-polarized antennas and one single-antenna UE, running in LTE TM 7 (non codebook-based single-user beamforming) in an indoor setting, with the platform operating in RT. The channel estimates at the transmitter were obtained via UL SRS pilots, which were transmitted every 1 ms. MRT was used as the beamforming method. The testing procedure is depicted in Figure 12: the first step was to take a throughput measurement at a starting position while keeping the UE static, followed by a second measurement while the UE was moving at pedestrian speed, and finally a third measurement at the stopping position, with the UE remaining static once more. The tests were performed at 6 different indoor locations of the BS and UE, and the results can be observed in Figure 13. The results show that at pedestrian speeds the beamformer tracks the UE accurately, with an average throughput loss of approximately 10% and consistent throughput readings across instances of the tests.
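As a reminder of what MRT does with the SRS-based channel estimate, the sketch below computes textbook maximum-ratio transmission weights for a single-antenna UE. It is a minimal illustration in plain Python, not SWORD's implementation. It assumes a reciprocal (calibrated) TDD channel so that the uplink estimate h can stand in for the downlink channel, and the function names are hypothetical.

```python
import math


def mrt_weights(h):
    """MRT weights w = conj(h) / ||h|| for a channel vector h (one entry per BS antenna)."""
    norm = math.sqrt(sum(abs(x) ** 2 for x in h))
    if norm == 0:
        raise ValueError("channel estimate is all zeros")
    return [x.conjugate() / norm for x in h]


def effective_gain(h, w):
    """Scalar channel seen by the UE after beamforming: sum_i h_i * w_i."""
    return sum(hi * wi for hi, wi in zip(h, w))


if __name__ == "__main__":
    # Illustrative 4-antenna channel estimate (arbitrary values).
    h = [1 + 1j, 2 - 1j, -0.5j, 3 + 0j]
    w = mrt_weights(h)
    # With matched weights the contributions add coherently, giving gain ||h||.
    print(abs(effective_gain(h, w)))
```

If the UE moves between the SRS transmission and the downlink transmission, the weights are matched to an outdated h and the coherent gain drops, which is the effect the mobility test quantifies through throughput.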
2) VERIFICATION OF SWORD'S NRT MODE
To verify our OTA-NRT approach we compared its measured downlink throughput to that of the more traditional RT mode. The downlink throughput is a valid comparison metric because it reflects the effect of all the sub-components of SWORD. Naturally, we need to choose a MIMO mode with real-time support for this comparison. Therefore, the experiment setup comprised a BS with a ULA consisting of 8 single-polarized antennas and one single-antenna UE, running in LTE TM 7 (non codebook-based single-user beamforming) in an indoor setting. MRT beamforming is employed and the beam weights are calculated based on the SRS reference signal that the UE transmits once per subframe. Figure 14 depicts the measured downlink spectral efficiency³ of RT mode and NRT mode at 6 different indoor locations of the BS and a stationary UE. The results show that NRT and RT experiments produce very similar measurements. The difference in downlink throughput between RT and NRT is less than 10% at every test location, and the average throughput differs by less than 1%. Hence, the experiment corroborates that the NRT mode can be employed to produce accurate performance measurements for static MIMO scenarios.
3) LINEAR VS. NON-LINEAR UPLINK MU-MIMO DETECTION IN NRT-OTA
This set of indicative measurements shows that SWORD can be employed to validate computationally complex approaches in a 3GPP-compliant environment, such as the uplink performance of the recently proposed non-linear and massively parallel detection techniques (e.g., [20]). While this massively parallelizable detection approach has previously been validated via OTA experiments, these were only conducted through a partial/experimental PHY layer. To the best of the authors' knowledge, this has not been assessed within a 3GPP-compliant context, taking into account signaling and numerology. SWORD allows us to quantify the gains that massively parallel detection can deliver in real-world deployments before investing in the substantial software and hardware development effort that would be required for RT operation. SWORD, with its new NRT-OTA mode, is a well-suited tool to measure these gains without the aforementioned significant development effort, since RT operation is not required. The evaluation setting consists of a BS with a 4-antenna, single-polarized ULA serving 4 single-antenna UEs (4 × 4 MIMO) in an indoor setting. As the non-linear detection technique we implemented the approach in [20], extended by the LLR extraction technique presented in [74].
In all experiments we set the number of parallel evaluated vector solutions to 32, since this value was shown to be adequate for providing near-optimal algorithmic performance [20] , [21] . The throughput results of the non-linear approach are compared to that of ZF. The tests were conducted over six different indoor locations of the BS and UEs; Fig. 15 shows a picture of the setup of the BS and UEs while conducting one of the experiments. The results are presented in Figure 16 , where it can be observed that a substantial increase in system throughput of NL vs. ZF was achieved in all six locations, with an average gain of 120%. These results validate that the link-level gains of NL vs. ZF presented in the literature (e.g., [20] , [74] ) are also reflected at system level.
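For reference, the linear baseline in this comparison can be summarised in a few lines. The sketch below is a textbook zero-forcing detector, x_hat = (H^H H)^{-1} H^H y, written in plain Python for readability. It is not the AVX-optimized ZF implementation profiled earlier, it omits noise handling and soft-output (LLR) extraction, and all function names are hypothetical.

```python
def hermitian(H):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[H[r][c].conjugate() for r in range(len(H))] for c in range(len(H[0]))]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A square)."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def zf_detect(H, y):
    """Zero-forcing estimate of the transmitted vector from received vector y."""
    Hh = hermitian(H)
    gram = matmul(Hh, H)                                  # H^H H
    rhs = [row[0] for row in matmul(Hh, [[v] for v in y])]  # H^H y
    return solve(gram, rhs)


if __name__ == "__main__":
    # Illustrative 3-antenna BS, 2 single-antenna UEs, noise-free channel.
    H = [[1 + 0j, 0.5j], [0.2 - 0.1j, 1 + 0j], [0.3j, -0.4 + 0j]]
    x = [1 + 1j, -1 - 1j]
    y = [sum(H[r][c] * x[c] for c in range(2)) for r in range(3)]
    print(zf_detect(H, y))
```

In the noise-free case ZF recovers the transmitted symbols exactly. With noise and an ill-conditioned H the inversion amplifies noise, which is the weakness that non-linear detectors aim to avoid.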
VI. DISCUSSION

A. PROFILING AND OFFLOADING DISCUSSION
As illustrated in Fig. 11, exploiting current SIMD technology can achieve significant savings with respect to the total PHY runtime. In this context, we note that despite our extensive optimisations there is plenty of room for further improvement, either through SIMD or through offloading, especially for upper PHY functions (e.g., the channel decoder) that consume a significant portion of the total processing time.
Another major aspect that needs to be highlighted is the way that computation time scales with N_BS_ant and N_layers. As expected, execution time for demapping, descrambling, rate-unmatching, channel decoding and deinterleaving primarily depends on N_layers. As Figure 11a shows, execution time for channel decoding remains consistent for the same number of layers. The Tx mapping, scrambling, rate-matching, channel encoding and interleaving operations exhibit a similar behavior. The remaining PHY operations depend on the N_BS_ant and N_layers combination. For example, profiling DMRS channel estimation and detection/precoding has shown an increase in their respective runtime alongside N_BS_ant and N_layers. Regarding detection and precoding procedures in particular, their runtime scaling behavior with respect to the MIMO configuration was measured and presented in Figure 7.
As described in our profiling methodology (Sec. IV-D), we presented results for which OAI's multiple workers have been disabled. We note that OAI (as of master branch v.1.0.3) supports multi-threaded execution for distributing the physical layer Tx and Rx subframe workload across more than one core. This was designed to enhance FDD operation since the latter requires both Tx/Rx processes to be executed during the same subframe. Our eNB measurements showed that enabling the Tx/Rx split workers has a negligible effect; this was expected due to SWORD's TDD operation. We note that OAI also provides the option to employ multiple workers for parallelizing the execution of front-end processing and channel encoding/decoding procedures. Our initial experiments showed that enabling the built-in option of two workers on the eNB had a small impact of 23% on average, which only involved Tx front-end procedures. Hence, OAI's current status shows that there is plenty of room for exploring multi-core execution. Our results in Section V-A provided insight on how multi-processing can further accelerate SIMD operations when targeting precoding/detection and OFDM subframe procedures (Figs. 7c, 7d and 9). Further analysis of AVX512 and multi-core optimization within the context of the whole OAI physical layer is left for future work.
This diverse experience towards SWORD required facing and addressing several challenges; some are ongoing, but those addressed have made us reach interesting conclusions. Our radio analysis showed that choices for large MIMO are not straightforward and require significant development effort. The modularity of the x86_64-based architecture can facilitate radio integration using COTS components and baseband development using OpenAirInterface. Extending the latter for large MIMO requires significant inter-layer development. AVX512 provides a clear DSP acceleration advantage of up to an order of magnitude compared with non-vectorized code, and multi-core execution potentially an order of magnitude on top of that. FPGAs can provide deterministic latency and offload CPU cores for other baseband tasks. Still, FPGA development effort is significant and should be thus exercised with caution; software optimization should be explored first. As 5G matures and we move towards future wireless standards, it is anticipated that physical layer functionality will need to reside in FPGAs or ASICs. To that end, research will also need to revisit algorithmic developments for enabling distributed processing.
B. OTA DISCUSSION
Three experiments were presented in Section V-B. In the first experiment we showed that SWORD can be employed to validate that a BS using single-user beamforming is able to ''follow'' a UE when it is moving at pedestrian speeds, when the channel estimates at the transmitter are obtained via uplink SRS pilots sent every 1 ms. The second experiment showed that SWORD's NRT-OTA mode can be employed to obtain similar results to the RT-OTA mode. Finally, the third test showed that SWORD can be employed to evaluate the performance of MU-MIMO uplink detection of ZF vs. NL techniques. Due to the computational complexity involved with those techniques, performing the test in RT mode would have required extensive hardware and software optimization, with a high expenditure of time and money. However, this evaluation was made possible by making use of SWORD's NRT-OTA feature, thus relaxing the RT requirement. The results showed that the system-level throughput of NL was on average 120% of that of ZF in a 4 × 4 MU-MIMO setting.
The experiments described above served to showcase the versatility of SWORD. Due to its software-driven paradigm, it is possible to evaluate a wide range of scenarios, from RT experiments and MU-MIMO cases with different numbers of BS antennas and UEs to the evaluation of computationally heavy signal processing techniques in NRT-OTA mode. Furthermore, since SWORD is a full-stack platform, it can be employed to quantify the system-level gains that are obtained through the use of physical-layer approaches.
VII. CONCLUSIONS AND FUTURE WORK
This work presented our experiences with SWORD, an open for collaboration, software-driven, flexible, modular and extendable platform for wireless systems research using COTS equipment. We enhanced the potential of the OpenAirInterface SDR via real-time single-layer dynamic beamforming, multi-user transmission modes and support for built-in TDD reciprocity calibration. We introduced the NRT-OTA mode, which allows for rapid testing of advancements in large MIMO systems. We complemented our modular, x86_64-based architecture with a detailed profiling framework and a SIMD acceleration framework that harnesses the potential of AVX512 and multi-core execution, and that can potentially be accompanied by PCIe-based FPGA offloading. Our extensive profiling results revealed the most prominent bottlenecks that need to be accelerated, as well as the limitations of modern GPP-based platforms. Through our SIMD framework we achieved up to 91% acceleration compared to non-vectorized code and up to 61% compared to OAI's SIMD routines. Furthermore, our initial indicative OTA measurements showcased that SWORD can be employed to perform diverse system-level evaluations of physical-layer techniques in RT and NRT modes. Future work involves further development to explore larger MIMO system aspects and wider bandwidths, NRT-OTA evaluations of novel medium access control (MAC) or cross-layer techniques, enhanced AVX512 acceleration, further functional splits for packet-based fronthaul and the offloading of advanced algorithmic approaches onto programmable accelerators. Finally, as our framework matures, we aim to make it available for research collaboration.
