The ongoing evolution in constellations/formations of CubeSats, along with a steadily increasing number of satellites deployed in low Earth orbit, demands a generic reconfigurable multimode communication platform. As the number of satellites increases, the existing protocols, combined with the trend to build one control station per CubeSat, become a bottleneck for existing communication methods to support data volumes from these spacecraft at any given time. This paper explores the software-defined radio (SDR) architecture for the purposes of supporting multiple signals from multiple satellites, deploying mobile and/or distributed ground station nodes to increase the access time of the spacecraft, and enabling a future SDR for distributed satellite systems. Performance results of differing software transceiver blocks and the decoding success rates are analyzed for varied symbol rates over different cores to identify bottlenecks that are candidates for field programmable gate array acceleration. Furthermore, an embedded system architecture is proposed based on these results, favoring the ground station, which supports the transition from single-satellite communication to multisatellite communications.
I. INTRODUCTION
Small satellites are fast becoming a way to perform scientific and technological missions due to reduced build time, more frequent launch opportunities, larger variety of missions, more rapid expansion of the technical and/or scientific knowledge base, and greater involvement of small industries/universities [1] , [2] . Furthermore, there is an ongoing evolution of multiple small satellite scenarios such as FLOCK-1 [3] , QB50 [4] , Autonomous Assembly of a Reconfigurable Space Telescope [5] , Surrey Training Research and Nano-Satellite Demonstrator (STRaND-2) [6] , and Edison Demonstration of Smallsat Network [7] . The objectives of these missions are very ambitious and are driven by new complexities which require multimode operation of wireless transceivers [8] .
This work aims at three specific application areas. The first is a ground station that can handle multiple satellite signals at any given time, as seen in Fig. 1. The increasing number of satellites in low Earth orbit occupying the Amateur Radio Spectrum, together with the variety of modulation techniques, data rates, and protocols [9] used across the CubeSat community, demands the integration of a multitude of communication standards onto a single platform. This is compounded by the problem of crowded spectrum [10], which is driving research on more efficient use of the available spectrum, e.g., by deconfliction or Cognitive Radio techniques. For all such applications, universal programmable hardware is desirable, which has intensified the interest in software-defined radio (SDR) in recent years [11]. Such an SDR must be robust in noisy and/or contested spectrum and make maximum use of a priori information to minimize initial acquisition and detection bandwidths.
The second is the need for a deployable mobile ground station network to increase access time, as in ESA's Global Educational Network for Satellite Operations system [12] and the Satellite Networked Open Ground Station (SatNOGS) [13]. A ground station based on SDR hardware is suitable for worldwide distributed systems, where updates containing the software for communicating with new waveforms could be shared among different distant stations without the need for hardware upgrades.
Finally, a candidate embedded design is presented as a possible enabler of the future SDR for distributed satellite communication systems. The growth of SDR offers small satellites the opportunity to improve the way space missions develop and operate transceivers for communication networks in space, as seen in [36] and [40]. The ability to change the operating characteristics of a radio through software once deployed to space offers the flexibility to adapt to new science opportunities and recover from anomalies within the science payload or communication system, e.g., in global navigation satellite systems receivers as in [38], [39]. Also, it potentially reduces development cost and risk by adapting generic space platforms to meet specific mission requirements. However, this flexibility and adaptability come at the expense of power consumption and complexity in integrating previously separated building blocks on a single die. The objectives of this paper are as follows.
1) SDR implementation and profiling analysis of SmallSat Telemetry, Tracking and Command waveform on state-of-the-art radio frequency (RF) (Analog Devices AD9361) and Base Band SoC (Xilinx Zynq)-based architecture, with emphasis on ground system multisatellite reception.
2) Profiling of C/C++ based reference waveform design on dual-, quad-, and octacore central processing units (CPUs) with the aim of moving minimum functionality from general purpose processor software to field programmable gate array (FPGA) firmware, in order to meet performance goals, maximize flexibility, and minimize expenses associated with implementation of many variant waveforms.
3) Using a low-cost Zynq SoC solution (the Zedboard), the desired multisatellite reception can accommodate up to four concurrent satellites by moving waveform-independent front-end tuning, filtering, and decimation functions from software to firmware, leaving waveform-dependent matched filtering, demodulation, and decoding functions in software.
This paper is an extension of the work carried out in [33] , [34] , where a novel SDR architecture on an embedded system is proposed as seen in Section II. The implementation and validation process of the proposed transceiver architecture is briefed in Section III (more details on transceiver implementation and validation can be found in [34] ). The focus of this paper is to understand the CPU load caused by each transceiver block as discussed in Section IV. Furthermore, in Section V, an improvement in the design is achieved by redistributing the transceiver blocks within the SoC. Lastly, Section VI summarizes the contributions and future work.
II. TRANSCEIVER ARCHITECTURE
For over two decades, SDR technology has promised to revolutionize the communication industry by delivering low-cost, flexible software solutions for communication protocols [9]. In this decade, the introduction of the BB SoC and, most recently, the RF programmable transceiver SoC can fulfill the early promise. Also, open source simulation tools such as GNURadio [35] are widely used to implement low-cost digital beacon receivers based on SDR [37] and emergency managers weather information network and low-rate information transmission software receivers [41]. GNURadio was used in this research initially to understand the working of the existing/generated filters, channel codes, synchronization elements, equalizers, demodulators, decoders, and other processing blocks using prerecorded or generated data, as addressed in [33].
Toward achieving the attributes discussed in the previous section, this work proposes a new SDR architecture on an embedded system as seen in Fig. 2 [33] . This architecture consists of a BB SoC paired with RF SoC. The BB SoC contains FPGA fabric and ARM dual-core Cortex A9 processor. For initial development, the Avnet Zedboard containing the Xilinx Zynq 7020 FPGA SoC [14] is chosen providing a low-cost and well-supported back-end for the signal processing functionalities.
On the RF programmable transceiver SoC, initial evaluation took place using the Lime Micro Myriad RF containing the LMS6002D RF SoC [15] . More recent development has taken place using the Analog Devices AD-FMCOMMS3-EBZ containing the newer AD9361 RF SoC [16] . It is hoped that future developments will incorporate the latest and most capable Lime Micro SoC, the LMS7002M [17] . The two boards (and constituent SoCs) communicate using conventional parallel I/O for high-speed sampled data (up to ∼123 complex MSPS) and serial peripheral interface for configuration, control, and monitoring. Detailed description of the SDR architecture can be found in [33] , [34] .
III. IMPLEMENTATION
As a first step toward validating the architecture, a simple coder modulator/demodulator decoder reference model for a well-known CubeSat beacon telemetry was implemented. The FUNcube-1 (AO-73) CubeSat [18] provides a good starting point for our work because its telemetry beacon is documented and addressed by a number of open source software (OSS) demodulator decoder implementations written in C/C++.
A. Transmitter
The particular scheme, from AO-40 heritage [19], common among several CubeSats [18], is based on binary phase-shift keying (BPSK) modulation and a robust concatenated code comprising Viterbi (Rate 1/2) [20] and two Reed Solomon (160,128) blocks [21]. Much work here derives from Phil Karn's well-known AO-40 design and implementation [KA9Q] [19]. The Analog Devices AD-FMCOMMS3-EBZ has bare metal and Linux operating system (OS)-based device drivers accompanied by application examples. For this work, we have started with the Zynq ARM Linux OS-based approach, as the integration and test of application-related OSS may be simplified. To this end, Analog Devices provides a capable AD9361 Linux device driver, dependent on and accessed using the Linux industrial I/O (IIO) framework [22]. Linux IIO allows user space waveform applications to configure, query, and stream samples to and from the AD9361 using familiar UNIX calls (open/close/read/write/ioctl) or, preferably, through a user space library called libiio [23]. The Linux libiio provides a modern high-performance abstraction. Fig. 3 shows the signals being received on a FUNcube Pro+ dongle and spectrum analysis performed using SDR Sharp [24].
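The coding overhead of the concatenated scheme described above can be checked with a little arithmetic. The following is a minimal sketch that assumes one coded bit per BPSK symbol and ignores sync-vector and framing overhead, so the figures are upper bounds rather than measured throughput; the function name `info_rate_bps` is illustrative only.

```python
from fractions import Fraction

# Parameters taken from the text: Reed Solomon (160,128) and a rate-1/2 code.
RS_N, RS_K = 160, 128
CONV_RATE = Fraction(1, 2)

# Overall code rate of the concatenated code (outer RS times inner convolutional).
overall_rate = Fraction(RS_K, RS_N) * CONV_RATE


def info_rate_bps(symbol_rate):
    """Upper bound on information rate for BPSK (1 coded bit per symbol).

    Assumption: sync and framing overhead are ignored.
    """
    return float(symbol_rate * overall_rate)


for rate in (1200, 2400, 4800, 9600, 19200):
    print(f"{rate:>6} sym/s -> at most {info_rate_bps(rate):.0f} information bit/s")
```

Under these assumptions, the 1.2 K beacon carries at most 480 information bits per second, which is why the decoding load per delivered bit is comparatively high.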
It was possible to run iio_fcenc on different platforms of varying architecture and core capacity, including ARM Cortex A15 and ARM Cortex A7, ARM Cortex A9, and Intel x86. The transmission was also verified on a Rohde & Schwarz FSV3 Vector Signal Generator (VSP) [25] as seen in Fig. 4, and the constellation plot of the BPSK signal can be seen in Fig. 5. The error vector magnitude is ∼2%, which is within acceptable values for low-order modulations. The carrier frequency offset is 225 Hz from the center frequency (145.935 MHz), suggesting the absolute accuracy of the AD-FMCOMMS3-EBZ crystal to be ∼1.5 ppm. The AD-FMCOMMS3-EBZ provides the flexibility to transmit at any desired frequency within the range of 70 MHz to 6.0 GHz. Also, the freedom to adjust center frequencies and sample rates under software control helps compensate for thermal drift, clock timing, and Doppler effects. This architecture demonstrates SDR attributes such as postlaunch reconfigurability, scalability, and affordability, promoting commercially available computer software and hardware products/standards, which were not achievable with traditional transmitters.
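The ∼1.5 ppm crystal accuracy quoted above follows directly from the measured offset and the center frequency. A minimal sketch of that calculation, with the helper name `offset_ppm` chosen here for illustration:

```python
def offset_ppm(offset_hz, center_hz):
    """Express a carrier frequency offset as parts per million of the carrier."""
    return offset_hz / center_hz * 1e6


# Values from the text: 225 Hz offset at a 145.935 MHz center frequency.
measured_ppm = offset_ppm(225.0, 145.935e6)
print(f"{measured_ppm:.2f} ppm")
```

The same relation can be inverted to estimate the expected offset at another carrier frequency for the same reference oscillator.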
B. Receiver
The chosen OSS starting point to form a "reference implementation" is Alex Csete's FUNcube Decoder (fcdec) available on github [26]. This C/C++ code base, targeted for Linux, is designed to work offline using sample files captured from the FUNcube Pro Dongle [24]. Using IIO along with fcdec, it has been possible to create a soft real-time reference decoder called "iio_fcdec," similar to "iio_fcenc." This was tested for interoperability against FUNcube-1 reference waveforms up-sampled, stored, and played back on a Rohde & Schwarz SMBV100 VSG [25]. The transmitted signals were looped back to the receiver port to transmit and receive the signals simultaneously. Fig. 6 shows the decoded packets from the loopback test. It was also possible to run iio_fcdec on an x86 PC and an Odroid-XU Lite (octa-core: quad-core ARM Cortex A15 and quad-core ARM Cortex A7) [27] and stream samples from the Zedboard (which is running iiod by default) over an Ethernet network to compare the performance of the blocks on different processors. Different symbol rates (1.2, 2.4, 4.8, 9.6, and 19.2 K) were achieved by changing the interpolation ratio and decimation ratio in iio_fcenc and iio_fcdec, respectively, similar to what was achieved on the Zedboard.
A practical problem encountered stems from the lowest filtered decimated sample rate, of order 1.5 Msps, that can be output from the AD9361 RF SoC. To address this, the AD9361 is configured to produce an integer multiple of an oversampled symbol rate (e.g., 40 × 1.2 K) that is conveniently larger than the 1.5 Msps limit imposed. In this implementation, 1.536 Msps was chosen, which derives from 16 × 96 ksps. Therefore, the received sample stream is decimated by 16. The resulting 96 ksps sample stream has sufficient bandwidth to address spacecraft Doppler and oscillator uncertainties, while LO breakthrough and IQ imbalance artifacts are discarded by halving the available bandwidth to ∼40 kHz. The 96 ksps sample stream is processed in software for flexibility and simple access to floating point arithmetic. This receiver processing is embedded (on the Zedboard's ARM Cortex A9) or streamed remotely to a more powerful host such as Intel x86 and Odroid-XU Lite, as in Table I, using sample streaming provided by Analog Devices' iiod [28].
The first signal processing step is coarse carrier acquisition performed using an 8192 point fast Fourier transform (FFT). This results in a further 96 ksps sample stream that is approximately band centered on the largest (wanted) carrier. A software-based finite impulse response filter, 27 taps long, containing a low-pass impulse response, is used to further filter and decimate the signal by a factor of 10 to 9.6 ksps, offset by 1.2 kHz from baseband (for heritage reasons). At this stage, the underlying signal is down-converted to baseband and matched filtered, followed by carrier phase recovery. Finally, from symbol timing recovery, a 1.2 K symbol stream is produced and passed to the decoder. As the receiver input signal bandwidth is limited to ∼40 kHz by the reference design, the symbol rate was limited to 19.2 K (which still includes excess bandwidth for Doppler uncertainty).
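The sample-rate choice described above can be reproduced as a small planning calculation. This is a sketch under the assumption that the device limit is exactly 1.5 Msps (the text only says "of order 1.5 Msps"); the function name `plan_front_end` is hypothetical.

```python
import math

# Assumed lower limit of the AD9361 output rate ("of order 1.5 Msps" in the text).
MIN_DEVICE_RATE_SPS = 1_500_000
# Target software processing rate from the text.
TARGET_RATE_SPS = 96_000


def plan_front_end(target_sps, min_device_sps):
    """Pick the smallest integer decimation whose device rate meets the limit."""
    decimation = math.ceil(min_device_sps / target_sps)
    return decimation, decimation * target_sps


decimation, device_rate = plan_front_end(TARGET_RATE_SPS, MIN_DEVICE_RATE_SPS)
samples_per_symbol = TARGET_RATE_SPS // 1200  # oversampling at the 1.2 K symbol rate
print(decimation, device_rate, samples_per_symbol)
```

With these inputs the sketch returns a decimation of 16 and a device rate of 1.536 Msps, matching the configuration reported in the text, and shows that the 96 ksps stream carries 80 samples per symbol at 1.2 K.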
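The coarse carrier acquisition step can be illustrated with a peak search over an 8192-point FFT at 96 ksps. The sketch below is not the fcdec implementation: it uses a plain recursive radix-2 FFT and an unmodulated complex tone as a stand-in for the received signal, and the offset value is hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def coarse_carrier_offset(iq, fs):
    """Return the frequency (Hz) of the strongest FFT bin, in [-fs/2, fs/2)."""
    n = len(iq)
    spectrum = fft(iq)
    peak = max(range(n), key=lambda k: abs(spectrum[k]))
    if peak >= n // 2:  # upper half of the FFT holds negative frequencies
        peak -= n
    return peak * fs / n


FS = 96_000.0           # sample rate after the first decimation (from the text)
N = 8192                # FFT length (from the text)
TRUE_OFFSET = -7_300.0  # hypothetical Doppler plus oscillator offset in Hz

tone = [cmath.exp(2j * math.pi * TRUE_OFFSET * i / FS) for i in range(N)]
estimate = coarse_carrier_offset(tone, FS)
print(f"estimated offset: {estimate:.1f} Hz (bin width {FS / N:.2f} Hz)")
```

The bin width of FS / N, about 11.7 Hz, bounds the accuracy of this coarse estimate; the residual is left to the later carrier phase recovery stage.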
IV. PROFILING
Profiling can decompose and tabulate the execution weight of each block in the compiled C/C++ program. We are using GNU gprof [29] to identify critical regions and determine which blocks need to be optimized, vectorized, and/or moved to FPGA firmware (HDL). The aim here is to exploit vectorized instructions within the BB SoC hardcore (i.e., ARM NEON SIMD capability [30]) and FPGA softcores (DDC/FFT [31]) to optimize the implementation in order to accommodate more than one signal path on the BB SoC. GNU gprof helps in making the above choices in an educated and incremental fashion. During profiling, the packet/frame decoding success rates are recorded to later aid results reconciliation. In this approach, the data rate is increased to (and beyond) the point that CPU starvation sets in. Using a block-based waveform realized in pure software, the observed effects of CPU starvation are not catastrophic; rather, a graceful degradation occurs.
A. Transmitter Profiling
Fig. 7 shows the flow of the computationally intensive transmitter blocks implemented on the Xilinx Zynq Processing System, where main(), which streams the samples, and FCsample(), which performs upsampling, report the maximum CPU consumption.
During profiling, both transmitter and receiver programs are executed at different data rates and on the different platforms listed in Table I (Zedboard, Odroid-XU Lite, and Dell Optiplex 745) to understand the function distribution at higher symbol rates. Fig. 8 compares the absolute CPU consumption on dissimilar platforms while the encoder is running at varied data rates. It is evident that the CPU consumption increases with the symbol rate. The behavior appears linear on ARM Cortex A9 operating at 700 MHz, quadratic on ARM Cortex A15/A7 operating at 1.4/1.2 GHz, and cubic on Intel x86 operating at 2.13 GHz. This behavior can also be observed on relative CPU consumption plots of the encoder program across the platforms. Table II gives the relative comparison of the CPU consumption by different transmitter functions on the Dual Core ARM Cortex A9. FCsample(), which performs upsampling, and main(), which is responsible for streaming the samples and managing buffers, are the two dominant functions, with other functions being negligible. Though main() and FCsample() each contribute ∼50% toward the CPU consumption at 1.2 K, the relative contribution of FCsample() increases, whereas that of main() decreases, linearly with symbol rate.
The behavior of these functions on "the quad cores" ARM Cortex A15 and A7 appears quadratic. Here, the sample streaming is quicker compared to "the dual core" ARM Cortex A9, and therefore FCsample() dominates. The aim here was to identify the most dominant block, which is distinctly FCsample() performing upsampling; this will be moved to FPGA firmware (HDL) for optimization, thereby enabling multiple signal transmission.
B. Receiver Profiling
Similarly, Fig. 9 shows the computationally intensive functions in the receiver chain (on Intel x86). The downsampling is done in three different stages, in main(), go(), and RxDownSample(), which are the most dominant functions, followed by ProcessFFT(), where the appropriate signal is selected. Fig. 10 shows the absolute CPU consumption on dissimilar platforms while the decoder is running at varied data rates. The decoder consumes more than 50% CPU at 1.2 K on ARM Cortex A9 and reaches almost 100% (the growth appears linear), resulting in a low success rate as the symbol rate increases (see Fig. 11). The behavior appears quadratic on ARM Cortex A15 and A7 and reaches ∼50% at higher data rates.
This results in unsuccessful decoding at data rates of 9.6 and 19.2 K, as shown in Fig. 11, along with differences in profiling behavior, as shown in Table III. In contrast, Intel x86, which exhibits cubic behavior, remains well within 50% even at 19.2 K, ensuring a 100% success rate.
The function main(), which decimates the samples from 1.536 Msps to 96 ksps, becomes less dominant as the data rate increases on all three platforms. The function go(), which performs further downsampling from 96 ksps to 96 bps along with Reed-Solomon and Viterbi decoding (embedded within FECDecode, which works on hard bits but is capable of working on soft symbols), is more prevalent on the ARM Cortex A15 and A7 than on the ARM Cortex A9 and Intel x86, thus suppressing other functions, as seen in Table III. Although there is a significant difference in the relative CPU consumption between the two stages of downsampling [main() and go()] at 1.2 K, the difference gradually reduces and they consume almost the same CPU (∼50%) at 19.2 K on ARM Cortex A15 and A7, unlike on ARM Cortex A9 and Intel x86. In addition, the compilers were found to differ across the platforms, as seen in Table IV. The instruction mix used to perform similar functions also differs across the architectures, as observed using "objdump." On the Intel x86 architecture, "move" instructions dominate over "add" instructions, whereas on the ARM architectures "add" instructions occur more frequently. This may suggest that memory operations are the key: reducing the number of reads/writes to memory and decimating the samples would make the design more efficient.
V. IMPLEMENTATION OF DDC BLOCK IN FPGA
Based on the profiling results obtained earlier, it is evident that the upsampling and downsampling are computationally intensive blocks in the transceiver. The architecture was revised in order to efficiently utilize the FPGA firmware and take advantage of its flexibility and speed. The FPGA firmware was reconfigured to include the sample DDC block. The reference design includes the core from Analog Devices which fetches the samples from RF SoC interface core and provides them to Zynq PS for further processing. The sample DDC block was implemented in between RF core (AXI_AD9361) and sample packer block which packs I and Q signals from different channels before the signal is stored in direct memory access (DMA). Other blocks such as modulation/demodulation, frequency/phase correction, and packet handling which are computationally less intensive were retained in ARM Cortex A9 processors.
A. Postprofiling Results
Once the DDC block was implemented on the FPGA fabric, profiling was repeated to understand the improvement achieved. Fig. 12 shows the percentage reduction in the absolute CPU consumption at different data rates. This improvement in the average load allows parallel reception of up to five signals (without accounting for instantaneous peak load), so up to four signals is a more realistic expectation at 1.2 K, where reception was previously limited to one. Similarly, two signals at 2.4 K can be decoded simultaneously in place of a single signal.
The first digit on the x-axis stands for hardware decimation and the second digit for software decimation. It is clear that as the hardware decimation increases, the CPU consumption decreases. Table V summarizes the improvement achieved at different data rates. Similar progress can be seen in the relative performance measures in Table VI: main(), which was contributing 89.6% of the CPU load, is now reduced to 50% with hardware decimation at 1.2 K, from 84.05% to 56.72% at 2.4 K, from 74.58% to 57.12% at 4.8 K, and from 60.23% to 52.59% at 9.6 K. Table VII shows FPGA processor logic utilization before and after the front-end DDC function was moved to firmware and includes the resulting percentage increase in FPGA utilization. Adding a sample DDC block to the original design increased the power consumption and the hardware requirements. The total overhead in on-chip power is 13.14%, with a 5% increase in flip-flops and memory LUTs, an 8% increase in LUTs, an 18% increase in BRAMs, and a 3% increase in DSPs. This analysis suggests that approximately four to five DDC/DUCs can be implemented in order to aid parallel reception. Here, we use the Zynq 7020, but for a larger number of signals with higher data rates, a larger FPGA may be selected [32].
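The step from "five signals on average" to "four signals as a safer expectation" amounts to reserving headroom for peak load. The sketch below illustrates that reasoning only; the per-signal loads and the reserve are hypothetical placeholders, not values from Table V, and `max_parallel_signals` is an illustrative name.

```python
import math


def max_parallel_signals(load_per_signal_pct, reserve_pct=0.0):
    """Number of decoders that fit in the CPU budget left after a peak-load reserve."""
    usable_pct = 100.0 - reserve_pct
    return math.floor(usable_pct / load_per_signal_pct)


# Hypothetical average decoder load per signal after hardware decimation.
LOAD_1K2_PCT = 20.0
LOAD_2K4_PCT = 45.0
# Hypothetical reserve for instantaneous peaks.
PEAK_RESERVE_PCT = 20.0

print(max_parallel_signals(LOAD_1K2_PCT))                    # average load only
print(max_parallel_signals(LOAD_1K2_PCT, PEAK_RESERVE_PCT))  # with peak reserve
print(max_parallel_signals(LOAD_2K4_PCT))
```

With measured loads substituted for the placeholders, the same calculation gives the number of concurrent decoders a given processor can sustain at each symbol rate.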
VI. CONCLUSION
This paper presents the design and implementation of an adaptive SDR architecture on different platforms with varied symbol rates of 1.2, 2.4, 4.8, 9.6, and 19.2 K. C/C++ was preferred over VHDL for initial implementations due to the reduced implementation time of simple blocks such as decoding/encoding and demodulation/modulation. Profiling using gprof tabulates the relative and absolute performances along with the success rates affected by CPU saturation. Also, the functions exhibit diverse behavior (linear, quadratic, and cubic) on dissimilar platforms. The obtained performance results demonstrate the need to move blocks demanding higher computation capacity, such as the up/downsampling blocks, to FPGA firmware.
The sample DDC block was moved to the FPGA, and the postprofiling results show an improvement in performance that facilitates reception of more than one signal at any given time, with the most significant improvements at lower data rates: 36.76% at 1.2 K and 31.14% at 2.4 K. This comes at a cost of 13.14% more on-chip power and a 5%-15% increase in on-chip resources. Therefore, it has been concluded that for this reference design, moving the front-end DDC function alone, from software to firmware, is sufficient to allow multiple satellite reception at typical CubeSat telemetry rates.
Future work includes the implementation of the proposed design with n-stage pipeline architecture on FPGA SoC as shown in Fig. 13 based on different stages of transmission synchronization. The objective of the pipeline architecture is to receive the signal from more than one satellite operating at different modulation techniques, data rates, and center frequencies. RF SoC would acquire the desired signal present in the spectrum with predefined software configuration of the front end such as gain, filters, bandwidth, and center frequency.
The next stage in the architecture includes parallel wrappers of DDCs consisting of digital quadrature tuner and cascaded integrator comb blocks. Each valid signal in the spectrum is mapped to a separate channel based on the available on-chip resources. Each signal is stored at a different offset address in the DMA, which is configured according to the precalculated memory requirements. The last stage in the architecture is proposed to be asynchronous, as the signals stored in the DMA can be accessed independently using different decoder threads running on dual core processors.
Using a single programmable baseband SoC to execute several baseband processing programs at the same time offers benefits such as increased hardware reuse, shared software kernel functions, and use of shared information, such as link state and channel parameters. However, in order to avoid data loss or dropped packets or frames, the combined FPGA logic and processor must have the resources to support the worst case load in all supported standards simultaneously.
In conclusion, this paper demonstrates the concept of combining state-of-the-art low-cost SDR hardware and OSS tools toward achieving a new generic communication platform for satellite communications. Potential applications of the proposed embedded system architecture are the ground station for multisatellite communications and the deployable mobile ground station network, and the architecture can be further extended to distributed satellite systems.
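The cascaded integrator comb (CIC) block mentioned above is attractive in firmware because it decimates using only adders and delays. The following is a behavioral sketch in integer arithmetic, not the HDL used in this work; the decimation ratio of 16 mirrors the 1.536 Msps to 96 ksps step in the text, while the choice of three stages and a differential delay of one are assumptions.

```python
def cic_decimate(samples, rate, stages):
    """Behavioral CIC decimator: integrators at the input rate, combs at the output rate."""
    integrators = [0] * stages
    comb_delays = [0] * stages
    output = []
    for index, sample in enumerate(samples):
        value = sample
        # Integrator cascade runs on every input sample.
        for s in range(stages):
            integrators[s] += value
            value = integrators[s]
        # Keep one sample in every `rate`, then run the comb cascade.
        if (index + 1) % rate == 0:
            for s in range(stages):
                delayed = comb_delays[s]
                comb_delays[s] = value
                value -= delayed
            output.append(value)
    return output


RATE = 16    # decimation ratio, matching 1.536 Msps -> 96 ksps in the text
STAGES = 3   # assumed number of integrator/comb pairs

dc_input = [1] * (RATE * 8)
dc_output = cic_decimate(dc_input, RATE, STAGES)
dc_gain = RATE ** STAGES  # steady-state DC gain to be normalized out downstream
print(dc_output, dc_gain)
```

After the initial transient, a constant input settles to the DC gain of RATE ** STAGES, which indicates the bit growth the firmware accumulators must accommodate. In a complete DDC, one such decimator would follow the quadrature tuner on each of the I and Q paths.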
