The communication range of wireless networks can be greatly improved by using distributed beamforming from a set of independent radio nodes. One of the key challenges in establishing a beamformed communication link from separate radios is achieving carrier frequency and sample timing synchronization. This paper describes an implementation that addresses both carrier frequency and sample timing synchronization simultaneously using RF signaling between designated master and slave nodes. By using a pilot signal transmitted by the master node, each slave estimates and tracks the frequency and timing offsets and digitally compensates for them. A real-time implementation of the proposed system was developed in GNU Radio and tested with Ettus USRP N210 software defined radios. The measurements show that the distributed array can reach a residual frequency error of 5 Hz and a residual timing offset of 1/16 of the sample duration 70 percent of the time. This performance enables distributed beamforming for range extension applications.
INTRODUCTION
Cooperation among geographically distributed radios, also known as distributed arrays, provides a performance boost in a variety of applications. Distributed beamforming (DBF) is a nascent wireless communications technology that seeks to offer signal power gains using distributed arrays. With distributed transmitter and receiver beamforming, groups of radios transmit messages over longer distances and more power-efficiently than any single node pair could achieve [1, 2]. With $N_T$ nodes in a transmit group and $N_R$ receive nodes, a joint transmit/receive beamforming system can theoretically achieve a gain of $N_T^2 N_R$ relative to a point-to-point link. The extremely concentrated energy field from DBF can also be used for far-field wireless energy delivery [3], which reduces energy in unintended areas where humans could be adversely affected by microwaves [4]. Distributed multiple-input multiple-output (DMIMO) is another application of distributed arrays, which relies on the increased virtual array dimension to provide the required degrees of freedom for spatial multiplexing and interference control. This technique facilitates significant spectral efficiency improvement [5], [6]. A distributed array can also synthesize a larger virtual aperture and provide orders-of-magnitude higher precision in localization [7] and radar imaging [8] applications.
978-1-5386-6854-2/19/$31.00 © 2019 IEEE
Synchronization among radios is the key requirement for distributed beamforming, MIMO, localization and imaging, since carrier frequency offset/drift and sample timing offset introduce non-coherence and severely degrade performance. Synchronization schemes in most recent prototype studies [2, 3, 5, 8] rely on dedicated cables, e.g., copper and optical fiber, or GPS clocks. The over-the-wire reference signals, e.g., 10 MHz and pulse-per-second (PPS) signals, greatly simplify the synchronization, but they are not suitable for mobile systems. Meanwhile, reliance on GPS clocks may greatly increase the cost of the radios. A digital signal processing (DSP) based in-band synchronization protocol is an alternative solution, favorable due to its potentially low complexity and cost in synchronizing distributed arrays. DSP-oriented synchronization is especially suitable for distributed mobile radios, e.g., cooperative unmanned aerial vehicle access [9] and backhaul [10]. In this work, we focus on in-band carrier frequency and sample timing synchronization. Synchronization among distributed radio nodes has been analyzed theoretically; for example, see the survey papers [11, 12] and references therein. Recently, a number of software-defined-radio (SDR) proof-of-concept implementations have been reported. Works [13, 14] present a novel system for transmitter or receiver beamforming where slave radios receive a continuous wave (CW) signal from the master node and use it to estimate the carrier frequency offset and adjust the phases of baseband samples to reach synchronization. In [15, 16] an SDR implementation of sample timing synchronization is developed while using perfectly synchronized carriers with 10 MHz references. Reference [17] considers a broadcast system, and sub-µs level timing alignment is achieved for the WiFi OFDM waveform. An in-band synchronization protocol is developed in [18, 19] which results in a performance boost, as measured by throughput.
While these studies have advanced the state-of-the-art in DBF, there is a scarcity of studies on SDR implementation of joint frequency and timing synchronization that supports real-time DBF and DMIMO data communications.
In this work, we focus on the real-time SDR implementation of joint carrier and timing synchronization for DBF applications using the Universal Software Radio Peripheral (USRP) [20]. We develop a lightweight protocol for distributed radio nodes to achieve synchronization with a master-slave architecture. As opposed to [18, 19], we focus on synchronization accuracy rather than throughput improvement. We use the theoretical performance bound of synchronization accuracy, as well as the accuracy required for DBF obtained from our simulation, as guidelines in protocol design. The experimental results show that the proposed scheme achieves under 5 Hz residual frequency error. The residual timing error is below 1/16 of the sample duration nearly 70 percent of the time, based on an overnight trial. Our proposed protocol and algorithm are tailored for DBF applications, but are straightforward to extend to other applications as well.
The rest of the paper is organized as follows. In Section 2, we introduce the concept of distributed arrays and the system model for synchronization. We present the proposed synchronization protocol and DSP algorithms in Sections 3 and 4, respectively. The theoretically achievable and the required synchronization accuracy for practical scenarios are analyzed in Section 5. SDR implementation details are discussed in Section 6, and experimental results, which verify our analysis, are presented in Section 7. Section 8 discusses the limitations of the current implementation and future research directions. Finally, Section 9 concludes the paper.
Notations: Scalars, vectors, and matrices are denoted by non-bold, bold lower-case, and bold upper-case letters, respectively, e.g., $h$, $\mathbf{h}$ and $\mathbf{H}$. The element in the $i$-th row and $j$-th column of matrix $\mathbf{H}$ is denoted by $\{\mathbf{H}\}_{i,j}$. Conjugate, transpose and Hermitian transpose are denoted by $(\cdot)^*$, $(\cdot)^T$ and $(\cdot)^H$, respectively.
SYSTEM MODEL
In this section, we introduce the concept of distributed array systems and we present the challenges in achieving synchronization between radios. The nomenclature used in the paper is summarized in Appendix A.
Distributed array system concept
Consider two groups consisting of single antenna radios, as shown in Figure 1 , trying to communicate wirelessly. Each radio has its own clock for carrier synthesis and sample timing. In order for each group to behave as an antenna array, thus performing cooperative communication, all the radios within the group need to be synchronized.
The system we propose uses the same frequency band for synchronization and cooperative communication, and these two functions are interleaved in time. We use a master-slave based intra-group synchronization architecture, which is suited to reducing the synchronization time for a small-scale group having direct links between the master and the slaves. In the synchronization phase, the master broadcasts a preamble and the slaves use it to estimate the frequency and timing offsets between themselves and the master. The slaves then compensate for the offsets using baseband processing in order to enable cooperative communication.
System model for synchronization
Let us consider the synchronization between the master and one of the slaves. The results obtained can easily be extended to multiple slaves due to the master-slave architecture. We focus on a narrow-band system and leave the wide-band system for future work.
Due to the use of multiple oscillators and timing signals, there is a sample timing offset (STO) $\Delta\tau$, with units of seconds, and a carrier frequency offset (CFO) $\Delta f$, with units of Hz, between the master and each slave. The sample duration $T_s$ is identical in the master and the slave. For notational convenience, we define the normalized CFO as $\epsilon_F = 2\pi \Delta f T_s$.
Let us denote the transmitted signal from the master as s[n].
Then the digitized signal at the slave can be expressed as
where the complex channel gain and hardware phase offset between master and slave is denoted as $h_0$, the thermal noise at the slave radio is modeled as additive white Gaussian noise (AWGN) $w[n] \sim \mathcal{CN}(0, \sigma_n^2)$, and the signal to noise ratio is defined as $SNR_{sync} = |h_0|^2/\sigma_n^2$. Due to the delay $\Delta\tau$, there is a misalignment between the transmitter and receiver sample ticks. Therefore, the continuous-time analog filtering at the transmitter is taken into account and is modeled as the function $p_{ps}(t)$. The exact expression of such a filter is hardware dependent and is commonly unknown in SDR implementations. For tractable algorithm design in this work, we utilize the approximation that $p_{ps}(t)$ is a known low-pass interpolation function.
The objective of synchronization is to design a protocol and a signal processing algorithm which together allow the slave to use the received signal to estimate the CFO ∆f and STO ∆τ .
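One signal model consistent with the definitions above can be sketched as follows. The exact form of the pulse-shaping sum and the unit-power normalization of $s[n]$ are assumptions made for illustration, not a statement of the expression used in the implementation.

```latex
r[n] = h_0 \, e^{j 2\pi \Delta f \, n T_s}
       \sum_{k} s[k] \, p_{ps}\big(n T_s - k T_s - \Delta\tau\big) + w[n],
\qquad w[n] \sim \mathcal{CN}(0, \sigma_n^2)
```

Under this sketch, if $\Delta\tau = 0$ and the interpolation function satisfies $p_{ps}(0) = 1$ and $p_{ps}(m T_s) = 0$ for every nonzero integer $m$, the sum reduces to $s[n]$ and only the CFO-induced phase ramp and the noise remain.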
PROPOSED PROTOCOL
In this section, we introduce the proposed frame structure as well as the procedure for synchronization. In order to quantify the required overhead in synchronization, the cooperative communication procedure is also briefly introduced.
Frame structure
The proposed frame structure is shown in Figure 2 . The frame contains a periodic synchronization preamble and cooperative communication segment with period T frame . This value depends on the drift rates of frequency and timing offsets, which requires calibration of the hardware.
Due to constraints of implementation as well as overhead efficiency concerns, the proposed frame does not include a transmitter time stamp as is commonly utilized in distributed embedded systems. We propose to use an $M$-repetition ($M \geq 2$) Zadoff-Chu (ZC) sequence [21], $s_{zc}[n]$, $n \in [0, N_{zc} - 1]$, with length $N_{zc}$, as the synchronization preamble. The ZC sequence is known for its perfect cyclic-autocorrelation property, i.e.,
Note that the duration of the synchronization preamble is
The synchronization preamble used in (1) is expressed as
A fixed time duration T guard , known to both slave and master, is reserved for the slave nodes for baseband processing. The specific value depends on hardware implementation, e.g., the propagation delay, radio-frequency (RF) and baseband (BB) processing delay.
The cooperative communication interval is divided into three phases, intra-group channel estimation (ChEst), inter-group ChEst, and data communication. The duration of intergroup ChEst T inter depends on the channel variation due to environment changes. The length T intra of intra-group ChEst depends on the intra-group channel variation due to residual synchronization error and is used in evaluating overhead in Section 7.
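The preamble construction described in this subsection can be illustrated with a short NumPy sketch. The root index `root = 1`, the odd-length ZC formula used here, and the helper names are assumptions for illustration; the values $N_{zc} = 63$, $M = 10$ and $T_s = 1$ µs are the ones quoted elsewhere in the paper.

```python
import math
import numpy as np

def zadoff_chu(n_zc: int, root: int = 1) -> np.ndarray:
    """Odd-length ZC sequence; `root` must be coprime with `n_zc`."""
    if n_zc % 2 == 0 or math.gcd(root, n_zc) != 1:
        raise ValueError("need odd n_zc and gcd(root, n_zc) == 1")
    n = np.arange(n_zc)
    return np.exp(-1j * np.pi * root * n * (n + 1) / n_zc)

def cyclic_autocorr(s: np.ndarray) -> np.ndarray:
    """R[l] = sum_n s[n] * conj(s[(n - l) mod N]) for l = 0..N-1."""
    return np.array([np.vdot(np.roll(s, lag), s) for lag in range(len(s))])

def make_preamble(n_zc: int = 63, m_rep: int = 10, root: int = 1) -> np.ndarray:
    """M back-to-back repetitions of one ZC period."""
    return np.tile(zadoff_chu(n_zc, root), m_rep)

s_zc = zadoff_chu(63)
acorr = cyclic_autocorr(s_zc)        # N_zc at lag 0, numerically zero elsewhere
preamble = make_preamble(63, 10)
t_s = 1e-6                           # sample duration at a 1 MHz sampling rate
preamble_duration = len(preamble) * t_s  # M * N_zc * T_s
```

The cyclic autocorrelation is checked over one period only; inside the repeated preamble, a window of $N_{zc}$ samples sees the same cyclic structure, which is what the detector relies on.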
Procedure
The master radio obeys the frame structure defined above, and implements it using its own clock.
In an asynchronous manner within the group, each slave radio actively searches for the synchronization preamble. Upon detection of the preamble, the slave uses the received signal to estimate the carrier and timing offsets relative to the master, and uses the estimates $\hat{\Delta f}$ and $\hat{\Delta\tau}$ to adjust the baseband signal during the cooperative communication period that follows. Specifically, the slave utilizes the known idle period for the processing necessary to adjust the baseband signal such that the transmitted and received inter-group signals are synchronized. The slave radio also utilizes the periodic nature of the preamble to retain tight synchronization as well as to improve estimation accuracy. Cooperative communication occurs when synchronization is reached within a specified accuracy range, i.e., residual synchronization error. As such, the cooperative communication stage has residual CFO $\Delta_{R,F} \triangleq \Delta f - \hat{\Delta f}$ and STO $\Delta_{R,T} \triangleq \Delta\tau - \hat{\Delta\tau}$, which are much smaller than the original errors $\epsilon_F$ and $\epsilon_T$.
PROPOSED SYNCHRONIZATION ALGORITHMS
In this section, we present the proposed DSP based synchronization algorithm used by the slave radios.
For notational convenience we first re-write the STO as $\Delta\tau = \epsilon_T T_s = (d + \zeta) T_s$, where $\epsilon_T$ is the sample-duration-normalized timing offset, with integer part $d = \lfloor \Delta\tau / T_s \rfloor$ and fractional part $\zeta = \epsilon_T - d$, i.e., $\zeta \in [0, 1)$. The $T_s$-normalized residual timing error is denoted as $\epsilon_{R,T} \triangleq \Delta_{R,T}/T_s$. As before, the normalized CFO is $\epsilon_F = 2\pi \Delta f T_s$.
The received signal (1), after plugging in the ZC sequence expression and using the above notational modification, is rewritten as
Preamble Detection and Integer Sample Timing Recovery
The zero-autocorrelation property of the ZC sequence provides a simple method for preamble detection. 
where $\eta$ is the detection threshold whose optimal value can be calculated based on the thermal noise power. The 2-norm summation over $M$ adjacent windows of $N_{zc}$ samples avoids the phase rotation caused by the frequency offset (see footnote 5).
Moreover, based on the location of the correlation peak, the integer timing offset $d$ can be estimated via
It is worth noting that the perfect autocorrelation property of ZC sequences is degraded by the fractional-sample delay between master and slave. In other words, when $\zeta \approx 0.5$, a small variation in $y_{corr}[n]$ due to AWGN causes the correlation peak to lock to either $\hat{d} = d$ or $\hat{d} = d + 1$. To deal with this, a fractional delay estimator must be designed and implemented, as we describe later.
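The detection and integer-timing step described above can be sketched as follows: correlate with one ZC period, sum the squared magnitudes of $M$ correlator outputs spaced $N_{zc}$ samples apart, compare the peak with a threshold, and take the peak location as $\hat{d}$. The function names, the root index, the test signal, and the threshold value are assumptions for illustration only.

```python
import numpy as np

def zadoff_chu(n_zc, root=1):
    n = np.arange(n_zc)
    return np.exp(-1j * np.pi * root * n * (n + 1) / n_zc)

def detect_preamble(r, s_zc, m_rep, eta):
    """Return (detected, d_hat) from a non-coherent sum over M ZC windows."""
    n_zc = len(s_zc)
    # c[n] = sum_k r[n + k] * conj(s_zc[k]); np.correlate conjugates its 2nd argument
    c = np.correlate(r, s_zc, mode="valid")
    p = np.abs(c) ** 2
    n_out = len(p) - (m_rep - 1) * n_zc
    if n_out <= 0:
        raise ValueError("input shorter than one full preamble")
    y = np.zeros(n_out)
    for m in range(m_rep):           # squared magnitudes are insensitive to CFO phase
        y += p[m * n_zc : m * n_zc + n_out]
    d_hat = int(np.argmax(y))
    return bool(y[d_hat] > eta), d_hat

# Hypothetical test signal: integer delay of 17 samples, CFO of 0.01 rad/sample
rng = np.random.default_rng(0)
n_zc, m_rep, d_true, eps_f = 63, 10, 17, 0.01
s_zc = zadoff_chu(n_zc)
tx = np.concatenate([np.zeros(d_true), np.tile(s_zc, m_rep), np.zeros(50)])
n = np.arange(len(tx))
noise = 0.05 * (rng.standard_normal(len(tx)) + 1j * rng.standard_normal(len(tx)))
r = tx * np.exp(1j * eps_f * n) + noise
eta = 0.5 * m_rep * n_zc ** 2        # assumed threshold: half the ideal peak value
detected, d_hat = detect_preamble(r, s_zc, m_rep, eta)
```

In this sketch the threshold is fixed relative to the ideal peak; a noise-power-based threshold, as the text suggests, would replace the assumed `eta`.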
CFO Estimation and Compensation
We start with a one-shot estimation algorithm which utilizes the known repetition pattern of the preamble signal for CFO estimation. Because the received signal within the sample window $[\hat{d}, \hat{d} + M N_{zc} - 1]$, i.e., the estimated time window for $M$ consecutive ZC sequences in terms of the slave's clock, is periodic with period $N_{zc}$, we propose to use
$$\hat{\epsilon}_F = \frac{1}{N_{zc}} \angle\left( \sum_{n=\hat{d}}^{\hat{d}+(M-1)N_{zc}-1} r[n+N_{zc}]\, r^*[n] \right) \quad (7)$$
as the normalized CFO estimate.
The estimate for $\Delta f$ is available by straightforward scaling of $\hat{\epsilon}_F$. The angle operator $\angle(\cdot)$ returns the phase of a complex number and can be computed by
Furthermore, this approach intrinsically assumes that the phase rotation due to CFO is unwrapped, i.e., $\epsilon_F N_{zc} < 2\pi$. With the value $N_{zc} = 63$ this condition is easily met (see footnote 5). For applications in which a longer ZC sequence is required for more processing gain, [22] provides an algorithm for wrapped phase measurement.
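The numbers behind this condition can be reproduced with a few lines of arithmetic, using the ±5 ppm oscillator tolerance, 2.4 GHz carrier and 1 MHz sampling rate quoted in footnote 5. The variable names are illustrative, and the "long sequence" case simply treats the whole $M N_{zc}$ preamble as a single sequence for comparison.

```python
import math

f_c = 2.4e9        # carrier frequency in Hz
ppm = 5e-6         # oscillator tolerance (5 ppm)
f_s = 1e6          # sampling rate in Hz, so T_s = 1 us
n_zc, m_rep = 63, 10

delta_f_max = ppm * f_c                       # worst-case CFO in Hz
eps_f_max = 2 * math.pi * delta_f_max / f_s   # normalized CFO in rad/sample
rot_one_period = eps_f_max * n_zc             # phase rotation across one ZC period
rot_full_preamble = eps_f_max * n_zc * m_rep  # rotation across a 630-sample sequence
sto_drift_samples = ppm * 10e-3 * f_s         # clock-skew drift over 10 ms, in samples

print(f"max CFO: {delta_f_max / 1e3:.1f} kHz, eps_F: {eps_f_max:.4f} rad/sample")
print(f"rotation over N_zc: {rot_one_period:.2f} rad, over M*N_zc: {rot_full_preamble:.1f} rad")
print(f"STO drift in 10 ms: {sto_drift_samples:.3f} T_s")
```

The rotation over one 63-sample period stays below $2\pi$, while a single 630-sample sequence would see several full rotations, which is the motivation for repeating a short sequence.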
Because our system uses a dedicated preamble for CFO estimation rather than an unmodulated tone [13], the delay error is crucial in the design. The reason for using auto-correlation in (7) is its resilience to the unknown fractional delay, i.e., in the high $SNR_{sync}$ regime,
gives an estimate of $e^{j \epsilon_F N_{zc}}$ regardless of $\epsilon_T$ and the potential 1-sample ambiguity in $\hat{d}$. In contrast, approaches which rely on directly decoding $s_{zc}$ cannot avoid estimating the fractional delay for CFO estimation.
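A minimal sketch of a repetition-based CFO estimator of this kind is given below: samples one ZC period apart are multiplied (one conjugated), accumulated, and the angle of the sum is divided by $N_{zc}$. The function name, root index and test values are assumptions; the test CFO is chosen small enough that the phase across one period does not wrap.

```python
import numpy as np

def zadoff_chu(n_zc, root=1):
    n = np.arange(n_zc)
    return np.exp(-1j * np.pi * root * n * (n + 1) / n_zc)

def estimate_cfo(r, d_hat, n_zc, m_rep):
    """Normalized CFO (rad/sample) from the lag-N_zc autocorrelation."""
    seg = r[d_hat : d_hat + m_rep * n_zc]
    # sum_n conj(r[n]) * r[n + N_zc] over (M - 1) * N_zc products
    acc = np.vdot(seg[:-n_zc], seg[n_zc:])
    return np.angle(acc) / n_zc

# Hypothetical noise-free test: unknown complex gain, delay of 5 samples
n_zc, m_rep, d_true, eps_true = 63, 10, 5, 0.03
tx = np.concatenate([np.zeros(d_true), np.tile(zadoff_chu(n_zc), m_rep), np.zeros(10)])
r = 0.8 * np.exp(1j * 0.7) * tx * np.exp(1j * eps_true * np.arange(len(tx)))

eps_hat = estimate_cfo(r, d_true, n_zc, m_rep)
eps_hat_late = estimate_cfo(r, d_true + 1, n_zc, m_rep)  # window off by one sample
```

The second call illustrates the point made in the text: shifting the window by one sample changes which products enter the sum but not the phase of each product, so the estimate is unaffected in the noise-free case.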
Footnote 5: The normalized CFO is up to $\epsilon_F \approx 0.075$ rad based on the maximum USRP frequency offset of ±5 ppm, when using a 2.4 GHz carrier frequency and a 1 MHz sampling rate. The complex envelope due to CFO has its phase rotating by more than $2\pi$, which severely degrades the correlation peak if a long ZC sequence is used instead of multiple repetitions of a shorter sequence. With such a setting, the ±5 ppm sample clock skew introduces $0.05 T_s$ of STO drift within 10 ms, which is negligible in a DBF application as shown later in our simulations.

The CFO estimate can be filtered to provide more accuracy. The extended Kalman filter (EKF) is commonly used as an averaging filter [13]. The state vector consists of the phase offset $\phi$ and normalized CFO $\epsilon_F$ between the master and slave. Assuming the channel variation $h_0$ is compensated, the state update function is expressed as the following linear equation
where we add subscript $k$ to indicate the $k$-th preamble for clarity, and we use this notation in the remainder of this subsection. Note that $N_{CYC,k}$ is the number of samples between the $k$-th and the $(k+1)$-th preambles. By default $N_{CYC,k} = T_{CYC}/T_s$, $\forall k$, if the master hardware timer is perfect. In practice, slaves should retain a timer to measure the actual time gap between adjacent synchronization preambles. The vector $\mathbf{v}_k$ contains the phase drift and frequency drift values at time instant $k$.
The observation vector contains the indirect measurement of the phase offset estimator and CFO estimator from (7) .
Specifically, we use $\cos(\phi_k) = \Re(r[\hat{d}_k])$ and $\sin(\phi_k) = \Im(r[\hat{d}_k])$ from the actual received signal. The vector $\mathbf{w}_k$ contains measurement errors whose variances are determined in an empirical manner. In processing the preamble, the slave filters the estimated CFO via standard Kalman filter steps [23] and provides a CFO estimate $\hat{\epsilon}_F$ with better precision.
The CFO in the received signal can be compensated via
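A state-space model consistent with the description above can be written as follows. The matrix layout and the ordering of the observation vector are assumptions inferred from the text (a linear update driven by $N_{CYC,k}$, and observations of $\cos\phi_k$, $\sin\phi_k$ and the CFO estimate), not a reproduction of the paper's equations.

```latex
\begin{aligned}
\mathbf{x}_k &= \begin{bmatrix} \phi_k \\ \epsilon_{F,k} \end{bmatrix}, \qquad
\mathbf{x}_{k+1} = \begin{bmatrix} 1 & N_{CYC,k} \\ 0 & 1 \end{bmatrix} \mathbf{x}_k + \mathbf{v}_k, \\[4pt]
\mathbf{z}_k &= \begin{bmatrix} \cos(\phi_k) \\ \sin(\phi_k) \\ \epsilon_{F,k} \end{bmatrix} + \mathbf{w}_k .
\end{aligned}
```

The update is linear because the phase simply accumulates $\epsilon_{F,k}$ radians per sample over $N_{CYC,k}$ samples; the nonlinearity that calls for an extended Kalman filter sits in the $\cos$ and $\sin$ observations.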
and the following baseband signal is used in cooperative communication. It is worth noting that the above approach is not intended to adjust the phase offset.
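The smoothing and compensation steps can be illustrated with a deliberately simplified sketch. Instead of the two-state EKF described above, the code below uses a scalar Kalman filter that tracks only the normalized CFO as a slowly drifting constant, and then derotates the samples by the smoothed estimate. The function names, noise variances and test data are assumptions for illustration.

```python
import numpy as np

def kalman_smooth_cfo(measurements, meas_var, drift_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk CFO observed once per preamble."""
    x, p = x0, p0
    track = []
    for z in measurements:
        p = p + drift_var            # predict: CFO may drift between preambles
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # update with the one-shot estimate
        p = (1.0 - k) * p
        track.append(x)
    return np.array(track)

def compensate_cfo(r, eps_hat):
    """Derotate samples by the estimated normalized CFO (rad/sample)."""
    n = np.arange(len(r))
    return r * np.exp(-1j * eps_hat * n)

# Hypothetical data: 300 noisy one-shot estimates around a true CFO of 0.03 rad/sample
rng = np.random.default_rng(0)
eps_true = 0.03
z = eps_true + 0.01 * rng.standard_normal(300)
track = kalman_smooth_cfo(z, meas_var=1e-4, drift_var=1e-10)

tone = np.exp(1j * eps_true * np.arange(1000))
flat = compensate_cfo(tone, eps_true)   # ideal compensation leaves a constant
```

As in the text, this derotation removes the frequency offset only; any constant phase offset between the radios is left untouched.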
Fractional Sample Timing Recovery
Since the correlation and peak detector only provide coarse timing information, i.e., nearest receiver sample instance, the accuracy is largely compromised. We propose to use a maximum likelihood estimator for fractional delay estimation and utilize the fact that the search range is within [0, T s ) of the estimated integer delay.
Because the integer delay estimator $\hat{d}$ is sensitive to AWGN when $\zeta$ is close to 0.5, an approach for dealing with such errors is required. We propose a simple approach: the fractional delay candidates are set to lie within $[-0.5 T_s, 0.5 T_s)$ rather than $[0, T_s)$ in estimation. This automatically compensates for an integer estimation error.
For this purpose, the digitized signal is correlated with a bank of filters, each of which is a fractionally delayed ZC sequence.
where the $\zeta$-fractionally delayed ZC sequence $r_{zc}^{(\zeta)}[n]$ is obtained from linear interpolation, i.e.,
Note that the optimal design requires knowledge of the transmitter RF filter, $p_{ps}(t)$, but mismatch in this knowledge causes only minor degradation. Practically, $N_\zeta$ fractionally delayed ZC sequences are used, and the $\zeta$-candidates are chosen from a dictionary with a step size of $T_s/(N_\zeta + 1)$, i.e.,
and we choose $N_\zeta$ according to the residual STO target.
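The filter-bank search can be sketched as follows. Each candidate is a linearly interpolated, cyclically shifted ZC period; the received window is correlated with every candidate and the best match is kept. The uniform candidate grid with step $1/N_\zeta$ on $[-0.5, 0.5)$, the energy normalization of the metric, the cyclic shift over one period, and all names are assumptions for illustration.

```python
import numpy as np

def zadoff_chu(n_zc, root=1):
    n = np.arange(n_zc)
    return np.exp(-1j * np.pi * root * n * (n + 1) / n_zc)

def frac_delayed(s, zeta):
    """Linear interpolation of one cyclic period, delay zeta in [-0.5, 0.5) samples."""
    if zeta >= 0:
        return (1 - zeta) * s + zeta * np.roll(s, 1)    # np.roll(s, 1)[n] = s[n - 1]
    return (1 + zeta) * s - zeta * np.roll(s, -1)       # np.roll(s, -1)[n] = s[n + 1]

def estimate_frac_delay(r, s_zc, m_rep, n_zeta):
    """Pick the candidate whose delayed preamble best matches r (unknown gain)."""
    candidates = -0.5 + np.arange(n_zeta) / n_zeta
    metrics = []
    for zeta in candidates:
        ref = np.tile(frac_delayed(s_zc, zeta), m_rep)
        # normalized matched-filter output; interpolation changes the reference energy
        metrics.append(abs(np.vdot(ref, r)) ** 2 / np.vdot(ref, ref).real)
    return float(candidates[int(np.argmax(metrics))])

# Hypothetical noise-free tests with an unknown complex gain
s_zc = zadoff_chu(63)
h0 = 0.9 * np.exp(1j * 1.1)
r_pos = h0 * np.tile(frac_delayed(s_zc, 0.25), 10)
r_neg = h0 * np.tile(frac_delayed(s_zc, -0.375), 10)
zeta_pos = estimate_frac_delay(r_pos, s_zc, 10, 16)
zeta_neg = estimate_frac_delay(r_neg, s_zc, 10, 16)
```

Allowing negative candidates is what absorbs a one-sample error in $\hat{d}$: a true offset of $d + 0.6$ detected as $\hat{d} = d + 1$ shows up as a fractional delay of $-0.4$.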
Computational complexity
The complexity is divided into two parts: always-on preamble detection and the detection-triggered synchronization algorithm. The detection algorithm uses an $N_{zc}$-tap finite impulse response (FIR) filter and therefore requires $M N_{zc}$ complex multiplications and accumulations at each sample. Once the preamble detection is triggered, a total of $M N_{zc}$ samples are used for additional processing, i.e., CFO estimation and fractional delay estimation. The former requires $(M - 1) N_{zc}$ complex multiplications and accumulations, one division and one phase computation. The latter uses $N_\zeta$ pre-computed fractionally delayed ZC sequences. Correlation with each sequence requires $M N_{zc}$ complex multiplications and accumulations, for a total of $M N_{zc} N_\zeta$ complex operations.
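For a sense of scale, the counts above can be evaluated with the parameter values reported for the implementation in Section 6 ($M = 10$, $N_{zc} = 63$, $N_\zeta = 16$). This is plain arithmetic on the expressions in the text.

```latex
\begin{aligned}
\text{detection (per input sample):} \quad & M N_{zc} = 10 \times 63 = 630 \\
\text{CFO estimation (per preamble):} \quad & (M-1) N_{zc} = 9 \times 63 = 567 \\
\text{fractional delay (per preamble):} \quad & M N_{zc} N_\zeta = 10 \times 63 \times 16 = 10\,080
\end{aligned}
```

At a 1 MHz sampling rate, the always-on detector therefore dominates the load, at roughly $6.3 \times 10^8$ complex multiply-accumulates per second, while the triggered steps run only once per frame.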
ACHIEVABLE AND TARGET ACCURACY ANALYSIS
In this section, we start with a discussion of the theoretically achievable synchronization offset estimator variance. Then, we use simulations to study the impact of residual synchronization error on DBF performance. This analysis serves as a guideline in evaluating our algorithm and SDR implementation.
Analysis of achievable accuracy
Reference [24] shows that the maximum likelihood estimator achieves the Cramér-Rao lower bound but such an estimator actually estimates the accumulation of hardware delay and propagation delay and does not distinguish between them.
From the Cramér-Rao lower bound perspective, the variances of CFO estimation and STO estimation satisfy

$$\mathrm{var}(\hat{\Delta f}) \geq \mathrm{CRLB}(\Delta f) = \frac{3}{2\pi^2 T_{est}^2 \, SNR_{sync}} \quad (14)$$

$$\mathrm{var}(\hat{\Delta\tau}) \geq \mathrm{CRLB}(\Delta\tau) = \frac{12 T_s^3}{\pi T_{est} \, SNR_{sync}} \quad (15)$$

where $T_{est} = M N_{zc} T_s$ is the observation duration [25], [26]. Consider a setting with $T_s = 1$ µs and $M = 10$ repetitions of the ZC sequence, i.e., $T_{est} = 0.63$ ms, and 20 dB SNR. The theoretical CFO estimator standard deviation is 61.9 Hz. With the help of a Kalman filter, the accuracy can be improved by averaging over multiple preambles. On the other hand, the theoretical fractional delay estimation standard deviation is $0.008 T_s$, a specification good enough for cooperative communication as shown in Section 7. This implies that filtering of the delay estimator over multiple preambles is not necessary.
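The two standard deviations quoted above can be checked numerically. The CFO bound is taken as $3/(2\pi^2 T_{est}^2 SNR_{sync})$; for the timing bound, the form $12 T_s^3/(\pi T_{est} SNR_{sync})$ is assumed here because it is the reading that reproduces the quoted $0.008 T_s$. Function names are illustrative.

```python
import math

def crlb_cfo_std_hz(t_est: float, snr_lin: float) -> float:
    """Square root of 3 / (2 * pi^2 * T_est^2 * SNR)."""
    return math.sqrt(3.0 / (2.0 * math.pi ** 2 * t_est ** 2 * snr_lin))

def crlb_sto_std_s(t_s: float, t_est: float, snr_lin: float) -> float:
    """Square root of 12 * T_s^3 / (pi * T_est * SNR) (assumed form)."""
    return math.sqrt(12.0 * t_s ** 3 / (math.pi * t_est * snr_lin))

t_s = 1e-6                    # sample duration
t_est = 10 * 63 * t_s         # M * N_zc * T_s = 0.63 ms
snr_lin = 10 ** (20 / 10)     # 20 dB

cfo_std = crlb_cfo_std_hz(t_est, snr_lin)
sto_std_in_ts = crlb_sto_std_s(t_s, t_est, snr_lin) / t_s

print(f"CFO std bound: {cfo_std:.1f} Hz")
print(f"STO std bound: {sto_std_in_ts:.4f} T_s")
```

The CFO bound for a single preamble is well above the 20 Hz target derived later from simulation, which is why averaging over multiple preambles matters for frequency but not for timing.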
It is worth noting that the proposed approach for timing estimation does not incorporate the propagation delay between master and slaves. In other words, the estimator is not unbiased, and in the proposed approach $\mathbb{E}|\epsilon_{R,T}| \geq L/c$, where $L$ is the longest intra-group distance between master and slaves. This design is intended for small-scale distributed cooperative communication and therefore the propagation delay within the group is not addressed. The work [16] provides a contrasting timing estimator design where closed-loop forward and backward links are used to address the intra-group propagation delay, at the cost of having to synchronize each master/slave pair independently rather than jointly by broadcasting.
Simulation of synchronization requirements
First we use simulations to study the impact of residual synchronization error on system performance. We consider a distributed beamforming system for range extension in line of sight (LOS) channels between $N_T$ transmit and $N_R$ receive radios. The inter-group distance is 20 km. The intra-group radio distance is 50 m on average, and the LOS environment features a very sparse channel with rank 1. The SNR between any pair of transmitter and receiver radios is -1.5 dB. Simulations were conducted with perfect phase adjustment at the beginning of the cooperative communication sub-frame, i.e., $N_T^2 N_R$ beamforming gain. The gain drops as the phase coherence degrades due to residual frequency offset. We use the Signal to Interference and Noise Ratio (SINR) as a metric to evaluate the coherence of the system. The interference is due to the intersymbol interference resulting from the timing synchronization errors among symbols.
We evaluate the performance of the system under Gaussian distributed frequency or timing residual errors individually. From this evaluation, we estimate the required accuracy of synchronization. To achieve a Bit Error Rate (BER) of $10^{-5}$ using 16-QAM modulation, we need an SINR of at least 20 dB. We wish to determine the highest acceptable RMS residual timing and frequency errors to achieve this SINR.
For frequency errors, Figure 3 shows the post-beamforming SINR at the end of a frame where the cooperative communication duration is 5 ms. In this figure, the x-axis refers to the number of transmitter ($N_T$) and receiver ($N_R$) radios. The results indicate that to achieve a 20 dB SINR using a group consisting of $N_T = N_R = 8$ radios, we need the RMS frequency error $\Delta_{R,F} = \Delta f - \hat{\Delta f}$ to be within 20 Hz. As for the timing with the same group size, Figure 4 shows that the post-beamforming SINR is greater than 20 dB when the RMS residual timing error is within $T_s/8$.
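A back-of-the-envelope check helps put these requirements in context. The sketch below computes the ideal $N_T^2 N_R$ gain for eight radios per group, the resulting post-beamforming SNR for the -1.5 dB per-link SNR used in the simulation, and the phase accumulated by a 20 Hz residual offset over the 5 ms communication interval. It ignores all synchronization losses and is not a substitute for the simulation; function names are illustrative.

```python
import math

def ideal_dbf_gain_db(n_t: int, n_r: int) -> float:
    """Ideal joint transmit/receive beamforming gain N_T^2 * N_R in dB."""
    return 10.0 * math.log10(n_t ** 2 * n_r)

def phase_drift_rad(residual_hz: float, duration_s: float) -> float:
    """Phase accumulated by a constant residual frequency offset."""
    return 2.0 * math.pi * residual_hz * duration_s

gain_db = ideal_dbf_gain_db(8, 8)           # 10 * log10(512)
post_bf_snr_db = -1.5 + gain_db             # per-link SNR plus ideal gain
drift = phase_drift_rad(20.0, 5e-3)         # 20 Hz over 5 ms

print(f"ideal gain: {gain_db:.2f} dB, ideal post-BF SNR: {post_bf_snr_db:.2f} dB")
print(f"phase drift at end of frame: {drift:.3f} rad ({math.degrees(drift):.0f} deg)")
```

With perfect coherence the link would sit a few dB above the 20 dB target, so only a limited loss from phase drift and intersymbol interference can be tolerated, consistent with the 20 Hz and $T_s/8$ figures above.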
SOFTWARE-DEFINED RADIO IMPLEMENTATION
The proposed protocol and the synchronization algorithm were implemented using the USRP N200/N210 software defined radio kits. In our system, we used 5 USRPs. One USRP was used as a master, two USRPs as slaves, and two USRPs acted as receivers whose purpose was to evaluate the performance of the system. In this section, we start by introducing the capabilities of the SDR kits. Then we describe the parameters and details of our implementation.
Introduction to the USRP N200/N210 platform
The Ettus USRP N200/N210 series is a software defined radio kit designed for RF applications from DC to 6 GHz. The RF capabilities of the USRP are determined by the installed RF daughter-board. Our implementation used a set of heterogeneous RF daughter-boards including the SBX, whose RF frontend can operate in the frequency range from 0.4 to 4.4 GHz, and the XCVR2450, which is operational from 2.4 to 2.5 GHz and from 4.9 to 6.0 GHz. Besides the RF frontend, the USRP consists of an analog-to-digital converter (ADC) and a digital-to-analog converter (DAC) as well as a field-programmable gate array (FPGA). The FPGA implements some simple digital signal processing such as upsampling and downsampling signals to the rates required by the ADC/DAC. It is also used to communicate with the host computer. On the host, the USRP Hardware Driver (UHD) provided by Ettus is used to interface the host to the USRP. Digital signal processing software such as GNU Radio can be used to operate the USRP. Each USRP has its own oscillator and its own timing clock. In order to support MIMO communications, three options are available for carrier and timing synchronization [27]: (1) a MIMO cable, which lets two USRPs share both frequency and timing signals, (2) an external 10 MHz carrier reference clock and pulse-per-second (PPS) timing synchronization signal, or (3) a GPS Disciplined Oscillator (GPSDO). The first two options require external connections, which are not suitable for a distributed system. As for the GPSDO, its accuracy depends on the ability of the USRP to receive a GPS signal, which renders it unsuitable for indoor deployments, and it gives limited frequency synchronization ability.
It is worth noting that the USRP internal clock is not stable as it retunes the frequency after each transmission and this results in abrupt frequency changes. This phenomenon is a result of the carrier generator design and could have been avoided by alternate design choices. As a stable external clock is required for implementing distributed beamforming and MIMO communications, we resorted to attaching each USRP to a stable external clock source.
Implementation Details of intra-group synchronization
The proposed algorithm for timing and frequency synchronization was implemented using GNU Radio. GNU Radio is an open-source toolkit for software defined radios and signal processing. It was used to operate all the USRPs in the experiment.
We start by describing the parameters of the implemented system and then briefly discuss the details of the implementation. Ten repetitions ($M = 10$) of a ZC sequence of length $N_{zc} = 63$ were used in the pilot signal sent by the master. After transmitting the pilots, 4 ms of guard time was provided for the slaves to do their processing, and then data was transmitted. The slaves continuously correlate the received signal with the ZC sequence, looking for a peak. Once they detect it, they schedule their data transmission after the guard time. The guard time is used to estimate the frequency offset, which is smoothed by Kalman filtering as described earlier.
The 4 ms guard time was chosen as a conservative value since the processing occurs in the PC and the latency needs to be taken into account. One way to reduce such overhead is to implement the synchronization protocol in the FPGA of the USRP, which would enable us to use a much shorter guard time, as was done in [17]. As the generation of the master signals and capturing the data for analysis at the receiver nodes is straightforward, we focus on describing the implementation of the slaves. The block diagram of the system is shown in Figure 5. The input signal from the USRP is sent to the input of an FFT filter which continuously correlates with a ZC sequence. The output of the correlator is passed through a peak detector, which works by taking a moving average and outputting a trigger when the input surpasses ten times the average. This is used to estimate the integer delay $d$ similar to what is described in (6). This trigger is passed to the CFO EKF block, which is a custom block written in C++. This block estimates the phase between successive repetitions of ZC sequences and processes the input as described earlier using a Kalman filter to obtain estimates of the residual frequency error. As for calculating the correct time of the transmission and compensating for the frequency and phase errors, we developed the frac timing freq phase synch block. It takes as input the first trigger corresponding to the first occurrence of a ZC sequence. To calculate the fractional delay $\zeta$, this block performs the processing described by (11) on the input from the USRP using $N_\zeta = 16$. Once this trigger is detected and the fractional part is estimated, a burst of data is scheduled to be transmitted after a period equal to the guard time. A frequency synthesizer in this block uses the phase and frequency obtained from the CFO EKF block to compensate for the error on the data to be transmitted, which is obtained from a vector source.
The data after frequency and phase adjustment is then marked at the beginning and the end by burst tags, and a tx_time tag containing the correct transmission time is placed on the first sample of the burst. Burst tags are a feature of the USRP Hardware Driver (UHD), which allows the USRP to perform bursty transmissions. The tx_time tag gives users high-accuracy control of the transmission time of a data burst. It works by sending a timestamp of the desired transmission time to the USRP along with the data. The USRP delays the burst transmission until the correct time has come. This timestamp is not transmitted over the air; it is used only to control the USRP.
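The scheduling arithmetic implied by this description can be sketched in plain Python, without any GNU Radio dependency. The way the estimated integer and fractional delays enter the timestamp, and the helper names, are assumptions for illustration; the split into whole and fractional seconds mirrors the pair format commonly used for such timestamps.

```python
import math

def burst_tx_time(t_first_sample, d_hat, zeta_hat, t_s, m_rep, n_zc, t_guard):
    """Hypothetical transmit time: preamble start + preamble length + guard time."""
    t_preamble_start = t_first_sample + (d_hat + zeta_hat) * t_s
    return t_preamble_start + m_rep * n_zc * t_s + t_guard

def split_time(t):
    """Split a time in seconds into (whole seconds, fractional seconds)."""
    whole = math.floor(t)
    return int(whole), t - whole

# Illustrative values: buffer starts at t = 100 s on the slave clock,
# preamble detected 17.25 samples later, T_s = 1 us, 4 ms guard time
t_tx = burst_tx_time(100.0, 17, 0.25, 1e-6, 10, 63, 4e-3)
whole_secs, frac_secs = split_time(t_tx)
```

In this sketch the fractional delay estimate only shifts the timestamp; whether the correction is applied through the timestamp, through resampling of the transmitted data, or both is a design choice that the sketch does not settle.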
As mentioned in Section 5, our timing synchronization algorithm does not distinguish between the hardware timing offset and the propagation delay, and therefore the accuracy depends strongly on differences in intra-group distances among radios. In our experiment, the relative distances among the transmit nodes were less than 1 m, which results in a timing bias of up to 3.3 ns. The distance between the transmitter and the slaves was less than 3 m, and the SNR was controlled by adjusting the transmission power gain in the USRPs. A synchronization protocol that incorporates propagation delay is left for future work, as this delay becomes more critical when the intra-group distance and signal bandwidth become larger.
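The size of this bias relative to the sample duration follows from simple arithmetic, using the 1 m distance difference above and the 1 MHz sampling rate ($T_s = 1$ µs) of the experiment.

```latex
\Delta\tau_{prop} = \frac{L}{c} \approx \frac{1\ \text{m}}{3 \times 10^{8}\ \text{m/s}} \approx 3.3\ \text{ns}
\quad\Longrightarrow\quad
\frac{\Delta\tau_{prop}}{T_s} \approx \frac{3.3\ \text{ns}}{1\ \mu\text{s}} \approx 0.0033
```

This is far below the $T_s/16$ bin width used to report the timing results, which is why the bias can be neglected in this setup; it scales linearly with both the distance difference and the sampling rate.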
EXPERIMENT RESULTS
In this section, we present the experimental synchronization accuracy using the SDR platform.
Experimental Setup
Our proposed protocol and algorithms were tested using a system consisting of one master and two slaves. A high sampling rate oscilloscope and two additional USRPs were used to evaluate the timing and frequency synchronization.
To overcome the instability of the internal oscillator of the USRP, each of the master and two slaves was connected to a unique external oscillator. The two USRPs used for evaluating the system were synchronized using a MIMO cable to avoid having any timing and frequency drifts between them to provide accurate measurements of the performance of the slaves. The oscilloscope used was the Tektronix DLS6154, which has a sampling rate up to 40 GSample/s, which enabled us to obtain a high resolution estimate of any remaining timing error between the master and slaves. A schematic of this setup is shown in Figure 6 , while a photo of the actual setup is shown in Figure 7 .
For this experiment, the carrier frequency used was f c = 2.4 GHz and the signal bandwidth of the master and slaves was 1/T s = 1 MHz. The two synchronized USRPs were used to measure the residual CFO (RCFO). Because the RCFO was below 50 Hz, the measuring USRPs used a lower sampling rate of 0.25MS/s to capture the compensated signals from the slaves. Then the captured signal was digitally downsampled by a factor of 250. By using a 256-point FFT for our RCFO evaluation, we were able to measure the RCFO with a resolution of 0.5Hz. For measuring the residual timing offset we used the Tektronix DLS6154 oscilloscope. Data was recorded using the oscilloscope with a time resolution of 800ps. The residual timing error was estimated by correlating the captured waveform from the master and one of the slave USRPs. A total of 650 waveforms were captured during a period of four hours in order to incorporate long-term effects such as heating.
The host computers used as slaves were a Dell precision 3520 having a Xeon E3-1505M V6 processor and 8 GB of RAM, and a Thinkpad T430 having an Intel I7-3520M processor with 8 GB of RAM. The master and the evaluation USRPs used a Thinkpad with similar specs as host. We start by showing the residual frequency offset results followed by the residual timing offset results.
Carrier Synchronization Results
The residual CFO between the two slaves was calculated based on measurements captured from the two slaves using the evaluation USRPs. Figure 8 shows the histogram of the measured RCFO with each bin having a width of 2.5 Hz. The mean, standard deviation, and tail probability of the residual error are summarized in Table 1 . Due to the fact that phase synchronization, i.e., intra-group channel estimation, is not implemented, we use the results of [29] as a benchmark where the mean residual frequency error is provided in the table.
Timing Synchronization Results
Figure 9 shows the histogram of the residual timing offset between the master and one of the slaves. Each bin has a width of $T_s/8$, meaning the center bin represents residual error within $\pm T_s/16$. The mean, standard deviation and tail probability of the residual error in terms of the sample duration $T_s$ are summarized in Table 2. A comparison with benchmark results from [17] and [15] is also included. Note that [17] does not intend to adjust CFO, and its STO synchronization accuracy can be affected by an actual multipath environment due to its fully wireless setting. Works [15] and [16] are tested in a wired environment where the carrier is perfectly synchronized among radios.
DISCUSSION
The performance results for timing and frequency synchronization show that the synchronization system achieved the requirements for a typical distributed beamforming scenario with the parameters described in Section 7. These results are comparable to the performance that can be obtained using a GPSDO, as mentioned in Section 6, without the added cost of the GPSDO module or the limitations of GPS signals. By using an EKF over a periodic preamble, we get higher accuracy than what is achievable without the EKF as calculated from the CRLB (14). The achievable accuracy in frequency adapted to retain robustness and achieve the requirements of a specific waveform in a multipath environment. One possible way to extend this work is by porting the implementation from the host PC to the FPGA. This will enable us to reduce the required latency and to deploy the system using higher sample rates.
CONCLUSION
In this work, we have developed and implemented a joint carrier frequency and sample timing synchronization algorithm and protocol for cooperative communication in distributed arrays. Using the received baseband signal and the proposed digital signal processing techniques, different radios with separate reference clocks can achieve less than 5 Hz residual frequency offset 75 percent of the time. The residual timing offset is within 1/16 of the preamble sample duration 68 percent of the time. Our simulations show that this specification provides near optimal signal power gains when distributed beamforming is used for range extension applications in a typical setting.
APPENDICES A. NOMENCLATURE
The notation used in the main text is summarized in Table 2. The values of the parameters used in the USRP implementation, when applicable, are also provided.
