Abstract-Conventional orthogonal frequency division multiplexing (OFDM) communication systems are typically designed assuming additive white Gaussian noise and interference statistics. However, in many applications, such as Wi-Fi and powerline communications (PLC), impulsive statistics are often observed. Impulsive noise can degrade the signal-to-noise ratio (SNR) of all subcarriers and impair communication performance. In this work, we design and implement a real-time OFDM receiver with approximate message passing (AMP) to estimate and mitigate impulsive noise. The goal is to meet throughput and latency requirements while guaranteeing improved communication performance in impulsive noise. Our contributions include (i) modeling functional parallelism in an AMP OFDM receiver in synchronous dataflow, (ii) converting an AMP OFDM PLC receiver to using only fixed-point data and arithmetic, and (iii) mapping the receiver in fixed-point onto a Field Programmable Gate Array (FPGA) target using a high-level graphical synthesis tool. Our FPGA OFDM transceiver testbed achieves full streaming throughput at G3-PLC rates and recovers up to 8 dB SNR of impulsive noise over a wide SNR range.
I. INTRODUCTION
One of the most widely accepted (and thus used) communication channel models consists of a linear time-invariant filter with additive noise. This noise is typically modeled as an additive, white Gaussian noise (AWGN) process with independent, identically-distributed (iid) samples. Though this model accurately reflects thermal noise present in communication system electronics, it fails to capture well-known empirical and physical model-derived noise and interference statistics.
Extensive measurement campaigns over the last several decades have revealed that the statistics of this additive component are in fact impulsive, with spurious components sometimes reaching 40 dB above background noise levels [1]-[3]. Physical interference models, based on the reality of increasingly interference-dominated cellular and wireless access networks, have further reinforced these results [4]. Using a revised noise model, communication systems can be redesigned for robustness and achievable rate in the presence of impulsive noise [5], [6]. In this work, we take this approach to design and implement a real-time OFDM transceiver featuring an impulsive noise estimation and mitigation block (see Fig. 1). Our implementation is realized in hardware using field-programmable gate arrays (FPGAs).
One application that suffers from strong impulsive noise is powerline communications (PLC). PLC systems operate by coupling modulated signals onto wires and transmission lines designed for electrical power delivery. These signals are subject to strong impulses and transient disturbances caused by the multitude of switching devices connected to the power grid. Methods for impulsive noise mitigation applicable to PLC include low-SNR techniques such as nulling and thresholding [5] and more robust, wide-SNR techniques based on sparse (many samples near zero) reconstruction [6], [7]. In this work, we focus on approximate message passing (AMP), a sparse reconstruction technique shown to have high reconstruction performance while being readily parallelizable and scalable using mostly scalar arithmetic. We target our application to G3-PLC, a widely adopted baseband OFDM PLC standard (see Table I). FPGA processing is used in our design in order to deterministically achieve G3-PLC rates.
II. IMPULSIVE NOISE MITIGATION BY AMP
As demonstrated by Caire [6], OFDM systems can exploit the impulse spreading property of the discrete Fourier transform (DFT) to sample impulsive noise in the frequency domain. These samples can be applied in a compressed sensing framework to reconstruct the time-domain noise, and the resulting estimates can be subtracted from incoming samples to reduce impulsive noise prior to equalization and decoding (see Fig. 1). AMP has been formulated as a low-complexity, scalable compressed sensing technique that lends itself to efficient hardware implementation for signal reconstruction [8]-[10]. Recently, Nassar et al. proposed an AMP framework for joint impulsive noise mitigation and channel estimation in OFDM systems [11]. This framework exploits the spreading of impulsive noise information across subcarriers by virtue of the DFT. Frequency-domain noise is used to reconstruct the time-domain noise using AMP. We apply this method using noise samples available in G3-PLC null subcarriers. As listed in Table I, G3-PLC's baseband signaling has 92 such null subcarriers, of which we use 64 (subcarriers 1-22 and 59-100).
B. OFDM Receiver with Message-Passing
An OFDM system with cyclic prefix diagonalizes the circulant channel matrix H using the DFT/IDFT matrices F/F * :
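For reference, a standard way to write this relation is sketched below. It takes $\mathbf{F}$ to be the unitary DFT matrix; the symbols $\mathbf{s}$ (frequency-domain transmit symbols), $\mathbf{y}$ (receiver FFT output), and $\boldsymbol{\Lambda}$ (diagonal matrix of channel frequency-response coefficients) are assumed here for illustration and may differ from the notation used in [11].

```latex
\mathbf{y} = \mathbf{F}\left(\mathbf{H}\mathbf{F}^{*}\mathbf{s} + \mathbf{n}\right)
           = \boldsymbol{\Lambda}\mathbf{s} + \mathbf{F}\mathbf{n},
\qquad
\boldsymbol{\Lambda} = \mathbf{F}\mathbf{H}\mathbf{F}^{*}
```

Under this form, the entries of $\mathbf{s}$ on null subcarriers are zero, so the corresponding entries of $\mathbf{y}$ contain only the transformed noise $\mathbf{F}\mathbf{n}$.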
Let the noise vector $n$ consist of an impulsive and a background component, i.e., $n = x + b$. Samples are drawn from an i.i.d. two-mode Gaussian mixture distribution, where $\gamma_X$ and $\gamma_B$ denote the impulse and background noise power ($\gamma_X \gg \gamma_B$):
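As an illustration, the sketch below draws samples under one common reading of this model: background noise of variance γ_B is always present, and an impulse of variance γ_X is added with probability π. The real-valued samples, the function name `gaussian_mixture_noise`, and this exact parameterization of the two modes are assumptions made for the example.

```python
import numpy as np

def gaussian_mixture_noise(n_samples, pi, gamma_x, gamma_b, rng=None):
    """Draw real-valued i.i.d. noise n = x + b (real samples assumed for simplicity).

    b ~ N(0, gamma_b) is always present; x ~ N(0, gamma_x) is added with
    probability pi, so each sample has variance gamma_b or gamma_b + gamma_x.
    """
    rng = np.random.default_rng() if rng is None else rng
    background = rng.normal(0.0, np.sqrt(gamma_b), n_samples)
    is_impulse = rng.random(n_samples) < pi
    impulses = is_impulse * rng.normal(0.0, np.sqrt(gamma_x), n_samples)
    return impulses + background, is_impulse

# Parameter values taken from the fixed-point study in Section IV.
noise, mask = gaussian_mixture_noise(
    200_000, pi=0.1, gamma_x=0.25, gamma_b=0.0025, rng=np.random.default_rng(0)
)
```

Under these assumptions the average noise power is γ_B + πγ_X, which gives a quick sanity check on generated test vectors.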
The AMP algorithm takes as inputs the samples of the null subcarriers $y_{\Omega_i}$ from the set $\Omega$, where $|\Omega| = M < N$. $N$ is the total number of subcarriers and thus the FFT size of the OFDM system. The process of sampling null subcarriers can be implemented efficiently by indexing the corresponding entries of the FFT output. We express this as $I_\Omega$, the submatrix formed by the rows of the $N \times N$ identity matrix $I$ indexed by $\Omega$. Table II outlines the explicit computational steps of the AMP formulation for this system model. Detailed derivations are provided in [11]. The inputs to this algorithm are the $M$-length vector of null subcarrier samples $y_\Omega$, the background and impulsive noise variances ($\gamma_B$, $\gamma_X$), and the impulse probability $\pi$. The output is the $N$-length time-domain estimate of the impulsive noise $\hat{x}_j(t+1)$, where $t$ denotes the iteration count.
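The indexing view of null-subcarrier sampling can be sketched as follows, using the subcarrier indices given in Section V (1-22 and 59-100, plus their negative-frequency pairs). The helper name `sample_null_subcarriers` and the use of NumPy's FFT are assumptions for illustration only; they do not represent the FPGA datapath.

```python
import numpy as np

N = 256  # FFT size
# Null subcarriers used by the receiver (positive-frequency bins, inclusive).
positive = np.concatenate([np.arange(1, 23), np.arange(59, 101)])
# Their negative-frequency pairs in an N-point complex FFT are bins N - k.
negative = N - positive
omega = np.sort(np.concatenate([positive, negative]))

def sample_null_subcarriers(time_samples):
    """Return the FFT output restricted to Omega by indexing (no matrix product)."""
    return np.fft.fft(time_samples)[omega]

# Equivalent explicit selection matrix I_Omega: the Omega rows of the identity.
I_omega = np.eye(N)[omega]
r = np.random.default_rng(0).normal(size=N)
```

Indexing costs nothing beyond the FFT itself, whereas the explicit product with `I_omega` is shown only to confirm that the two views agree.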
The first step of AMP is an initialization step. The second, or output linear step, involves the forward FFT operation, where $a_{ij}$ denotes the $ij$-th entry of the transform matrix, the DFT in this case. Our target architecture, the Xilinx Virtex-5 FPGA, implements the forward FFT as $a_{ij} = \omega_N^{ij}$, where $\omega_N = e^{-j2\pi/N}$. Our scaling by $N$ and $M$ factors reflects this.
Step 3 is the output non-linear step which consists of simple scalar operations.
Step 4, the input linear step, involves the IFFT.
Step 5 requires the highest depth of sequential operations, the most prohibitive of which is the exponential, since its output has a high dynamic range. This was avoided using an approximation discussed in Section V. Steps 2-5 are repeated until convergence. In our implementation, we use 4 iterations to balance performance with throughput constraints.
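Section V states that the exponential is approximated with a sixth-order Taylor series expansion. A minimal floating-point sketch of such an approximation is given below. The Horner evaluation order, the function name `exp_taylor6`, and the argument range of [-1, 0] used for the error check are assumptions for illustration; the fixed-point FPGA datapath is not reproduced here.

```python
import math

# Coefficients 1/k! for k = 0..6 (sixth-order expansion about zero).
TAYLOR_COEFFS = [1.0 / math.factorial(k) for k in range(7)]

def exp_taylor6(x):
    """Approximate exp(x) with a sixth-order Taylor polynomial in Horner form."""
    acc = 0.0
    for c in reversed(TAYLOR_COEFFS):
        acc = acc * x + c
    return acc

# Worst-case absolute error over an assumed argument range of [-1, 0].
max_err = max(
    abs(exp_taylor6(-i / 1000) - math.exp(-i / 1000)) for i in range(1001)
)
```

Horner form needs only six multiply-add operations per evaluation, which is one reason a polynomial can be attractive on DSP-slice hardware; the error grows quickly outside the range around zero, so the argument scaling matters.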
III. SYNCHRONOUS DATAFLOW MODEL
The streaming operation of the AMP-enhanced OFDM receiver can be modeled in synchronous data flow (SDF) as shown in Fig. 2 . Nodes are application tasks and edges are first-in first-out (FIFO) queues that represent data dependencies. Each task produces and consumes a fixed number of samples in each execution. The tasks in Fig. 2 correspond to (A) sample rate conversion, (B) time and frequency offset correction, (C) FFT and CP removal, (D) AMP noise estimation, (E) FFT, (F) noise subtraction, (H) channel estimation, and (I) channel equalization. The deterministic properties of SDF enable fully-static analysis and synthesis of the system behavior. In particular, a periodic schedule can be determined by static analysis. For example, a periodic schedule for the SDF graph in Fig. 2 is (278A)(278B)CDEFHI. In Section V.B, we convert this SDF to a globally asynchronous, locally synchronous (GALS) model of computation using LabVIEW DSP Design Module. GALS affords more flexibility in hardware timing while still providing benefits of SDF.
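To illustrate how a periodic schedule such as (278A)(278B)CDEFHI follows from static analysis, the sketch below solves the SDF balance equations for a repetition vector. The edge list and token rates are hypothetical, chosen only so that task C consumes 278 samples per firing; they are not taken from Fig. 2.

```python
from fractions import Fraction
from math import lcm

# Edges as (producer, tokens produced, consumer, tokens consumed) per firing.
# Topology and rates are hypothetical, not read from Fig. 2.
EDGES = [
    ("A", 1, "B", 1),
    ("B", 1, "C", 278),
    ("C", 1, "D", 1),
    ("D", 1, "E", 1),
    ("E", 1, "F", 1),
    ("C", 1, "F", 1),
    ("F", 1, "H", 1),
    ("H", 1, "I", 1),
    ("F", 1, "I", 1),
]

def repetition_vector(edges):
    """Solve the balance equations q[src] * prod == q[dst] * cons."""
    q = {edges[0][0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, prod, dst, cons = edge
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
            elif src in q and dst in q:
                if q[src] * prod != q[dst] * cons:
                    raise ValueError("inconsistent rates: no periodic schedule")
            else:
                continue  # neither endpoint reached yet
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer firing counts.
    scale = lcm(*(f.denominator for f in q.values()))
    return {actor: int(f * scale) for actor, f in q.items()}

reps = repetition_vector(EDGES)
```

With these assumed rates, A and B fire 278 times for each firing of the remaining tasks, matching the firing counts in the schedule quoted above.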
IV. FIXED-POINT MAPPING
Before proceeding with the hardware implementation, the AMP impulsive noise mitigation algorithm in Table II needs to be mapped from floating-point to fixed-point. This typically involves determining the dynamic range of each variable and assigning it a fixed-point representation that guarantees a specified level of performance. Although the fraction of in-range data values, i.e., the dynamic range captured by the fixed-point representation, correlates with performance, more complicated dependencies on variable sizings can be observed. To address this, we simulated the algorithm using the fi and NumericScope data types, both part of MATLAB's Fixed-Point Toolbox, as follows: (i) for each run of the algorithm, we log the variable values; (ii) using NumericScope, we size these variables appropriately; (iii) using the fi data type, we simulate the algorithm with the variable sizing found in step (ii) and verify that we meet the target performance. The above procedure was performed for a fixed set of input parameters and input noise statistics, i.e., $\gamma_B = 0.0025$, $\gamma_X = 0.25$, and $\pi = 0.1$. The performance metric for the MATLAB simulation was the reduction in impulsive noise power over a large number of trials using simulated noise with matched parameters. Our fixed-point mapping performed within 0.5 dB of the double-precision floating-point MATLAB version of the algorithm.
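Step (ii) of this procedure can be sketched in Python as choosing the smallest integer width that keeps a target share of logged values in range. This is not the MATLAB workflow used in the paper; the function name `size_signed_variable`, the stand-in logged data, and the convention that the integer width of a signed type includes the sign bit are all assumptions for illustration.

```python
import numpy as np

def size_signed_variable(logged, word_length=16, target_in_range=0.90):
    """Pick the smallest integer width whose range holds the target share of samples.

    Assumes the integer width counts the sign bit, so a signed value with
    n_int integer bits spans [-2**(n_int - 1), 2**(n_int - 1)).
    """
    logged = np.asarray(logged, dtype=float)
    for n_int in range(1, word_length + 1):
        limit = 2.0 ** (n_int - 1)
        in_range = np.mean((logged >= -limit) & (logged < limit))
        if in_range >= target_in_range:
            return n_int, word_length - n_int, float(in_range)
    raise ValueError("no integer width meets the in-range target")

# Stand-in for values logged during a simulation run (hypothetical data).
samples = np.random.default_rng(1).normal(0.0, 0.5, 10_000)
n_int, n_frac, share = size_signed_variable(samples)
```

Every bit not spent on integer range goes to fractional precision, so a lower in-range target trades occasional saturation for finer resolution.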
The resulting variable sizings for the AMP implementation are shown in Table III. Our fixed-point data type representation convention is S$N_W$.$N_I$, where S is replaced by 'U' for unsigned and 'I' for signed, $N_W$ is the wordlength in bits, and $N_I$ is the integer width (or scaling by $2^{N_I}$). The fractional width is $N_W - N_I$. By targeting 90% or higher in-range data values, most variables could be sized within 16-bit wordlengths, an efficient size for the single-cycle digital signal processing (DSP48) blocks in the Virtex-5 FPGA.
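A small quantizer makes the notation concrete. The sketch below rounds values to a given wordlength and integer width; round-to-nearest, saturation on overflow, and counting the sign bit within the integer width of a signed type are assumptions for illustration and may not match the rounding and overflow modes used in the implementation.

```python
import numpy as np

def quantize(values, signed, word_length, int_width):
    """Round to the nearest representable value and saturate on overflow.

    word_length is the total number of bits, int_width the integer bits, and
    word_length - int_width the fractional bits. Assumes the integer width of
    a signed type includes the sign bit.
    """
    step = 2.0 ** -(word_length - int_width)
    if signed:
        lo, hi = -(2.0 ** (int_width - 1)), 2.0 ** (int_width - 1) - step
    else:
        lo, hi = 0.0, 2.0 ** int_width - step
    scaled = np.round(np.asarray(values, dtype=float) / step) * step
    return np.clip(scaled, lo, hi)

# I8.1: signed, 8-bit word, 1 integer bit -> range [-1, 1 - 2**-7], step 2**-7.
q = quantize([0.3, -1.5, 0.999], signed=True, word_length=8, int_width=1)
```

Under these assumptions, 0.3 maps to 38 steps of 2^-7, while -1.5 and 0.999 saturate at the lower and upper ends of the I8.1 range.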
V. FPGA IMPLEMENTATION
An OFDM transceiver using the fixed-point version of the AMP algorithm was implemented across several Xilinx Virtex-5 FPGAs. The intended hardware mapping was to one FPGA transmitter and one FPGA receiver. After initial sizing estimates, the AMP algorithm and the OFDM receiver were not able to fit within a single FPGA. The receiver was partitioned across two FPGAs that share data across the PXI-Express (PXIe) bus, an instrumentation-targeted version of the PCI-Express bus developed by National Instruments. Hardware mapping for the three FPGAs is detailed in the following section and presented graphically in Fig. 3 .
The OFDM transmitter interleaves modulated 2 × I8.1 complex symbols with a conjugate-symmetric pair of data and reference symbols. Reference symbols are encoded as quadrature phase shift keying (QPSK) signals and are interleaved between every 5 data subcarriers. These symbols are used for channel estimation at the receiver. Conjugate symmetry is enforced in order to make the output purely real-valued, and coherent QPSK is used as the data subcarrier modulation.
The receiver consists of two FPGAs. The first FPGA, 'G3RX', performs front-end processing, i.e., resampling and time/frequency synchronization. Resampling is performed using a multi-rate, two-stage finite impulse response (FIR) filter whose coefficients were designed using a LabVIEW multi-rate filter design tool. Synchronization is performed by correlating the signal with a delayed version of itself. Since the cyclic prefix (CP) repeats the end of each symbol, the correlation exhibits a peak at a lag equal to the symbol duration. The location of this peak and its phase can be used for time and carrier frequency offset synchronization [12]. Frequency offset correction is performed using a discrete cosine generator synthesized to 0.1 Hz phase increments using Xilinx CoreGen. The second receiver FPGA, 'AMPEQ', performs the AMP algorithm, channel estimation, and equalization. The AMP algorithm uses frequency-domain null subcarrier samples and parameters as input from 'G3RX'. Reference symbols are deinterleaved and used for channel estimation. Equalization is performed using a zero-forcing (ZF) equalizer.
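The delay-and-correlate synchronization performed in 'G3RX' can be sketched as follows. This is a generic cyclic-prefix correlator on a complex baseband signal, not the fixed-point FPGA design; the function name `cp_correlation_sync`, the CP length of 30 samples, and the test frame are assumptions for illustration.

```python
import numpy as np

def cp_correlation_sync(rx, n_fft, n_cp):
    """Estimate symbol start and fractional CFO from cyclic-prefix correlation."""
    # Product of each sample with the conjugate of the sample n_fft later.
    prod = rx[:-n_fft] * np.conj(rx[n_fft:])
    # Sliding sum over one CP length gives the correlation metric per offset.
    metric = np.convolve(prod, np.ones(n_cp), mode="valid")
    start = int(np.argmax(np.abs(metric)))
    # A CFO of eps subcarrier spacings rotates the phase by 2*pi*eps over n_fft samples.
    cfo = -np.angle(metric[start]) / (2 * np.pi)
    return start, cfo

# Hypothetical test frame: one QPSK OFDM symbol with CP, delayed and frequency-shifted.
rng = np.random.default_rng(0)
n_fft, n_cp, offset, eps = 256, 30, 40, 0.1
symbol = np.fft.ifft(rng.choice([1, -1], n_fft) + 1j * rng.choice([1, -1], n_fft))
frame = np.concatenate([np.zeros(offset), symbol[-n_cp:], symbol, np.zeros(offset)])
frame = frame * np.exp(2j * np.pi * eps * np.arange(frame.size) / n_fft)
frame = frame + 0.001 * (rng.normal(size=frame.size) + 1j * rng.normal(size=frame.size))
start, cfo = cp_correlation_sync(frame, n_fft, n_cp)
```

The phase-based estimate is unambiguous only for offsets within half a subcarrier spacing; larger offsets would need a separate coarse estimate.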
In parameterizing the AMP algorithm for G3-PLC signaling, $M$ and $N$ are chosen to be 128 and 256, respectively. $M$ is 128 rather than 64 (the number of null subcarriers used) because we implement the full complex version of AMP based on 256-length complex-valued FFTs/IFFTs, whereas G3-PLC uses a 256-length real-valued FFT. In our implementation, we use null subcarriers 1-22 and 59-100 and their negative-frequency pairs 234-255 and 156-197. The exponential function in Step 5 is approximated using a sixth-order Taylor series expansion. This was done to avoid dynamic range issues when rescaling the output of the Xilinx CORDIC computation, which requires that the inputs be normalized to [0, 1]. The $\rho_j$ calculation in Step 5 in Table II involves a 16-bit division 256 times per iteration. This operation was parallelized into two streams, resulting in an 8656-cycle reduction in execution time at the expense of a small increase in resource utilization.
Fig. 3 shows the system block diagram of the AMP-enhanced OFDM transceiver testbed. The testbed consists of two National Instruments PXIe-1082 chassis, one for the TX and one for the RX. Both chassis make use of an onboard PXIe-8133 1.73 GHz Quad-Core (Intel Core i7-820QM) PXI Express controller running a real-time operating system. These PXI controllers are targeted to the deployment of LabVIEW Real-Time (RT) applications, which can execute deterministic dataflow computations at a granularity on the order of 1 ms. Each of the two systems communicates via gigabit Ethernet back to a host PC for high-level, non-deterministic performance analysis and visualization. The two RT chassis are configured as follows:
A. Target Architecture and System Mapping
• TX chassis is fitted with a single NI PXIe-7965R FlexRIO FPGA Module named 'G3TX' featuring a Virtex-5 SX95T FPGA that is used for interleaving data/reference symbols, OFDM modulation (IFFT), appending the cyclic prefix (CP), and upsampling to the 10 MS/s NI-5781 digital-to-analog converter (DAC) sample rate. The host RT chassis generates random data symbols to feed the input symbol direct memory access (DMA) first-in, first-out (FIFO) buffer. Modulated and upconverted samples are clocked out of the FlexRIO and into the NI-5781 adapter module. These samples are then passed across two micro-coaxial (MCX) 50 Ω differential pairs to the RX NI-5781 (although the output of the quadrature component is always zero by virtue of the conjugate symmetry enforced in 'G3TX').
• RX chassis is fitted with two NI PXIe-7965Rs. The first FPGA module, 'G3RX', is configured for the front-end receiver processing, i.e., downsampling, synchronization, frequency-offset estimation, OFDM demodulation (FFT), CP removal, and noise injection. The second FPGA module, 'AMPEQ', is passed frequency-domain samples from 'G3RX' using a peer-to-peer (P2P) stream over the PXIe backplane. These samples, together with the parameters $\gamma_X$, $\gamma_B$, and $\pi$ passed from the RX host controller, are used as inputs to the AMP algorithm.
B. Synthesis Using LabVIEW DSP Design Module (LVDDM)
LVDDM is a high-level FPGA synthesis tool developed by National Instruments. LVDDM takes as input a high-level dataflow diagram of DSP computations (see Fig. 4) and translates it to a LabVIEW FPGA register-transfer level representation. This representation is then mapped to a Verilog hardware description language product that is input to the Xilinx Integrated Software Environment FPGA compiler. LVDDM provides direct interfaces to Xilinx CoreGen features, allowing for the implementation of highly optimized FPGA constructs including FFTs, discrete Fourier transforms, direct digital synthesizers, and multi-rate FIR filter implementations.
The project files including the MATLAB fixed-point sizing simulation, LabVIEW host virtual instruments (VIs), DSP Diagrams, and generated FPGA bitfiles are available for download on the project webpage: http://users.ece.utexas.edu/~bevans/papers/2013/fpgaReceiver/
C. Resource Utilization
The resource utilization and master clock (MCLK) of each of the three FPGAs used in the test system are shown in Table IV. Resource utilization is highest in 'AMPEQ', the FPGA with AMP processing and equalization. Overall resource usage could be reduced through further computation and buffer size optimization. Additionally, the use of real- rather than complex-valued FFTs could afford at least a 2x reduction in resource utilization. The complex form was used so that the algorithm could be applied to other projects using complex signaling.
To evaluate the real-time performance of this implementation, a bit-error-rate (BER) testbench was constructed in LabVIEW RT running on each embedded PXI controller. Equalized symbols, input impulsive noise, error vector magnitude (EVM), and trailing BER are plotted for both the AMP-enhanced and conventional OFDM receivers (see top of Fig. 4). Blocks of 6 OFDM symbols were processed at a time, though this number can be varied by setting a control in the transmit and receive VIs. Using this framework, a test array of input and output data bytes was transmitted across the OFDM link using synthesized Gaussian mixture noise input to the receiver. Received symbols were demodulated and compared for errors. Using the test environment, the receiver was configured with a fixed $\pi$ and $\gamma_X = 0.25$, and $\gamma_B$ was set to be a multiple of $\gamma_X$ listed in Fig. 5. Transmit gain was swept over a 20 dB range in order to profile BER vs. SNR. Here, SNR is the ratio of transmit power to background noise power, $\gamma_B$.
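The BER and EVM measurements described above can be sketched in Python as follows. This is not the LabVIEW RT testbench; the Gray bit mapping, the function names, and the additive Gaussian test noise are assumptions for illustration.

```python
import numpy as np

def qpsk_modulate(bits):
    """Map bit pairs to unit-energy QPSK symbols (Gray mapping assumed)."""
    b = np.asarray(bits).reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def qpsk_demodulate(symbols):
    """Hard-decision demapping: the sign of each component gives one bit."""
    return np.column_stack([symbols.real < 0, symbols.imag < 0]).astype(int).ravel()

def bit_error_rate(tx_bits, rx_bits):
    """Fraction of bit positions that differ."""
    return float(np.mean(np.asarray(tx_bits) != np.asarray(rx_bits)))

def evm_percent(reference, received):
    """RMS error vector magnitude relative to the RMS reference magnitude."""
    err = np.mean(np.abs(received - reference) ** 2)
    return 100.0 * float(np.sqrt(err / np.mean(np.abs(reference) ** 2)))

# Hypothetical equalized symbols: transmitted QPSK plus mild Gaussian noise.
rng = np.random.default_rng(0)
tx_bits = rng.integers(0, 2, 2000)
tx_syms = qpsk_modulate(tx_bits)
noise = 0.05 * (rng.normal(size=tx_syms.size) + 1j * rng.normal(size=tx_syms.size))
rx_syms = tx_syms + noise
ber = bit_error_rate(tx_bits, qpsk_demodulate(rx_syms))
evm = evm_percent(tx_syms, rx_syms)
```

EVM degrades smoothly with residual noise, while BER stays at zero until symbols cross a decision boundary, so the two metrics together show both margin and failures.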
The results of the BER analysis are shown in Fig. 5. As shown in the figure, by mitigating impulsive noise with AMP, up to 8 dB of SNR is recovered, and gains are realized over a wide SNR range. The AMP and conventional receivers have nearly identical BER curves in non-impulsive, or $\pi = 0$, noise. Injected noise was synthesized using parameters matched to the AMP input parameters; however, estimation techniques discussed in [11] could be used to learn the parameters adaptively using amortization over OFDM symbols. These results indicate that this algorithm could be targeted for implementation in other OFDM receivers beyond G3-PLC systems.
Fig. 5. BER plots of received symbols with and without the AMP algorithm for impulse-free noise and for impulsive noise with $\pi = 0.05$ (5% probability of impulse) using impulses 20 and 30 dB above the background level.
VII. CONCLUSION
