Fifth-generation (5G) cellular systems will build on massive multi-user (MU) multiple-input multiple-output (MIMO) technology to attain high spectral efficiency. However, having hundreds of antennas and radio-frequency (RF) chains at the base station (BS) entails prohibitively high hardware costs and power consumption. This paper proposes a novel nonlinear precoding algorithm for the massive MU-MIMO downlink in which each RF chain contains an 8-phase (3-bit) constant-modulus transmitter, enabling the use of low-cost and power-efficient analog hardware. We present a high-throughput VLSI architecture and show implementation results on a Xilinx Virtex-7 FPGA. Compared to a recently reported nonlinear precoder for BS designs that use two 1-bit digital-to-analog converters per RF chain, our design enables up to 3.75 dB transmit power reduction at no more than a 2.7× increase in FPGA resources.
I. INTRODUCTION
Fifth-generation (5G) cellular communication systems are widely expected to rely on massive multi-user (MU) multiple-input multiple-output (MIMO) technology to achieve significant improvements in spectral efficiency compared to existing small-scale MIMO systems [2]-[4]. Massive MU-MIMO equips the base station (BS) with hundreds of antennas and radio-frequency (RF) chains, enabling one to simultaneously serve tens of user equipments (UEs) in the same time-frequency resource via fine-grained beamforming. Unfortunately, scaling conventional multi-antenna BS architectures (that use high-precision RF chains) to BSs with hundreds of antenna elements entails a significant increase in system costs and circuit power consumption. Hence, to make massive MU-MIMO systems inexpensive and power-efficient, novel BS architectures and suitable baseband-processing algorithms are necessary.
The precoding algorithm and architecture proposed in this paper build upon the ones proposed in [1] for C2PO; in contrast to the results in [1], the present paper uses a modified architecture with a more sophisticated projection unit.
A MATLAB simulator for the precoder proposed in this paper is available on GitHub: https://github.com/quantizedmassivemimo/3bit CM precoding.
OC and CS were supported in part by Xilinx Inc. and by the US NSF under grants ECCS-1408006, CCF-1535897, CAREER CCF-1652065, and CNS-1717559. The work of SJ and GD was supported in part by the Swedish Foundation for Strategic Research under grant ID14-0022, and by the Swedish Governmental Agency for Innovation Systems (VINNOVA) within the competence center ChaseOn. SJ's research visit at Cornell was sponsored in part by Cornell's College of Engineering. TG was supported by the US NSF under grant CCF-1535902 and by the US ONR under grant N00014-15-1-2676.
Low-Precision BS Architectures: The use of low-precision digital-to-analog converters (DACs) at the BS in the massive MU-MIMO downlink enables significant reductions in system costs and circuit power consumption. The key challenge with such low-precision BS architectures is to maintain high spectral efficiency, which requires sophisticated baseband-processing algorithms. While linear precoders, e.g., maximal-ratio transmission (MRT) and zero-forcing (ZF), followed by quantization exhibit low complexity [5]-[7], sophisticated nonlinear precoders can achieve superior performance, especially for the extreme case of using only a pair of 1-bit DACs per RF chain [8]-[13]. Recently, reference [1] presented VLSI designs of nonlinear precoders for systems that use a pair of 1-bit DACs per RF chain, which demonstrates that nonlinear precoding is feasible in practice for such low-precision BS architectures.
The use of 1-bit DACs at the BS ensures that the precoded signal has constant modulus (CM), i.e., the precoded signal's amplitude is equal on all antennas and constant over time, which enables the use of low-cost and power-efficient analog circuitry, such as nonlinear power amplifiers. Recently, nonlinear precoders for 8-phase (3-bit) CM transmitters, i.e., the setup considered in this work, were proposed in [14], [15]. It remains, however, an open question whether the algorithms proposed in [14], [15] can be implemented efficiently in hardware.
Contributions: This paper develops a novel nonlinear precoding algorithm in which each RF chain contains an 8-phase (3-bit) CM transmitter that enables efficient analog circuitry while surpassing the error-rate performance of systems that use a pair of 1-bit DACs (i.e., 4 phases) per RF chain. We propose a nonconvex algorithm to solve the associated 8-phase (3-bit) CM precoding problem in an efficient manner, and we develop a VLSI architecture that uses a fast matrix-vector multiplication engine based on Cannon's algorithm [16] . We show Xilinx Virtex-7 FPGA implementation results and provide a comparison with the C2PO precoder proposed in [1] .
II. SYSTEM MODEL AND CM PRECODING

A. System Model
We consider the single-cell, narrowband massive MU-MIMO downlink system shown in Fig. 1. Here, the BS, which is equipped with $B$ antennas, serves $U \le B$ single-antenna UEs. The narrowband downlink channel is modeled by $y = Hx + n$, where $y = [y_1, \ldots, y_U]^T \in \mathbb{C}^U$ contains the received signals at all UEs, $H \in \mathbb{C}^{U \times B}$ is the channel matrix (which we assume is known to the BS), $n \in \mathbb{C}^U$ models i.i.d. circularly-symmetric complex Gaussian noise with variance $N_0$ per complex entry, and $x \in \mathcal{X}^B$ is the so-called precoded vector, where $\mathcal{X}$ is the transmit alphabet. In this work, we require that $\mathcal{X}$ has finite cardinality and that the entries of $\mathcal{X}$ have CM. Specifically, the CM alphabet is $\mathcal{X} = \{\exp(j2\pi p/P) \mid p = 0, \ldots, P-1\}$, where $P$ denotes the number of phases and $\log_2(P)$ the number of bits per RF chain. The CM constraint ensures that $\|x\|_2^2 = B$.
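The system model above can be illustrated with a minimal NumPy sketch. The helper names `cm_alphabet` and `downlink`, the i.i.d. Rayleigh channel draw, and the value of `N0` are assumptions made for illustration only; the vector `x` is an arbitrary CM vector, not the output of a precoder.

```python
import numpy as np

def cm_alphabet(P):
    """P-phase constant-modulus alphabet {exp(j*2*pi*p/P), p = 0..P-1}."""
    return np.exp(2j * np.pi * np.arange(P) / P)

def downlink(H, x, N0, rng):
    """Narrowband downlink y = Hx + n with CN(0, N0) noise per entry."""
    U = H.shape[0]
    n = np.sqrt(N0 / 2) * (rng.standard_normal(U) + 1j * rng.standard_normal(U))
    return H @ x + n

rng = np.random.default_rng(0)
B, U, P = 32, 16, 8                      # antennas, UEs, phases (3 bits)
X = cm_alphabet(P)

# Assumed channel: i.i.d. Rayleigh fading with unit-variance entries.
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)

x = rng.choice(X, size=B)                # arbitrary CM vector for illustration
y = downlink(H, x, N0=0.1, rng=rng)

print(np.linalg.norm(x) ** 2)            # squared norm equals B up to rounding
```

Because every entry of `x` has unit magnitude, the squared norm equals `B` regardless of which phases are chosen.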
B. Constant-Modulus (CM) Precoding
The precoder at the BS maps the symbol vector $s = [s_1, \ldots, s_U]^T$ into the precoded vector $x \in \mathcal{X}^B$. Here, $s_u \in \mathcal{O}$ is the constellation point intended for the $u$th UE ($u = 1, \ldots, U$), where $\mathcal{O}$ is the constellation set (e.g., 16-QAM). We assume that each UE $u = 1, \ldots, U$ rescales its received signal $y_u$ by a factor $\beta_u \in \mathbb{C}$ to compute an estimate $\hat{s}_u = \beta_u y_u$ of the transmitted symbol $s_u$. Nonlinear precoders that minimize the mean-squared error (MSE) between the transmitted and the estimated symbols solve the following optimal precoding problem (OPP) [1]:
$$\{\hat{x}, \hat{\beta}\} = \arg\min_{x \in \mathcal{X}^B,\ \beta \in \mathbb{C}} \ \|s - \beta Hx\|_2^2 + |\beta|^2 U N_0. \tag{OPP}$$
Here, we assume that $\beta = \beta_u$ for $u = 1, \ldots, U$; as shown in [9], the UEs are able to accurately learn $\hat{\beta}$. For systems that use a pair of 1-bit DACs per RF chain ($P = 4$ phases), methods that solve (OPP) approximately using convex [8], [9] and nonconvex [1] relaxation have been proposed recently. In what follows, we present a novel precoder specifically designed for CM transmitters with 3 bits per RF chain ($P = 8$ phases), which enables significant error-rate performance improvements compared to systems with 2 bits per RF chain ($P = 4$ phases), without requiring complex RF circuitry.
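To make the MSE criterion concrete, the sketch below evaluates the objective $\|s - \beta Hx\|_2^2 + |\beta|^2 U N_0$, which follows from $\hat{s} = \beta(Hx + n)$ with noise variance $N_0$ per entry, and computes the scaling factor that minimizes it for a fixed $x$. The closed form $\beta = (Hx)^H s / (\|Hx\|_2^2 + U N_0)$ is obtained by setting the derivative with respect to $\beta^*$ to zero; the function names and the QPSK test symbols are assumptions for illustration.

```python
import numpy as np

def opp_objective(s, H, x, beta, N0):
    """MSE objective ||s - beta*H*x||^2 + |beta|^2 * U * N0."""
    U = H.shape[0]
    return np.linalg.norm(s - beta * (H @ x)) ** 2 + abs(beta) ** 2 * U * N0

def best_beta(s, H, x, N0):
    """Minimizer of the objective over beta for a fixed precoded vector x."""
    v = H @ x
    # np.vdot conjugates its first argument, so this is v^H s.
    return np.vdot(v, s) / (np.linalg.norm(v) ** 2 + H.shape[0] * N0)

rng = np.random.default_rng(1)
U, B, N0 = 4, 8, 0.05
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
s = rng.choice(qpsk, size=U)
x = np.exp(2j * np.pi * rng.integers(0, 8, size=B) / 8)  # arbitrary 8-phase CM vector

beta = best_beta(s, H, x, N0)
base = opp_objective(s, H, x, beta, N0)
```

Since the objective is a convex quadratic in $\beta$, any perturbation of `beta` can only increase `base`; the hard part of (OPP) is the search over the discrete set of precoded vectors.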
III. C3PO: CONSTANT-MODULUS 3-BIT PRECODING
A. Relaxing the Problem (OPP)
To find an approximate solution to (OPP) via methods that can be implemented efficiently, we perform the following approximations. First, we let $N_0 \to 0$, i.e., we assume that the system operates in the high-SNR regime. Then, we use the approximation of [1, Eq. …]. These two approximations result in the following problem:
$$\{\hat{x}, \hat{\alpha}\} = \arg\min_{x \in \mathcal{X}^B,\ \alpha \in \mathbb{C}} \ \|\alpha s - Hx\|_2^2. \tag{OPP$^*$}$$
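We next compute $\hat{\alpha}$ by minimizing the objective function of (OPP$^*$) over $\alpha$, which results in $\hat{\alpha}(x) = s^H Hx/\|s\|_2^2$. Substituting $\hat{\alpha}$ in (OPP$^*$) yields $\|\hat{\alpha}(x)s - Hx\|_2^2 = \|Ax\|_2^2$ with $A = (I_U - ss^H/\|s\|_2^2)H$, which leads to
$$\hat{x} = \arg\min_{x \in \mathcal{X}^B} \ \tfrac{1}{2}\|Ax\|_2^2. \tag{OPP$^{**}$}$$
The factor 1/2 does not affect the solution of (OPP$^{**}$). We now replace the finite-phase constraint $x \in \mathcal{X}^B$ by the convex polytope surrounding the points $\mathcal{X} = \{x_p\}_{p=1}^{P}$ given by
$$\mathcal{B} = \Big\{ \textstyle\sum_{p=1}^{P} \lambda_p x_p \ \Big|\ \lambda_p \ge 0 \ \forall p,\ \sum_{p=1}^{P} \lambda_p = 1 \Big\}.$$
For 3-bit CM precoding, the boundary of the convex polytope $\mathcal{B}$ is a regular octagon (see Fig. 2). Unfortunately, solving (OPP$^{**}$) over the relaxed set $x \in \mathcal{B}^B$ yields the all-zeros vector. We therefore attempt to solve the following modified problem via forward-backward splitting (FBS) [17]-[19]:
$$\hat{x} = \arg\min_{x \in \mathcal{B}^B} \ \tfrac{1}{2}\|Ax\|_2^2 - \tfrac{\delta}{2}\|x\|_2^2, \tag{1}$$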
where the concave regularizer $-\frac{\delta}{2}\|x\|_2^2$ with $\delta > 0$ forces the solution $\hat{x}$ to lie at the boundary of the convex polytope $\mathcal{B}^B$. As the problem in (1) is nonconvex, FBS is not guaranteed to converge to an optimal solution. Nevertheless, the proposed algorithm exhibits good empirical performance (see Sec. IV).
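The elimination of the scaling factor can be checked numerically. The sketch below assumes the projection form $A = (I_U - ss^H/\|s\|_2^2)H$, which is what substituting $\hat{\alpha} = s^H Hx/\|s\|_2^2$ into $\|\alpha s - Hx\|_2^2$ produces; the random test data are illustrative. It also shows why relaxing the constraint to the polytope collapses the solution to zero: shrinking $x$ always shrinks $\|Ax\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
U, B = 4, 8
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
s = (rng.standard_normal(U) + 1j * rng.standard_normal(U)) / np.sqrt(2)
x = np.exp(2j * np.pi * rng.integers(0, 8, size=B) / 8)   # arbitrary CM vector

# Optimal alpha for fixed x: s^H H x / ||s||^2 (np.vdot conjugates s).
alpha = np.vdot(s, H @ x) / np.linalg.norm(s) ** 2

# Projector onto the orthogonal complement of s, applied to H.
A = (np.eye(U) - np.outer(s, s.conj()) / np.linalg.norm(s) ** 2) @ H

lhs = np.linalg.norm(alpha * s - H @ x) ** 2
rhs = np.linalg.norm(A @ x) ** 2

# Scaling x toward the origin stays inside the polytope and lowers the cost,
# so the relaxed problem without the regularizer is minimized by x = 0.
shrunk = np.linalg.norm(A @ (0.5 * x)) ** 2
```

The concave term $-\frac{\delta}{2}\|x\|_2^2$ in (1) counteracts this shrinkage by rewarding large-magnitude entries.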
B. The C3PO Algorithm
FBS is an efficient numerical method to solve convex optimization problems whose objective function can be decomposed as $f(x) + g(x)$, where the function $f$ is smooth and convex, and the function $g$ is convex but not necessarily smooth or bounded. FBS consists of the following iteration [17], [18]:
$$z^{(t+1)} = x^{(t)} - \tau^{(t)} \nabla f(x^{(t)}), \qquad x^{(t+1)} = \mathrm{prox}_g(z^{(t+1)}; \tau^{(t)}),$$
for $t = 1, 2, \ldots, t_{\max}$ or until convergence. Here, the sequence $\{\tau^{(t)} > 0\}$ contains suitably chosen step-size parameters, $\nabla f(x)$ is the gradient of the smooth function $f$, and the so-called proximal operator for the function $g$ is defined by [20]
$$\mathrm{prox}_g(z; \tau) = \arg\min_{x} \ \big\{ \tau g(x) + \tfrac{1}{2}\|x - z\|_2^2 \big\}.$$
978-1-5386-4881-0/18/$31.00 ©2018 IEEE
To approximately solve (1) using FBS, we set
$$f(x) = \tfrac{1}{2}\|Ax\|_2^2 \quad \text{and} \quad g(x) = \chi(x \in \mathcal{B}^B) - \tfrac{\delta}{2}\|x\|_2^2,$$
where $\chi$ is a characteristic function that is zero if $x \in \mathcal{B}^B$ and infinity otherwise. For these choices, the gradient is given by $\nabla f(x) = A^H A x$ and the proximal operator is detailed in Sec. III-C. Furthermore, we use a constant step size $\tau = \tau^{(t)}$. The resulting algorithm is as follows:
Algorithm 1 (C3PO). Initialize $x^{(1)} = H^H s$ and fix the parameters $\delta$ and $\tau$ so that $\tau\delta < 1$. Then, for every iteration $t = 1, 2, \ldots, t_{\max}$ compute:
$$z^{(t+1)} = x^{(t)} - \tau A^H A x^{(t)} \tag{2}$$
$$x^{(t+1)} = \mathrm{prox}_g(z^{(t+1)}; \tau). \tag{3}$$
The $\mathrm{prox}_g$ operator is applied element-wise to $z^{(t+1)}$ and detailed in Sec. III-C. In the last iteration $t_{\max}$, the output $x^{(t_{\max}+1)}$ is quantized to the 3-bit CM alphabet $\mathcal{X}^B$.
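Algorithm 1 can be sketched in floating-point NumPy as follows. This is an illustrative sketch, not the authors' MATLAB simulator or hardware design. It assumes $A = (I_U - ss^H/\|s\|_2^2)H$, a proximal step of the form $\mathrm{proj}(z/(1-\tau\delta))$, and a generic projection onto a regular $P$-gon that rotates each entry into the sector of its nearest edge; the step size `tau` and regularizer `delta` in the demo are arbitrary choices that merely satisfy $\tau\delta < 1$.

```python
import numpy as np

def proj_polygon(z, P=8):
    """Element-wise projection onto the regular P-gon with vertices exp(j*2*pi*p/P)."""
    z = np.asarray(z, dtype=complex)
    sector = 2 * np.pi / P
    k = np.floor(np.mod(np.angle(z), 2 * np.pi) / sector)
    rot = np.exp(1j * (k + 0.5) * sector)      # outward normal of the facing edge
    w = z * rot.conj()                         # edge now lies on Re(w) = cos(pi/P)
    r, h = np.cos(np.pi / P), np.sin(np.pi / P)
    outside = w.real > r
    w = np.where(outside, r + 1j * np.clip(w.imag, -h, h), w)
    return w * rot

def c3po(s, H, tau, delta, t_max, P=8):
    """Floating-point sketch of the FBS iteration followed by phase quantization."""
    assert tau * delta < 1
    U = H.shape[0]
    A = (np.eye(U) - np.outer(s, s.conj()) / np.linalg.norm(s) ** 2) @ H
    AhA = A.conj().T @ A
    x = H.conj().T @ s                                   # initialization x = H^H s
    for _ in range(t_max):
        z = x - tau * (AhA @ x)                          # gradient step
        x = proj_polygon(z / (1 - tau * delta), P)       # proximal step
    p = np.round(np.angle(x) / (2 * np.pi / P)) % P      # nearest phase index
    return np.exp(2j * np.pi * p / P)

rng = np.random.default_rng(3)
U, B = 16, 64
H = (rng.standard_normal((U, B)) + 1j * rng.standard_normal((U, B))) / np.sqrt(2)
s = rng.choice(np.array([1.0, -1.0]), size=U).astype(complex)   # BPSK symbols
tau = 1.0 / np.linalg.norm(H, 2) ** 2    # assumed step size
delta = 0.5 / tau                        # assumed regularizer, tau*delta = 0.5
x_hat = c3po(s, H, tau, delta, t_max=9)
```

The number of iterations `t_max=9` matches the setting used for the simulations in Sec. IV; all other demo parameters are placeholders and would need tuning to reproduce the reported error rates.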
The most costly operation of C3PO is the matrix-vector product in step (2), which we compute as: (2) is rewritten as follows:
C. Proximal Operator for 3-Bit CM Precoding
The proximal operator in (3) reduces to $\mathrm{prox}_g(z; \tau) = \mathrm{proj}\big(\tfrac{1}{1-\tau\delta} z\big)$, where $\mathrm{proj}(\cdot)$ projects each element of the argument to the closest point in the polytope $\mathcal{B}$. For 3-bit CM precoding, the polytope is a regular octagon. Projecting a scalar $z \in \mathbb{C}$ onto an octagon is nontrivial, so we focus on the first quadrant of the complex plane (see Fig. 2). If $z$ is inside the octagon (in region A), then it remains there; if $z$ is in the regions B, C, or D, then it will be mapped to $j$, $\tfrac{1}{\sqrt{2}}(1 + j)$, or $1$, respectively; if $z$ is in the regions E or F, then it will be mapped to the closest point on the lines $\ell_1$ or $\ell_2$, respectively.
To determine in which of the six regions A-F the argument is located, we use the equations for the lines that separate them:
The equations for the lines $\ell_2$, $\ell_5$, and $\ell_6$ are identical to the ones of $\ell_1$, $\ell_4$, and $\ell_3$, but with $\Re(z)$ and $\Im(z)$ exchanged. Using these equations, we can project $z$ onto the set $\mathcal{B}$.
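The quadrant-folding idea can be sketched as follows. This is not the region-by-region comparison of Fig. 2: instead of testing against the six separating lines, the sketch folds $z$ into the first quadrant via absolute values, folds once more across the diagonal (exploiting the exchange of real and imaginary parts noted above), and then projects onto the single edge between $1$ and $\tfrac{1}{\sqrt{2}}(1+j)$ by clipping a line parameter. The function name `proj_octagon` is an assumption.

```python
import numpy as np

def proj_octagon(z):
    """Element-wise projection onto the regular octagon with vertices exp(j*pi*k/4)."""
    z = np.asarray(z, dtype=complex)
    a, b = np.abs(z.real), np.abs(z.imag)          # fold into the first quadrant
    swap = b > a                                   # fold across the diagonal
    a, b = np.where(swap, b, a), np.where(swap, a, b)

    c = 1 / np.sqrt(2)
    dx, dy = c - 1.0, c                            # edge direction from 1 to (1+j)/sqrt(2)
    inside = a * np.cos(np.pi / 8) + b * np.sin(np.pi / 8) <= np.cos(np.pi / 8)
    # Position along the edge: t = 0 is the vertex 1, t = 1 is (1+j)/sqrt(2).
    t = np.clip(((a - 1.0) * dx + b * dy) / (dx * dx + dy * dy), 0.0, 1.0)
    pa = np.where(inside, a, 1.0 + t * dx)
    pb = np.where(inside, b, t * dy)

    pa, pb = np.where(swap, pb, pa), np.where(swap, pa, pb)   # undo diagonal fold
    return np.copysign(pa, z.real) + 1j * np.copysign(pb, z.imag)

# Midpoint of the edge between j and (1+j)/sqrt(2), pushed outward along its normal.
mid = (1j + (1 + 1j) / np.sqrt(2)) / 2
outside_pt = mid + 0.5 * np.exp(1j * 3 * np.pi / 8)
pts = np.array([0.2 - 0.1j, 2.0, 2.0j, 2.0 + 2.0j, -3.0, outside_pt])
out = proj_octagon(pts)
```

The slope of the folded edge is $(1/\sqrt{2})/(1/\sqrt{2} - 1) = 1/(1-\sqrt{2})$, which is consistent with the constants $1 - \sqrt{2}$ and its reciprocal mentioned for the hardware projection unit in Sec. IV-B.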
IV. VLSI ARCHITECTURE AND IMPLEMENTATION RESULTS
A. Architecture Overview
The proposed VLSI architecture is shown in Fig. 3 and builds upon the one of C2PO in [1], which was designed for 2-bit CM precoding. As in [1], we assume that $B$ is a multiple of $U$, so the architecture consists of $B/U$ linear arrays, each containing $U + 1$ processing elements (PEs). Each linear array operates on a $(U + 1) \times U$ sub-matrix of H and on a $U$-dimensional subvector of $x^{(t)}$. The architecture computes step (2) simplified as in (4) via two separate matrix-vector products using Cannon's algorithm [16]. We first compute $w = H(\tau x^{(t)})$ by cyclically exchanging the entries of $\tau x^{(t)}$ between the PEs of the same array. We then compute $z^{(t+1)} = x^{(t)} - H \Upsilon w$ by cyclically exchanging the accumulated results of the PEs within the same array. Finally, the vector $z^{(t+1)}$ is fed to a projection unit implementing step (3), thus completing one C3PO iteration. The proposed architecture requires $2U + \log_2(B/U) + 9$ clock cycles for one C3PO iteration. See [1] for more architecture details.
Fig. 3. High-level block diagram of the VLSI architecture for C3PO. We use $B/U$ linear arrays, each consisting of $U + 1$ processing elements (PEs).
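The cyclic data exchange within one linear array can be illustrated with a small software model. The sketch below is a simplification and not the actual architecture: it models a square $U \times U$ block with $U$ PEs (the design above uses $U + 1$ PEs per array), where PE $u$ stores row $u$ of the block and the vector entries rotate by one position per clock cycle. The function names are assumptions; the cycle count uses the formula stated above.

```python
import math
import numpy as np

def cyclic_matvec(Hblk, x):
    """Matrix-vector product in which PE u holds row u and x rotates between PEs."""
    U = Hblk.shape[0]
    rows = np.arange(U)
    acc = np.zeros(U, dtype=complex)   # one accumulator (MAC register) per PE
    held = np.array(x, dtype=complex)  # entry of x currently held by each PE
    idx = np.arange(U)                 # column index of the held entry
    for _ in range(U):                 # one multiply-accumulate per cycle per PE
        acc += Hblk[rows, idx] * held
        held = np.roll(held, -1)       # PE u receives the entry of PE u+1
        idx = np.roll(idx, -1)
    return acc

def cycles_per_iteration(B, U):
    """Clock cycles for one C3PO iteration: 2U + log2(B/U) + 9."""
    return 2 * U + int(math.log2(B // U)) + 9

rng = np.random.default_rng(4)
U = 16
Hblk = rng.standard_normal((U, U)) + 1j * rng.standard_normal((U, U))
xblk = rng.standard_normal(U) + 1j * rng.standard_normal(U)
w = cyclic_matvec(Hblk, xblk)

print(cycles_per_iteration(256, 16))   # 2*16 + 4 + 9 = 45
```

After $U$ cycles every PE has seen every vector entry exactly once, so no PE needs access to more than one stored row, which is what keeps the per-PE memory small.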
Each PE is equipped with (i) an $h_u$ memory storing the $u$th row of the corresponding sub-matrix taken from H; (ii) a complex-valued multiply-accumulate (MAC) unit; and (iii) a projection unit. See [1] for details on (i) and (ii); part (iii), the projection unit, is more complicated than that of C2PO. Specifically, this unit maps the entries of $z^{(t+1)}$ to the first quadrant of the complex plane and performs comparisons based on the line equations $\ell_1$-$\ell_6$ (see Sec. III-C) in order to project $z^{(t+1)}$ onto $\mathcal{B}^B$.
B. Fixed-Point Parameters
The entries of $x^{(t)}$ use 14-bit signed values with 8 fraction bits. The entries of $\tau x^{(t)}$ use 14-bit signed values with 13 fraction bits. The entries of H use 11-bit signed values with 8 fraction bits and are stored in look-up tables (LUTs) used as distributed RAM. The complex-valued MAC units use 18-bit signed values with 15 fraction bits when computing $w$; 11 fraction bits are used when calculating $z^{(t+1)}$. The adder tree uses 21 bits with 15 fraction bits. The projection unit represents the constants (e.g., $1 - \sqrt{2}$ and its reciprocal) using signed values with 4-5 bits, so no multipliers are used in the operations related to lines $\ell_1$-$\ell_6$. A total of 30 adders and subtractors are used within each projection unit; these components operate on signed numbers with 7 fraction bits, and their total bit-width is 14 or 15 bits, depending on the quantity.
C. Error-Rate Performance
Fig. 4(a) and Fig. 4(b) show the uncoded bit-error rate (BER) as a function of the normalized transmit power $B/N_0$ for different precoding algorithms and $U = 16$ UEs. Fig. 4(a) shows the BER for $B = 32$ BS antennas and BPSK; Fig. 4(b), for $B = 256$ BS antennas and 16-QAM. The simulation results are for 10,000 Monte-Carlo trials and i.i.d. Rayleigh fading channels. Both C2PO and C3PO run with $t_{\max} = 9$. For reference, we show the BERs with 3-bit CM MRT-quantized (MRT-Q) and ZF-quantized (ZF-Q) precoding, as well as the BERs with MRT ("Inf. prec. MRT") and ZF precoding ("Inf. prec. ZF") with infinite-precision DACs. We see from Fig. 4(a) and Fig. 4(b) that the nonlinear precoders (C2PO and C3PO) significantly outperform MRT-Q and ZF-Q at high normalized transmit power. Furthermore, compared to C2PO, we note that C3PO enables a 3.75 dB gain (in terms of normalized transmit power) at 1% uncoded BER for $B = 32$ and BPSK, and 1.75 dB for $B = 256$ and 16-QAM. Finally, we note that the implementation loss of our hardware designs (shown with blue markers) is negligible, i.e., less than 0.15 dB at 1% uncoded BER.
D. FPGA Implementation Results and Comparison
Table I shows FPGA implementation results for 2-bit CM MRT-Q [1], C2PO [1], and C3PO. All designs were developed using Verilog and implemented using the Xilinx Vivado Design Suite for a Xilinx Virtex-7 XC7VX690T FPGA. The designs support $U = 16$ UEs and were implemented for $B = \{32, 64, 128, 256\}$. Table I reveals that the resources of all designs increase roughly linearly with $B$. MRT-Q achieves the highest throughput thanks to its simplicity, which comes at the cost of poor uncoded BER performance. C2PO uses ∼1.4× more LUTs than MRT-Q and has a higher latency and a longer critical path. Compared to C2PO, C3PO consumes ∼2.6× the number of slices and LUTs, ∼2× the number of flip-flops, and the same number of DSP48s. This difference is caused by the 3-bit CM projection unit, whose pipeline registers also increase the latency. However, C3PO can significantly outperform C2PO in terms of BER (cf. Fig. 4(a) and Fig. 4(b)).
E. Performance/Complexity Tradeoffs
Fig. 4(c) shows the performance-complexity tradeoffs of C2PO and C3PO: the complexity is represented by the minimum normalized transmit power that is required to achieve 1% uncoded BER for BPSK; the performance, by the throughput. The tradeoffs show systems with BPSK, $U = 16$ UEs, and $B = \{32, 64, 128, 256\}$ BS antennas. As a reference, the minimum transmit power required for infinite-precision ZF precoding to achieve 1% uncoded BER is shown as a vertical line. We see from Fig. 4(c) that, while C2PO is able to achieve higher throughput than C3PO, C3PO requires lower transmit power to achieve 1% uncoded BER. This difference increases for small array sizes: for a system with $B = 32$, 4 iterations of C3PO achieve 1% uncoded BER at a normalized transmit power of 8 dB, while C2PO is unable to achieve 1% uncoded BER at this power level.
V. CONCLUSIONS
We have proposed a nonlinear precoder for 8-phase (3-bit) CM transmission, C3PO, which builds upon the 4-phase C2PO precoder [1] . By using a different projection unit and no more than 2.7× higher FPGA resources, C3PO achieves up to 3.75 dB transmit power reduction, and thus, low uncoded BERs in scenarios for which C2PO exhibits poor error-rate performance.
