Channel estimation errors have a critical impact on the reliability of wireless communication systems. While virtually all existing wireless receivers separate channel estimation from data detection, it is well known that joint channel estimation and data detection (JED) significantly outperforms conventional methods at the cost of high computational complexity. In this paper, we propose a novel JED algorithm and corresponding VLSI designs for large single-input multiple-output (SIMO) wireless systems that use constant-modulus constellations. The proposed algorithm is referred to as PRojection Onto conveX hull (PrOX) and relies on biconvex relaxation (BCR), which enables us to efficiently compute an approximate solution of the maximum-likelihood JED problem. Since BCR solves a biconvex problem via alternating optimization, we provide a theoretical convergence analysis for PrOX. We design a scalable, highthroughput VLSI architecture that uses a linear array of processing elements to minimize hardware complexity. We develop corresponding field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) designs, and we demonstrate that PrOX significantly outperforms the only other existing JED design in terms of throughput, hardware-efficiency, and energy-efficiency.
• We theoretically analyze the convergence of PrOX, which solves a biconvex problem via alternating optimization. • We introduce an approximation that significantly reduces the preprocessing complexity of PrOX at virtually no loss in terms of error-rate performance. • We provide simulation results that showcase the robustness of PrOX to a broad range of system parameters. • We develop a scalable VLSI architecture for PrOX that uses a linear array of processing elements (PEs) to achieve high-throughput at low hardware complexity. • We show reference FPGA and ASIC implementation results, and compare our designs to that of the recentlyreported JED implementations in [17] . Our results demonstrate that PrOX provides near-optimal JED at significantly lower complexity than existing reference designs.
C. Related Relevant Results
While a considerable number of algorithms and VLSI designs have been proposed for small-and large-scale multiantenna wireless systems that separate channel estimation and data detection (see, e.g., [8] , [17] , [19] and the references therein), only a few results have been proposed for JED. Sphere-decoding (SD) algorithms have been proposed to perform exact maximum-likelihood (ML) JED in SIMO and MIMO systems that use a small number of time slots [9] - [12] , [20] , [21] . Unfortunately, the complexity of SD methods quickly becomes prohibitive for larger dimensional problems [22] , [23] and approximate linear methods, which are widely used for coherent data detection in massive MIMO systems [8] , cannot be used for JED (see Section II-B for the reasons). Very recently, a handful of approximate JED algorithms have been proposed for large wireless systems [17] , [24] - [26] . To the best of our knowledge, reference [17] describes the only VLSI design of a JED algorithm reported in the open literature. While this design achieves near-ML-JED performance, the algorithm relies on semidefinite relaxation (SDR), which lifts the dimensionality of the problem to the square of the number of time slots causing high hardware complexity. In contrast to all these results, PrOX avoids lifting while achieving near-ML-JED performance and can be implemented efficiently in VLSI, even for scenarios that operate with a large number of time slots.
Another line of related work investigates data detection algorithms that are robust to imperfect CSI [27] , [28] . The idea is to statistically model channel estimation errors and to solve suitably adapted data detection problems. The resulting algorithms, however, (i) do not achieve the error-rate performance of JED as they are not taking into account all the received information (i.e., from training and data symbols received over multiple time slots) and (ii) do not improve the channel estimate itself-the latter is critical in TDD systems that perform beamforming in the downlink using acquired CSI in the uplink [2] - [5] . In contrast, we propose JED algorithms that (i) approach optimal error-rate performance as they jointly process all received information and (ii) generate improved channel estimates that can be used for beamforming.
D. Notation
Lowercase and uppercase boldface letters stand for column vectors and matrices, respectively. For the matrix A, the Hermitian is A H and the kth row and th column entry is A k, . For the vector a, the kth entry is a k . The Euclidean norm of a, the Frobenius norm, and the spectral norm of A are denoted by a 2 , A F , and A , respectively. The real and imaginary parts of the vector a are (a) and (a), respectively.
E. Paper Outline
The rest of the paper is organized as follows. Section II describes the system model as well as ML-optimal JED. Section III introduces the PrOX algorithm and provides a theoretical convergence analysis. Section IV details the VLSI architecture for PrOX. Section V shows error-rate performance and VLSI implementation results, and compares PrOX to existing FPGA and ASIC designs. We conclude in Section VI.
II. SYSTEM MODEL AND ML-OPTIMAL JED
We now introduce the considered SIMO system model and develop the associated ML-JED problem.
A. System Model
We study a (potentially large) SIMO wireless uplink system in which a single-antenna user transmits data over K + 1 time slots to a BS with B antennas. We consider the standard blockfading, narrow-band 1 channel model with the following inputoutput relation [9] - [11] , [20] :
Here, the matrix Y ∈ C B× (K +1) contains the B-dimensional receive vectors for the K + 1 time slots, i.e., Y = [y 1 , y 2 , . . . , y K +1 ] with y k ∈ C B and k = 1, 2, . . . , K + 1, h ∈ C B is the unknown SIMO channel vector (assumed to remain constant over the K +1 time slots), s ∈ O K +1 contains the transmitted data symbols for all K + 1 time slots, and N ∈ C B× (K +1) models i.i.d. complex circularly-symmetric Gaussian noise with variance N 0 per entry. In the remainder of the paper, we assume constant-modulus constellations, i.e., |s| = σ for all s ∈ O and a fixed σ > 0. Remark 1: The SIMO case is relevant in many rural deployment scenarios in which only a few users are active at a time [30] . The assumption of having constant-modulus constellations limits our results to low-rate modulation schemes, such as BPSK and PSK constellations. While the (multiuser) MIMO case and higher-order QAM modulation schemes would be of interest in more general deployment scenarios, the associated ML-JED problem is significantly more challenging to solve, and requires different, more complex, and complicated algorithms; see, e.g., [12] , [21] for more details-we are planning to address this scenario in the future.
B. Joint Channel Estimation and Data Detection (JED)
Let h be a deterministic-but unknown-channel vector with unknown prior statistics. Then, one can formulate the following ML-JED problem [11] :
This problem aims at simultaneously finding an estimate for the transmitted data vector s and the channel vector h. We emphasize that there is a phase ambiguity between both output vectors of ML-JED in (2) . Specifically, ifŝ JED e j φ ∈ O K +1 for φ ∈ [0, 2π), thenĥe j φ is also a valid solution to (2) . In order to avoid this phase ambiguity, one can either transmit information as phase differences among the entries in the vector s, which is known as differential signaling [24] , or fix the phase of at least one entry in the transmit vector s. As in [17] , we set the first entry of the transmit vector to a fixed constellation pointš ∈ O and exploit this knowledge at the receiver. Remark 2: While this approach resembles that of conventional, pilot-based data transmission, we emphasize that ML-JED is fundamentally different. In traditional, pilot-based data transmission, a small number of known training symbols are used to generate channel estimates, which are then used during the detection of the data symbols. In contrast, ML-JED uses all received symbols, i.e., pilots and data symbols, to improve the quality of CSI. We also note that ML-JED as in (2) jointly solves for the transmitted data vector s and the channel vector h; this is in contrast to JED methods that alternate between channel estimation and data detection (see, e.g., [13] - [16] ) and for which ML-JED optimality can, in general, not be guaranteed.
Since we assumed the entries in s to be constant-modulus, the ML-JED problem in (2) can be rewritten in the following, more compact form [11] :
and the associated channel estimate is given byĥ = Yŝ JED / ŝ JED 2 2 . We note that the optimization problem in (3) resembles the famous MaxCut problem that is known to be NP-hard [31] . Indeed, general problems of the form (3) are known to be NP-hard with respect to randomized reductions, even when only approximate solutions are sought [32] . Nevertheless, for a small number of time slots K + 1, the problem in (3) can be solved exactly and at low average complexity using sphere decoding (SD) methods [11] . For a large number of time slots, however, SD is known to entail prohibitively high complexity [33] . Furthermore, linear approximations are useless for JED, since the entries of s would grow without bound when relaxing the finite-alphabet constraint s ∈ O K +1 to the complex numbers s ∈ C K +1 . As a consequence, the design of non-linear and low-complexity algorithms is necessary to enable JED for practical SIMO systems that use a large number of time slots.
III. PROX: PROJECTION ONTO CONVEX HULL
We now develop a novel algorithm to approximately solve the ML-JED problem in (3) using biconvex relaxation (BCR) [18] , a recent framework to solve large semidefinite programs in computer vision. The resulting algorithm is referred to as PRojection Onto conveX hull (PrOX), requires low computational complexity, and achieves near-ML-JED performance.
A. Biconvex Relaxation (BCR) of the ML-JED Problem
We start from (3) and include a regularization term that forces the transmit vector s ∈ O K +1 to be close to a copy q ∈ C K +1 that is relaxed to the complex numbers as follows:
where α > 0 is a suitably chosen regularization parameter.
To ensure that the problem in (4) is convex in the vector q, the parameter α must be larger than the maximum eigenvalue of the Gram matrix G = Y H Y, i.e., α > Y H Y . We next relax the finite-alphabet constraint to the convex hull C O of the set O given by [34] 
with s i , i = 1, . . . , |O|. For example, the convex hull for BPSK constellations C BPSK is the line along the real axis in [−1, +1]; the convex hull C QPSK for QPSK is the square with the four constellation points as corners. By relaxing O to the convex hull C O , we arrive at the following relaxed ML-JED problem:
For the above problem, it is likely to obtain a solution that is inside the convex hull. We are, however, interested in finding solutions that lie at the boundary of the convex hull C O as the original constellation points in O lie at the boundary. To this end, we include a norm constraint that promotes large values in s, with the goal of pushing the entries of s towards the boundary of the convex hull. This leads to the final BCR formulation of the ML-JED problem:
Here, β is a suitably chosen algorithm parameter that satisfies β < α, which ensures that the optimization problem is biconvex in the vectors s and q. In words, for a fixed vector q, the problem in (5) is convex in s, and vice versa.
B. Alternating Optimization
We solve the BCR problem in (5) using alternating minimization, i.e., we keep one of the vectors fixed while solving for the other. The resulting iterative procedure is given by
where t = 1, 2, . . . , t max is the iteration counter. We initialize the above algorithm with s (0) =š(G 1,1 ) −1 g c 1 , where g c 1 is the first column of G, andš ∈ O is the fixed symbol transmitted in the first time slot and known at the receiver.
Since the problem in (5) is biconvex in s and q, both of the above steps are convex and can be solved optimally. Furthermore, each of these optimization problems turns out to have a closed form solution. Concretely, we obtain the following two-step PrOX algorithm:
where θ = α α−β > 0, I is the identity matrix, and the proximal operator [35] is defined as
This proximal operator is the element-wise orthogonal projection of the vector v = θ q (t ) onto the convex hull C O around the constellation set O. For BPSK, the proximal operator in (10) is
, the projection onto the real line in [−1, +1]; for QPSK, (10) corresponds to applying the same operation, independently, to both
In the case of (9), we have v = θ q (t ) with θ > 0 and hence, the proximal operator in (10) projects a scaled version of q (t ) onto the convex hull of the constellation set C O -this is why we dub our algorithm PRojection Onto conveX hull (PrOX).
C. Convergence Theory of PrOX
We now show that the PrOX algorithm is well behaved, even though it solves a biconvex problem using alternating optimization. More specifically, we begin by establishing that the iterates q (t ) and s (t ) generated by PrOX converge to stationary points of the objective function in (5) , provided that the algorithm parameters α and β are chosen appropriately. We then show that these stationary points lie on the boundary of the convex hull C K +1 O , which implies that our biconvex relaxation is reasonably tight. In what follows, we denote the objective function of PrOX in (5) as
Note that the factors of 1 2 do not affect the solution of PrOX. At an optimal solution, the sub-gradients of f with respect to both s and q should be zero. Because s (t ) is computed by minimizing f with respect to s (t ) , we already know that the sub-gradient of f with respect to s (t ) contains zero. We can thus measure the optimality of the iterates by examining only ∇ q f, the gradient of f with respect to q. We now show that this gradient vanishes for a large number of iterations t; the proof is given in Appendix A.
Theorem 1: Suppose that the parameters α and β satisfy α > G and α > β. Then, the PrOX algorithm in (8) and (9) converges in the gradient sense, i.e.,
Furthermore, any limit point of the sequence of iterates is a stationary point.
The PrOX algorithm is founded on the idea of replacing the discrete constellation O with its convex hull C O . This results in a biconvex problem that is easily solved without resorting to expensive discrete programming or lifting-based methods. However, in using a biconvex relaxation, there is a risk of finding a minimizer that lies somewhere in the interior of the convex hull C O , far away from any of the constellation points. We now provide conditions for which this situation does not happen. The following theorem shows that minimizers of (5) always lie on the boundary of the convex hull C O ; a proof is given in Appendix B.
Theorem 2: Let the conditions of Theorem 1 hold and further assume α > β > 0. Then, any non-zero stationary point of (5) lies along the boundary of the set C O .
Remark 3: Theorem 1 and Theorem 2 do not guarantee PrOX to find a global minimizer to the ML-JED problem (3), but rather stationary (or saddle) points that lie on the boundary of the convex hull-although we often observe global minimizers in practice (cf. Section V-A). The development of stronger results that provide conditions for which PrOX finds the optimal solution is challenging and left for future work.
D. Simplifying the Preprocessing Stage
For a large number of time slots K +1, computing the matrix inverse G = (I − α −1 G) −1 in (8) results in high preprocessing complexity. We now propose an approximate method that substantially reduces the preprocessing complexity for PrOX at a negligible loss in terms of error-rate performance.
Let G < α for any consistent matrix norm · . Then, we have the following Neumann series expansion [36] :
As put forward in [8] for data detection in massive MIMO systems, one can approximate a matrix inverse by truncating the series expansion to the first two terms, which is
Evidently, the expression on the right-hand side does not require the computation of a matrix inverse, which significantly reduces the preprocessing complexity. More specifically, the matrix to be inverted in the left-hand side of (12) is of dimension (K + 1) × (K + 1). The complexity (in terms of the number of real-valued multiplications) of an explicit matrix inversion using, e.g., the Cholesky factorization scales with (K + 1) 3 and exhibits stringent data dependencies when implemented in VLSI; see [8] for more details on the computational complexity of Cholesky-based matrix inversion and a VLSI design. Hence, for a large number of time slots K , the approximate preprocessing method in (12) yields significant savings in terms of hardware complexity.
We have the following result that bounds the error of the approximation in (12); the proof is given in Appendix C.
Theorem 3: Let α > G . Then, the error of the approximation in (12) is bounded by
This result implies that if G is smaller than α, then the approximation error will be small. In other words, the approximation error will depend on the parameter α, which we can tune in practice for optimal performance. In Section V-A, we will show that PrOX with this approximation results in excellent error-rate performance while significantly reducing the complexity compared to using the original matrix G.
E. Hardware-Friendly Variant of PrOX
As a last step, we now modify the PrOX algorithm in (8) and (9) to make it more hardware friendly. Since the BS is assumed to know the first entry of s, there is no need to apply the algorithm to this particular entry. Consequently, we force the entry s (t ) 1 to beš at the end of each iteration. The matrix G exhibits, in general, a large dynamic range for different system configurations and channel realizations. To facilitate fixed-point design, we divide all of its elements by a constant γ > 0, so that the entries of the resulting matrix are close to one in absolute value. This scaling procedure requires us to introduce a new vectorq = γ −1 q, resulting in the following modified PrOX algorithm:
where = θγ . With these two modifications, we arrive at the hardware-friendly version of PrOX summarized as follows:
Algorithm 1 (Hardware-Friendly PrOX): Fix the algorithm parameters > 0 and α > G . Precompute the matrix G using one of the following expressions:
and initialize s (0) =š(G 1,1 ) −1 g c 1 . Then, for every iteration t = 1, 2, . . . , t max , compute the following steps:
At the end of iteration t max , compute hard-output estimates of the transmitted signals as follows:
In what follows, we will use PrOX to refer to Algorithm 1 with exact preprocessing as in (15) and approximate PrOX (APrOX) to refer to Algorithm 1 with approximate preprocessing as in (16) . We also note that α > 0 and > 0 are algorithm parameters that can be tuned via numerical simulations to improve the error-rate performance.
IV. VLSI ARCHITECTURE
We now propose a VLSI architecture for PrOX, which exhibits high regularity and enables us to achieve high throughput at low hardware complexity. Figure 1 shows the VLSI architecture for PrOX in a system that uses QPSK 2 modulation. At a high level, our architecture consists of a linear array of N = K + 1 processing elements (PEs), each one dedicated to computing an entry of the vectors s andq. To execute Algorithm 1, each PE has three key components: (i) a G-matrix memory, (ii) a complexvalued multiply-accumulate (MAC) unit, and (iii) a projection unit (see the right side of Figure 1 ). The G-matrix memory comprises two memories to store the real and imaginary parts ofĝ r k , i.e., the kth row of the matrix G. We assume that G, which is the result of either (15) (for PrOX) or (16) (for APrOX), was computed and loaded in the memories during a preprocessing step. The MAC unit of the kth PE is used to sequentially compute the kth row of the matrix-vector product (MVP) in
A. Architecture Overview
Step (17) of Algorithm 1, resulting inq (t ) k (see Section IV-B for the details). Finally, the projection unit of the kth PE computes the kth entry of Step (18) of Algorithm 1, i.e., it usesq (t ) k to acquire s (t ) k . Since s (t ) 1 =š, the first PE does not need the previous three key components. Instead, PE 1 only contains two multiplexers and flip-flops that store the predefined constellation pointš in order to implement
Step (19) of Algorithm 1. The implementation of all other PEs is identical and uses the aforementioned three key components. For these PEs, the hard-output estimates in Step (20) are extracted directly from the sign bits at the outputs of the projection units.
B. Matrix-Vector Product (MVP)
Before explaining the operation details of the proposed VLSI architecture, we focus on the main computation of PrOX: the MVP operation in Step (17) of Algorithm 1. A straightforward way for computingq (t ) = Gs (t −1) using a linear array of PEs is depicted in Figure 2 k in a sequential manner over k = 1, . . . , N clock cycles. Here,ĝ c k represents the kth column of G. While such an approach is conceptually simple, it requires a centralized vector memory which suffers from a high fan-out at its output (highlighted with red color in Figure 2(a) ). More concretely, the vector memory output must be connected to N = K + 1 MAC units, eventually becoming the critical path for systems with a large number of time slots K + 1. While wire pipelining at the output of the vector memory could be used to mitigate the fan-out, we propose an alternative solution that avoids this issue altogether.
Consider the MVP architecture depicted in Figure 2 (b). In this architecture, we replace the centralized vector memory by a set of registers: each register is associated with a PE and contains an entry of the vector s (t −1) . Furthermore, the registers are chained together in a cyclic manner, forming a shift register that enables all entries of s (t −1) to reach all the PEs in a round-robin fashion. Then, each PE only needs to access the correct entry ofĝ r k to ensure that the accumulator of their MAC unit contains the correct MVP result after N clock cycles. By following this approach, each register only drives the next register in the cycle and one of the MAC unit inputs, regardless of N, hereby completely avoiding the fanout issue. We henceforth call this architecture input-cyclic MVP, which builds the core component of the PrOX architecture shown in Figure 1 and detailed in the next subsection.
We note that the reading pattern of the G-matrix memory for each PE corresponds to simply accessing the entries ofĝ r k in order. However, in contrast to the conventional architecture in Figure 2 (a), reading can start from any element and wraps around after reaching the end ofĝ r k . Instead of using address generation logic required to implement this behavior, we store a cyclically-permuted version ofĝ r k in the G-matrix memory of each PE. By doing so, each PE simply needs to access its G-matrix memory in regular order starting from the first address, as it would be done in the conventional architecture depicted in Figure 2 (a).
C. Operation Details of the PrOX Architecture
To implement the input-cyclic MVP architecture, the entries of the G matrix for the kth PE are stored in the following way: The first address of the PE memory contains G k,k ; the second address contains G k,k+1 , and so on. For example, in the case shown in Figure 2(b) , the second PE would have its G-matrix entries organized as follows: { G 2,2 , G 2,3 , G 2,1 }.
In the first clock cycle, the kth PE has access to both G k,k and s (t −1) k , so it can compute the G k,k s (t −1) k product, and store the result in its accumulator. At the same time, this PE passes its current s (t −1) entry to the subsequent PE in the chain, i.e., the (k−1)th PE. Passing the current s (t −1) entry to the next PE will be performed in all PEs during all the clock cycles of the MVP computation. An example of this first clock cycle is shown in Figure 2(b) , where the second PE computes G 2,2 s 2 and passes the s 2 value to the first PE, so that the first PE has access to s 2 in the second clock cycle.
In the second clock cycle, the kth PE receives s (t −1) k+1 from the (k + 1)th PE, the previous PE in the array. As a result, the kth PE can compute G k,k+1 s (t −1) k+1 and accumulate it with the previous product. For example, in the second clock cycle of Figure 2(b) , the second PE receives s 3 from the third PE and is hence able to compute G 2,3 s 3 and add it with G 2,2 s 2 .
In the third clock cycle, the kth PE receives s
k+2 from the (k + 1)th PE. Note that, while the s (t −1) k+2 value was initially in the (k + 2)th PE, the (k + 2)th PE previously passed this value to the (k + 1)th PE. Then, the kth PE has the necessary operands to continue executing its complex-valued MAC operation. To clarify this aspect, let us examine the third clock cycle in Figure 2 (b): the second PE has access to G 2,1 and s 1 to finish its computation. While the second PE received s 1 from the third PE, the third PE received s 1 from the first PE in the previous clock cycle.
This example illustrates how the exchange of values between PEs enables each entry of s (t −1) to circulate through the linear array of PEs. In the case of the example in Figure 2 (b), a complete circulation of all values in the shift registers takes three clock cycles. In general, after K + 1 clock cycles, all PEs have had access to all the entries of s (t −1) and used them to compute their respective entry ofq (t ) . During this procedure, the first PE (which is implemented differently than the other PEs) executes the same procedure: it passesš to the (K + 1)th PE during the first clock cycle, and in the subsequent clock cycles sends the s (t −1) entry received from the second PE to the (K + 1)th PE. As the MAC units contain three pipeline stages, two clock cycles are required to flush the pipeline. Hence, the MVP operation in
Step (17) of Algorithm 1 is computed in only K + 3 clock cycles. We now describe the operation of the projection unit shown on the right side of Figure 1 , which implements s (t ) = prox C K +1 O ( q (t ) ). For QPSK, the projection unit of the kth PE consists of two identical modules: one module takesq = (q (t ) k ) as its input, while the other module takesq = (q (t ) k ) as its input. For BPSK, only the module with an input of q = (q (t ) k ) is required. The function of each module is to compute q and to clip its value in case it exceeds an absolute value of 1. In other words, each module determines if q is smaller than −1, larger than +1, or in-between these numbers, and outputs −1, +1, or q, respectively. As > 0, to determine if q ≥ +1, one can also checkq ≥ +1/ or, equivalently, ifq − 1/ ≥ 0, which can be extracted from the sign bit ofq − 1/ . Analogously, to determine if q < −1, one can also checkq + 1/ < 0, which can be extracted from the sign bit ofq + 1/ . As a result, each module takes its inputq and computesq − 1/ andq + 1/ . Using the sign bits of these two results, the module can select if the output will be −1, +1, or q. The quantity q is computed in parallel with the calculation ofq + 1/ andq − 1/ , which reduces the critical path. By restricting the parameter to be a power of two, the multiplication can be carried out with inexpensive arithmetic shifts. The projection unit uses one clock cycle to complete its operation, but it can start only after the K + 3 clock cycles used by the complex MAC unit. After the projection unit has finished, the new s (t ) k will be available at the inputs of the complex-valued MAC unit of the kth PE, ready to start a new iteration. Consequently, each PrOX iteration requires a total number of K + 4 clock cycles.
D. Implementation Details
We now provide the remaining implementation details, including the fixed-point parameters as well as memory considerations for our FPGA and ASIC designs.
1) Fixed-Point Parameters: To minimize area and power consumption, and to maximize the throughput, we deploy fixed-point arithmetic in our design. Specifically, our architecture uses 6 bit signed fixed-point values for representing the s (t ) entries, with 3 fraction bits. For the elements of G, 12 bit signed fixed-point values are used, with 11 fraction bits. While 18 bit values are generated at the outputs of the multipliers in the complex-valued MAC unit, only 15 bits are used in the subsequent adders and accumulators (of which 11 are fraction bits). The adder and subtractor that receive the outputs from the multipliers do not saturate, but rather wrap-around which reduces area and delay. In contrast, the accumulator adders saturate. The outputs of the complex-valued MAC unit are also 15 bit signed fixed-point values with 11 fraction bits. In the projection unit, the −1/ and +1/ quantities are represented using 12 bit signed fixed-point numbers with 11 fraction bits, and are added to the outputs of the complex MAC unit using 15 bit saturating adders. The outputs of these adders are used to determine the output of the projection unit. Furthermore, for all the considered system configurations, > 1. As is restricted to be a power of two, a multiplication with is implemented by an arithmetic left shift. The number of shifts is encoded with a 4 bit number. Only the 6 most significant bits of the shifted value are taken into the multiplexer of the projection unit, as the output of this multiplexer will become the next iterate s (t +1) . We note that these fixed-point parameters are sufficient to achieve near-floating-point performance; see Figures 3 and 4 in Section V-A for corresponding symbol error-rate performance results.
2) Memory Considerations: For the FPGA designs, the G-matrix memory is implemented using look-up tables (LUTs) as distributed RAM, i.e., no block RAMs have been used. For the ASIC designs, the G-matrix memory is implemented using latch arrays as described in [37] ; these memories are built from standard cells and simplify automated design synthesis, without a significant area penalty compared to SRAM macrocells.
V. RESULTS
We now show error-rate performance results as well as FPGA and ASIC implementation results. We also compare our design to the only other existing JED design reported recently in [17] .
A. Error-Rate Performance 1) Uplink Data Detection: Figure 3 shows uplink symbol error-rate (SER) simulation results for PrOX and APrOX, both running t max = 5 iterations. The simulation results are obtained from 50, 000 Monte-Carlo trials in a B = 16 BS antenna SIMO system with K = 16 time slots. We consider both an i.i.d. flat Rayleigh block-fading channel model as well as a line-of-sight (LoS) channel with a spherical wave model and a linear BS antenna array with λ/2-wavelength spacing as described in [38] . As reference points, we also include the performance of maximum ratio combining (MRC)-based data detection with both perfect receive-side CSI (denoted by "CSIR") and with conventional channel estimation, optimal ML-JED detection using the sphere-decoding algorithm put forward in [12] , and the recently proposed TASER (short for Triangular Approximate SEmidefinite Relaxation) algorithm [17] running for 10 iterations. Note that MRC with perfect CSIR is optimal for these scenarios. However, the assumption of perfect CSIR cannot be realized in practice and one must resort to channel estimation (CHEST), which entails a performance loss of more than 4 dB at 1% SER. PrOX and AProX achieve similar error rate performance. Furthermore, both of our algorithms significantly outperform MRC with CHEST and approach near-ML-JED performance but, in stark contrast to ML-JED data detection, at very low computational complexity. Our algorithms also outperform TASER, especially for QPSK modulation; see Section V-B for a hardware comparison. We also show the fixed-point performance of the VLSI designs of PrOX. The markers of PrOX and APrOX correspond to the fixed-point performance of golden models that exactly match the outputs of our VLSI designs, whereas the curves correspond to MATLAB floatingpoint performance-evidently, our fixed-point VLSI designs exhibit virtually no implementation loss.
2) Downlink Beamforming: Figure 4 shows downlink SER simulation results for PrOX and APrOX with the same system and algorithm parameters. We assume channel reciprocity [5] , i.e., the downlink channel is the transpose of the uplink channel. Hence, we can use the estimated channel acquired in the uplink for MRC-based beamforming in the downlink. We observe similar trends as for the uplink shown in Figure 3 . In particular, PrOX, APrOX, and TASER all achieve near-ML-JED performance. MRC and MRC that uses the estimated data vector s MRC to compute a re-trained channel estimate according toĥ = Yŝ MRC / ŝ MRC 2 2 (referred to as "RT") is able to approach our algorithms by 2 dB and 4 dB, respectively. Hence, we conclude that JED also significantly improves the reliability of data transmission in the downlink.
3) Performance vs. Complexity Trade-Offs: Figure 5 shows the trade-offs between the ASIC throughput of PrOX and AProX, and the minimum signal-to-noise ratio (SNR) required to achieve 1% SER in the uplink for SIMO systems with K = 16 time slots and a varying number of BS antennas. As a reference, we include the performance of MRC with perfect CSIR and that of the optimal ML-JED detector. We observed that by increasing the number of PrOX and APrOX iterations t max , the mean squared error (MSE) of the channel estimate is reduced monotonically and the desired SER is achieved at a lower SNR at the expense of reduced throughput. Clearly, PrOX outperforms APrOX for a small number of iterations; this implies that the preprocessing approximation proposed in Section III-D entails a small performance loss when a small number of iterations is used. Note that for BPSK modulation, t max = 2 PrOX iterations are typically sufficient to reach near-ML-JED performance at an ASIC throughput of 338 Mb/s per PrOX instance. For QPSK, 3 PrOX iterations are sufficient at an ASIC throughput of 451 Mb/s.
Remark 4:
For all provided simulation results, we have assumed perfect synchronization and a transmission free of impairments or pilot contamination. 3 Hence, the provided simulation results may not be representative for other system configurations or more realistic communication scenarios. The MATLAB simulator for PrOX used in this paper is available on GitHub: https://github.com/VIP-Group/PrOX; this enables the interested readers to investigate such aspects in more detail.
B. FPGA and ASIC Implementation Results
To demonstrate the efficacy of PrOX, we now provide FPGA implementation results on a Xilinx Virtex-7 XC7VX690T and ASIC implementation results in a 40 nm CMOS technology. We implemented for different array sizes N ∈ {5, 9, 17, 33} in Verilog on register transfer level (RTL) and hence, our FPGA and ASIC designs support near-ML-JED for K ∈ {4, 8, 16, 32} time slots, respectively. We also provide a comparison to the only other existing FPGA and ASIC designs for JED proposed in the literature [17] . 1) FPGA Implementation Results: Our FPGA implementation results were obtained using the Xilinx Vivado Design Suite, and are summarized in Table I . As expected, the resource utilization scales nearly linearly with the array size N. For the N ∈ {5, 9, 17} arrays, the critical path of PrOX is in the PEs' projection unit; for the largest N = 33 array, the critical path is in the distribution of the control signals. Table II compares PrOX with TASER [17] , which have been implemented on the same FPGA for a SIMO system with B = 128 BS antennas and communication through K = 8 time slots. PrOX requires significantly fewer resources and lower power than TASER, while achieving a substantially higher throughput. This makes our design superior to TASER in terms of both hardware-efficiency (measured in throughput per FPGA LUTs) and energy per bit. Concretely, PrOX is 20× more hardware-efficient and 8× more energy-efficient than TASER for the considered scenario.
2) ASIC Implementation Results: Our ASIC post placeand-route implementation results summarized in Table III were obtained using Synopsys DC and IC compilers with a 40 nm CMOS standard-cell technology. Figure 6 shows the corresponding ASIC layouts. For the N ∈ {5, 9} arrays, the critical path of PrOX is in the PEs' projection unit; for N = 17, the critical path is in the multipliers of the MAC unit and, for the largest N = 33 array, the critical path is in the address generation logic and readout circuitry of theĝ r k memories. [17] , which have been implemented on the same 40 nm CMOS technology for a SIMO system with B = 128 and K = 8. Similar to the FPGA case, the PrOX ASIC designs require significantly fewer resources and lower power than the TASER ASIC designs, while achieving a substantially higher throughput. Concretely, PrOX is 25× more hardware-efficient (in terms of the throughput per cell area) and 16× more energy-efficient than TASER for the considered scenario.
We also note that PrOX exhibits a similar (for BPSK) or better (for QPSK) SER performance than TASER, while using fewer algorithm iterations; see Figures 3 and 4 for the associated SER results. However, while PrOX is only suitable for JED in large SIMO systems, TASER is also able to perform near-ML data detection in coherent massive multiuser MIMO systems that use conventional pilot-based channel training [17] . Hence, the error-rate performance and hardware complexity advantages of PrOX over TASER come at the cost of reduced flexibility.
Remark 5: While a plethora of VLSI designs exist for data detection in coherent small-scale and massive (multiuser) MIMO systems that separate channel estimation from data detection (see [8] , [19] , [22] , [23] , [39] - [46] and the references therein), TASER is the only other SIMO-JED VLSI design that has been described in the literature [17] . Furthermore, low-complexity linear methods that are commonly used for conventional (multi-user) MIMO data detection cannot be used in the JED scenario (as briefly discussed in Section II-B). Hence, we limited our comparisons in Table II and Table IV to PrOX and TASER.
VI. CONCLUSIONS
We have proposed a novel joint channel estimation and data detection (JED) algorithm, referred to as PRojection Onto conveX hull (PrOX). Our algorithm builds upon biconvex relaxation (BCR) [18] , which enables near-maximumlikelihood (ML) JED performance at low computational complexity for large SIMO systems with constant-modulus constellations. We have provided theoretical convergence guarantees for PrOX and introduced an approximation that significantly reduces the preprocessing complexity. To demonstrate the effectiveness of our algorithm, we have proposed a VLSI architecture that consists of a linear array of processing elements (PEs). Our architecture enables high-throughput near-ML-JED at low silicon area and power consumption. We have provided FPGA and ASIC implementation results, which demonstrate that PrOX significantly outperforms the only other existing JED design [17] in terms of hardware-and energyefficiency as well as error-rate performance. More generally, PrOX constitutes a first step towards hardware accelerators that are able to find approximate solutions to problems that resemble the famous MaxCut problem in a hardware-efficient manner.
There are many avenues for future work. An extension of PrOX to compute soft-outputs in the form of log-likelihood ratios (LLRs) is part of ongoing work. Furthermore, a theoretical error-rate performance analysis of PrOX is a challenging open research problem. Finally, while some algorithms exist for JED in the MIMO case and for general (non-constant modulus) constellation sets [12] , [21] , they are either computationally complex or exhibit a considerable loss with respect to the optimal ML-JED performance. The development of efficient JED algorithms that can be implemented as highthroughput VLSI designs for large (multi-user) MIMO systems with arbitrary constellations is part of ongoing work.
APPENDIX A PROOF OF THEOREM 1
We need a closed form expression for the gradient so that we can say something about its convergence. We have ∂ q f (q (t ) , s (t ) ) = −Y T Yq (t ) + α(q (t ) − s (t ) ).
From the optimality condition for the minimization step (6), we have
and thus ∂ q f (q (t ) , s (t ) ) = α(s (t −1) − s (t ) ).
Now that we have a simple representation for the gradient, we can bound its norm. Note that f is strongly convex in q with parameter α − Y T Y , and so
where q s = arg min q f (q, s) . 
where s q = arg min s f (q, s). If we choose q = q (t −1) and s = s (t −1) in (22), then q s = q (t ) and we have f (q (t −1) , s (t −1) ) − f (q (t ) ,
Similarly, from (23) we get
Adding these two inequalities yields
Summing from t = 1 to t = T, and observing that the summation on the left telescopes, we get f (q (0) , s (0) ) − f (q (T ) , s (T ) )
Because the iterates remain bounded and f is continuous, the left-hand side of (25) is bounded from above. We can conclude that s (t −1) − s (t ) → 0, and from (21), we see that the residuals converge as well. Finally, because the sub-gradients of f depend continuously on s and q, any limit point of the sequence of iterates must have zero sub-gradients, and is thus a stationary point.
APPENDIX B PROOF OF THEOREM 2
Let (q * , s * ) denote a non-zero stationary point of f that lies in the interior of C O . Such a point satisfies the first-order conditions −Y T Yq * + α(q * − s * ) = 0 (27) α(s * − q * ) − βs * = 0.
From (28), we see that q * = α−β α s * . Plugging this result into (27) Since s * is non-zero, this shows that s * is an eigenvector of Y T Y with negative eigenvalue. This is impossible because the Gram matrix Y T Y is positive semi-definite. It follows that (q * , s * ) cannot satisfy the first-order conditions (27) and (28) , and thus, cannot lie in the interior of C O .
APPENDIX C PROOF OF THEOREM 3
Since G < α, we have the Neumann series expansion [36] in (11) . Hence, we can bound the approximation error from above as follows:
Here, step (a) follows from the triangle inequality, step (b) from the fact that the spectral norm is a consistent matrix norm, step (c) is a result of basic arithmetic manipulations, step (d) follows from geometric series expansions, and step (e) is the same expression in simplified form. We note that the proof continues to hold for any consistent matrix norm.
