Abstract-The recent paradigm shift towards the transmission of large numbers of mutually interfering information streams, as in the case of aggressive spatial multiplexing, combined with requirements towards very low processing latency despite the frequency plateauing of traditional processors, initiates a need to revisit the fundamental maximum-likelihood (ML) and, consequently, the spheredecoding (SD) detection problem. This work presents the design and VLSI architecture of MultiSphere; the first method to massively parallelize the tree search of large sphere decoders in a nearly-concurrent manner, without compromising their maximum-likelihood performance, and by keeping the overall processing complexity comparable to that of highly-optimized sequential sphere decoders. For a 10 Â 10 MIMO spatially multiplexed system with 16-QAM modulation and 32 processing elements, our MultiSphere architecture can reduce latency by 29Â against well-known sequential SDs, approaching the processing latency of linear detection methods, without compromising ML optimality. In MIMO multicarrier systems targeting exact ML decoding, MultiSphere achieves processing latency and hardware efficiency that are orders of magnitude improved compared to approaches employing one SD per subcarrier. In addition, for 16Â16 both "hard"-and "soft"-output MIMO systems, approximate MultiSphere versions are shown to achieve similar error rate performance with state-of-the art approximate SDs having akin parallelization properties, by using only one tenth of the processing elements, and to achieve up to approximately 9Â increased energy efficiency.
Abstract-The recent paradigm shift towards the transmission of large numbers of mutually interfering information streams, as in the case of aggressive spatial multiplexing, combined with requirements towards very low processing latency despite the frequency plateauing of traditional processors, initiates a need to revisit the fundamental maximum-likelihood (ML) and, consequently, the spheredecoding (SD) detection problem. This work presents the design and VLSI architecture of MultiSphere; the first method to massively parallelize the tree search of large sphere decoders in a nearly-concurrent manner, without compromising their maximum-likelihood performance, and by keeping the overall processing complexity comparable to that of highly-optimized sequential sphere decoders. For a 10 Â 10 MIMO spatially multiplexed system with 16-QAM modulation and 32 processing elements, our MultiSphere architecture can reduce latency by 29Â against well-known sequential SDs, approaching the processing latency of linear detection methods, without compromising ML optimality. In MIMO multicarrier systems targeting exact ML decoding, MultiSphere achieves processing latency and hardware efficiency that are orders of magnitude improved compared to approaches employing one SD per subcarrier. In addition, for 16Â16 both "hard"-and "soft"-output MIMO systems, approximate MultiSphere versions are shown to achieve similar error rate performance with state-of-the art approximate SDs having akin parallelization properties, by using only one tenth of the processing elements, and to achieve up to approximately 9Â increased energy efficiency.
Index Terms-Sphere decoding, parallel processing, large multiple-input-multiple-output (MIMO), lattice search Ç
INTRODUCTION
T HERE is a general consensus that future mobile [2] and local area wireless communication systems [3] , [4] shall be able to support very high peak user and network rates as well as very large numbers of connected devices, while keeping the latency requirements at very low levels. These needs have triggered a paradigm shift from orthogonal transmissions to systems where we intentionally transmit a large number of mutually interfering information streams, as in the case of multi-antenna (MIMO) deployments for aggressive spatial multiplexing. In this direction, and to keep detection complexity low, large and massive MIMO systems typically employ linear detection schemes, which however, can provide near-optimal performance only when the number of users is much smaller than the number of access-point or base-station antennas [5] , [6] . Thus, typical large/massive MIMO deployments leave a large portion of the MIMO channel capacity unexploited, just for coping with the inefficiency of the linear detection approaches. Alternatively, maximum-likelihood (ML) detection schemes, allow to efficiently demultiplex as many mutually interfering information streams (e.g., spatially multiplexed users) as the number of the observed signals (e.g., base-station antennas) [5] . Still, even after translating the ML problem into a tree search, and solving it by means of sphere decoding [5] , [7] , [8] the corresponding processing and latency requirements increase exponentially with the number of mutually interfering information streams, substantially exceeding the processing capabilities of general purpose processors. These processing requirements, along with the-soon to be reached-plateau in the speed of microprocessors [9] prevent traditional systems from supporting large numbers of mutually interfering streams and, therefore, from scaling the achievable throughput gains and device connectivity.
At the same time, emerging system-on-chip architectures promise tens or even hundreds of cores per chip [10] , something already feasible in graphics processing units (GPUs). In the presence of such multiple processing element (PE) architectures the complexity problem translates to the efficient utilization of available PEs or, equivalently, workload parallelization. Parallelizing the sphere decoder (SD) is a challenging task since its computational efficiency is determined by the ability to prune (i.e., exclude nodes from the tree search) large parts of the tree at an early stage of the tree search without compromising its algorithmic optimality. In order to achieve this, typical SD approaches providing the exact ML solution are of sequential nature. They start by finding a "good" candidate solution (i.e., one of relatively small euclidean distance from the received signal) and they continue searching sequentially for "better" candidates without compromising the ML optimality while applying tree pruning strategies that become more aggressive any time a new candidate solution is found. Trivial parallelization approaches consisting of (nearly) independent parallel subprocesses, can result in less efficient tree pruning, and in turn, in highly increased processing requirements.
An "ideal" SD workload parallelization should be scalable and able to consistently reduce latency given additional processing power. It should be complexity efficient and not substantially increase the overall workload when increasing the number of PEs. For implementation purposes it should be nearly "embarrassingly parallel" and therefore minimize dependencies and communication overhead, which introduce latency and can moderate if not obliterate scalability and parallel efficiency [11] . In order to effectively allocate available processing power (PEs), an "ideal" parallelization method should also be adjustable to the transmission conditions. The methodology should be generic and applicable to all kinds of SDs including breadth-first and depth-first as well as exact (guaranteeing the ML solution) and approximate. Finally, it should be transparent to the choice of the implementation platform. In Many-Processor Systems on Chips (MPSoCs) for example, a PE can be a designated processor, in FPGA designs, a specifically allocated part of the chip and in GPU implementations the PE can be a separate thread.
Many SD implementations involve parallelism, but without meeting the above characteristics. For example, both depth-first [8] , [12] and breadth-first [13] SDs perform several euclidean distance calculations in parallel, at each level of the SD tree, exploiting a limited degree of data parallelism. However, before the next data set can be processed in parallel, the necessary sorting operations introduce significant dependencies. This strategy is highly-dependent on the specific realization platform, inflexible, and cannot be efficiently employed to decrease latency in large SDs. Similarly, [14] proposes a low-dimensional, real-valued SD of limited degree of parallelism, accelerating the sequential case by only up to 2Â and only in low SNRs.
In GPU implementations, Khairy et al. [15] concurrently run multiple, low-dimensional (4 Â 4) SDs without though parallelizing each SD. Wu et al. [16] and J osza et al. [17] parallelize a low-dimensional MIMO detection process on GPUs. However, Wu et al. use a Trellis-decoder-like approximation of the SD which is not efficient for dense modulations and large MIMO systems, and J osza et al. perform aggressive and nearly exhaustive parallel search of multiple subtrees without accounting for the overall complexity and by exhaustively trying different partitioning configurations. Hence, their approach is inappropriate for MPSoC or FPGA implementations and lacks theoretical reasoning.
Yang et al. [18] , [19] propose a multi-core architecture for parallel high-dimensional SDs i.e., to the best of our knowledge the only other multi-core depth-first SD in the open literature. To parallelize the tree search, the SD tree in [19] is partitioned in subtrees consisting of only one node on the higher levels, and all possible nodes at the lower levels of the tree. The authors' SD partitioning starts first by allocating all the nodes of the higher level to subtrees, and each of the subtrees is then partitioned using the same principles provided that there are still available PEs. This partitioning strategy is very practical in terms of implementation, but cannot adjust to the transmission conditions [20] , [21] , [22] . Furthermore, to avoid visiting a node twice and control the overall complexity, Yang et al. employ an interconnection network to determine which of the nodes will be processed by which PE and to distribute the most promising solution from each of the subtrees. This reduces the flexibility of the approach, and therefore its efficiency when applied to platforms that require SIMD processing, as is the case of GPUs or implementations on individual processing blocks.
The fixed complexity SD (FSD) [23] sacrifices the ML optimality to acquire good parallelization properties. However, to efficiently parallelize such an SD, the available number of PEs should be a multiple of the order of the transmitted constellation. In addition, the way the FSD determines the tree paths to process in parallel, is pre-defined and cannot adjust to the transmission conditions. As shown in Section 3.2.4, our proposed approach can rectify these weaknesses. Koo et al. [24] propose a parallel version of the FSD, but with each concurrent set of tasks being followed by sequential operations. Consequently, their approach is bounded by the FSD's error-rate, and while the corresponding complexity can be smaller than FSD's this comes at the cost of random processing latency, in contrast to FSD. Moreover, [24] requires significant synchronization overhead per tree level, which together with the complex control/data flow of the processing element makes this approach unsuitable for anything besides a general purpose processor. This work proposes MultiSphere, the first SD of an "ideally" parallelized SD. As result, MultiSphere not only claims unexploited throughput in large MIMO uplink transmissions where traditional precoding approaches are infeasible, but it also facilitates efficient MIMO transmission without channel knowledge at the transmitter side (both in the uplink and the downlink), enabling high-throughput, and low-latency signal transmission as envisaged by future ultra-reliable, low latency mobile communication (uRLLC) systems. MultiSphere's unique characteristics originate from its ability to early examine the candidate vector solutions (i.e, SD tree paths) that are most likely to include the transmitted vector. We will hereafter refer to these most promising paths as seeds. The process of identifying the seeds can take place a priori, (i.e., before any information is received), and is based on the transmission characteristics (e.g., MIMO channel and signal-to-noise ratio). In this direction, in Section 3.1, we first introduce the concept of the Tree of Promise where the symbols constituting a candidate vector solution are described by their relative ordered distance to the received signal (e.g., kth closest symbol to the received signal), without though requiring the actual value of this received signal. Then, to each node in the Tree of Promise we assign a novel Metric of Promise (MoP) which approximates the actual probability of that node to be part of the transmitted vector (see Section 3.1.1). As described in Section 3.1.2, MultiSphere then employs a novel tree partitioning method which, based on the identified seeds and the number of the available PEs, splits the Tree of Promise (and equivalently the search space) into subtrees while preserving the ML optimality. In Section 3.1.3, MultiSphere employs a new method to map the actual transmitted (e.g., QAM) symbols to each subtree without performing any sorting operations, in order to preserve the complexity and energy efficiency of the approach. Then, when all subtrees are searched in a nearly concurrent manner, we propose a new node traversal strategy and enumeration in Section 3.2.1 that minimizes unnecessary euclidean distance calculations and applies to both sequential and the proposed parallel SD; in contrast to existing approaches ( [5] , [12] , [25] ) that only apply to sequential SDs.
In orthogonal multicarrier systems, like OFDM where there is no mutual interference between the subcarriers one can perform parallel detection by utilizing an exact, depthfirst SD per subcarrier. However, as we discuss in Section 3.2.3, such a naive parallelization strategy is unable to efficiently reduce latency. On the other hand, as shown in Section 5.1, parallelizing each SD by means of MultiSphere and by sequentially processing each subcarrier, we can reduce latency by several orders of magnitude, for the same number of PEs and while preserving the ML optimality. Furthermore, in order to effectively exploit a large number of available PEs, Section 3.2.2 introduces a method that exploits the newly introduced MoPs to adjust the number of allocated PEs so that if their utilization is unable to significantly reduce latency, we can avoid PE redundancy and thus increased complexity. Then, based on this method, in Section 3.2.3 we propose a new PE scheduling approach, that allocates a variable number of PEs to different SD processes, and we show that such a scheme, is the first able to effectively reduce latency by increasing the number of PEs, while preserving ML optimality in a multicarrier multiantenna (MIMO) system.
MultiSphere has been evaluated in MIMO, spatiallymultiplexed, multi-carrier systems using both mathematically modelled channels and actual channel traces collected in an indoor environment. Several versions of MultiSphere have been considered, ranging from exact "hard" to approximate "soft-output". We show that compared to MIMO systems which exploit parallelism on a subcarrier level, MultiSphere can achieve two orders of magnitude reduction in processing latency at the same degree of parallelism while it can still exploit any amount of available PEs in order to further reduce this latency, even if that amount exceeds the number of subcarriers. Despite MultiSphere's seemingly more complex structure (see Section 3.2.1), our novel VLSI architecture and effective design concepts (Section 4) show that MultiSphere's VLSI processing throughput for exact detection with 32 processing elements for a 10 Â 10, 16-QAM system is 29Â higher against efficient sequential detectors, thus validating MultiSphere's efficiency. Similarly, for approximate soft-output detection our approach achieves almost an order of magnitude increased efficiency.
The rest of the paper is organized as follows. Section 2 introduces sphere decoding fundamentals. Section 3 presents MultiSphere's design while Section 4 proposes an efficient VLSI architecture for both hard and soft-output MultiSphere. Section 5 extensively evaluates the proposed algorithmic design and VLSI architecture and Section 5.1 discusses MultiSphere's extension to soft-output systems.
SPHERE DECODING FOR MIMO SYSTEMS
For a spatially multiplexed MIMO system consisting of n t transmit and n r receive antennae the received signal vector is y ¼ Hs þ w, where H is the n r Ân t MIMO channel matrix, s is the transmitted symbol vector whose elements belong to a constellation O of size jOj and w is the additive white Gaussian noise vector. By QR-decomposing the MIMO channel matrix as H ¼ QR the ML problem translates into findingŝ ¼ arg min s2O n t ke y À Rsk 2 , with R ij being the elements of R, and e y ¼ Q Ã y [5] , [7] , [8] . Since R is an upper triangular matrix, finding the ML solution can be transformed into a tree search of height n t and branching factor jOj. Each node at a level l can be characterized by its partial symbol vectors ðlÞ ¼ s l ; s lþ1 ; . . . ; s n t Â Ã which describes the path from the root to that node, as well as from its partial euclidean distance (PD) which can be calculated recursively as dðs ðlÞ Þ ¼ dðs ðlþ1Þ Þ þ cðs ðlÞ Þ, where cðs ðlÞ Þ ¼ e y l À P n t j¼l R lj s j 2 is the non-negative cost assigned to each branch. The ML problem translates then into finding the leaf-node with the minimum dðs ð1Þ Þ. In depth-first SDs with radius update and Schnorr-Euchner enumeration [5] , [8] , the initial squared radius r 2 ¼ 1. Any time a leaf node is reached for which dðs ð1Þ Þ < r 2 , r 2 is updated to dðs ð1Þ Þ. Upon meeting a node s ðlÞ , if dðs ðlÞ Þ !r 2 then this node, its children nodes and its siblings with all their descendants are excluded from the tree search (i.e., they are pruned). Following the Schnorr-Euchner treetraversal [26] for node expansion, the nodes are visited in ascending order of their PDs. Since depth-first SDs with radius update and Schnorr-Euchner enumeration have been shown to be very efficient in practice [5] , [8] and capable of delivering the ML solution, this is the structure that we will adopt for all our parallel SDs.
MULTISPHERE DESIGN
In order to describe all possible solutions, alternatively to the "traditional" SD tree, MultiSphere introduces the concept of a Tree of Promise where the symbols constituting a candidate vector solution (i.e., the SD tree nodes) are described by their relative ordered distance to the received signal. Then, MultiSphere introduces a new Tree of Promise partitioning method, which adjusts to the transmission channel without compromising the ML optimality (see Section 3.1). The Tree of Promise (and therefore the original SD tree) is split into subtrees which are processed in parallel by the PEs. This partitioning can take place offline, based on the average channel characteristics, or "on-the-fly", whenever the transmission channel changes following each QR decomposition. This adds preprocessing latency to that of the QR decomposition. However, the partitioning latency scales linearly with n t in contrast to the QR decomposition latency which scales almost cubicly with the number of transmit antennae. After SD partitioning, MultiSphere applies a new symbol-to-subtree allocation method (see Section 3.1.3) which, in contrast to other schemes [18] , maps nodes to PEs without introducing dependencies and minimizes redundant calculations across PEs. Each PE performs depth-first subtree traversal with Schnorr-Euchner enumeration [26] , according to which, nodes are visited in ascending order of their PDs. Several approaches have been proposed [5] , [8] , to avoid exhaustive calculation and sorting of the PDs. However, they are not applicable to MultiSphere since their ordering is sequential (finding the kth smallest PD, requires finding the ðk À 1Þth smallest PD first, starting from k ¼ 1). To that end, in Section 3.2.1 we introduce a new tree traversal and enumeration method which meets MultiSphere's needs.
MultiSphere's tree searches execute nearly independently. They interact only once, after they have all reached their first leaf node. The r 2 of each subtree is then replaced by that of the leaf node with the minimum PD across all parallel SDs. The search is terminated when all subtrees have been searched, the detection output being the leaf node with the minimum PD across all subtrees. The overall processing latency is determined by the slowest parallel SD.
MultiSphere's Preprocessing: SD Tree Partitioning
MultiSphere's SD partitioning consists of a) the seeds identification, which finds N PE paths (seeds) in the Tree of Promise, the ones most promising to constitute the correct solution (i.e., to be the transmitted vector) with N PE being the number of available PEs and b) the seeds to subtrees expansion, which assembles subtrees around these seeds so that their union forms the original SD tree.
Seeds Identification
The relative position vector (RPV) m describes a tree path, in the Tree of Promise by means of the ordered (in terms of PDs) position of its nodes to the received observable. If the lth element of m equals k, then, for the corresponding path, its node at level l is the kth closest node to the received e y l . If, for example, m ¼ ½1; 2; 3 T the path consists of the node with the third smallest PD at the highest level of the tree (i.e., mð3Þ ¼ 3), its child with the second smallest PD at the second level of the tree, and its child with the smallest PD at the lowest level of the SD tree.
Finding the exact probability for each path to include the correct solution is clearly an non-trivial task that would require difficult integrations with no obvious, closed-form solutions. In order to avoid such calculations we propose a Metric of Promise (MoP) M, related to a proposed approximation of the corresponding probability (please refer to the Appendix for details). In particular, the proposed MoP for an SD tree path with RPV m is
, where x m denotes the symbol vector related to path m and a l depends on the minimum distance d QAM between QAM symbols at level l (e.g., for d QAM ¼ 2, a l ¼ 1:11). Then, the smaller M m ð Þ is, the more likely it is for path m to include the correct solution.
The MoP in Eq. (1) 
where, provided that the same QAM constellation is used per transmit antenna, the terms a l will be the same for all l values, and can be ignored as they won't affect finding the paths with the smallest M s values. As we show in Section 5, M s is equally efficient with M regarding its ability to reduce decoding latency. However, the lack of knowledge of s 2 prevents from using methods similar to the one proposed in Sections 3.2.2 or 3.2.3 to avoid the unnecessary allocation of PEs or to efficiently allocate (schedule) PEs in multicarrier systems. We note that both metrics are independent of any channel statistics, and independently exploit each channel realization. The metric M s is an improved yet less complex version, of the heuristic MoP proposed in our original work in [1] . Since the MoPs are not a function of the actual received signal they can be pre-calculated before data detection (i.e., before sphere decoding). In addition, both M and M s can be calculated in a recursive manner, similarly to traditional SDs. As we explain in the end of this Section, the seeds identification does not need to be exact to preserve ML optimality. Thus, they can be found in a K-Best manner with K ¼ N PE , requiring latency of the order of N PE Án t .
Algorithm 1. MultiSphere's Seeds to Subtrees Expansion
MultiSphere-Seeds to subtrees expansion Input: m, jOj, n t , 
Seeds to Subtrees Expansion
The Seeds Identification process outputs N PE seeds, each with its own RPV m i (i ¼ 1; . . . ; N PE ) for each of which there is one node defined per tree level. The subtrees expansion of a seed is then used to construct a corresponding subtree T i (i.e., for each of which there is a range of nodes defined per tree level) which, similarly to the seeds, is also expressed in terms of RPVs. These subtrees will later be processed by the respective PE. Seeds m i are sorted in ascending order of their indices m i ðlÞ, starting from l ¼ n t . Seeds with the same elements from n t until a level l are sorted in ascending order of elements at level lÀ1. Then, for each m i the subtrees are created according to Algorithm 1, so that the union of all T i forms the original SD tree, in order to preserve ML optimality. In the example of Fig. 1 , the expansion of T 2 starts from the first node that is part of m 2 and not part of adjacent seeds, i.e., m 1 and m 3 (step 2). The algorithm continues by allocating sibling nodes and their descendants (step 3), and continues traversing up the tree and allocating sibling nodes with greater indices that do not belong to the following seed (step 5 allocates no such nodes). The algorithm will, in the worst case, reach level l ¼ 1, and then traverse up to l ¼ n t allocating nodes in the process, and resulting in a latency of Oð2n t Þ.
Symbol-to-Subtree Allocation
Tree partitioning provides the nodes to be processed by each MultiSphere SD as a function of their ordered distance. In Fig. 1 , for example, subtree T 3 , will consist of the 2 nd and 3 rd closest symbols at l ¼ 3 (or the nodes for which m 3 ð3Þ ¼ 2 and m 3 ð3Þ ¼ 3) and all their descendants. In high order systems, finding the actual symbols would require exhaustive PD calculations and sorting of the corresponding nodes multiple times across the parallel SDs. To avoid these redundant calculations, MultiSphere uses an approximate predefined order to allocate symbols to subtrees, based on calculating minimum euclidean distances depending on the relative position of the received point and the constellation geometry. In addition, it uses a symbol mapping of two-dimensional zigzag coordinates. In one-dimensional symbol constellations, the sorted order of the symbols in terms of their distance to the received point can be easily found in a zigzag manner [5] , [8] after finding the closest constellation symbol s ðcÞ l to the "equivalent (in the constellation domain) received point"
Using the zigzag concept each symbol in a two-dimensional constellation can be mapped in terms of its zigzag coordinates ðzz x ; zz y Þ, as shown in We note that despite the fact that there are four possible squares where y l can lie, all d min values, predefined orders and corresponding zigzag coordinates exploit constellation symmetry, significantly reducing the architectural overhead compared to the preordering initially proposed in [1] . We note that this approximate ordering does not affect optimality since the union of all constructed sub-trees still forms the original SD tree. In addition, since the relative sequence is stored via zigzag coordinates, mapping is feasible even if s ðcÞ l is an outer constellation symbol. Then, d min is still a valid lower limit of the corresponding euclidean distance, since zigzagging from s ðcÞ l will point to a symbol which is even further than what was initially assumed.
MultiSphere's Sphere Detection

Tree Traversal and Enumeration
Upon expanding a node, MultiSphere's SDs visit children nodes in ascending order of their PDs (see Section 2). To avoid calculating and sorting all PDs at each level, enumeration methods have been proposed [5] , [8] which, as discussed, are not applicable to MultiSphere due to their sequential ordering. MultiSphere performs the following enumeration instead. From the set of potential symbols to be visited at a specific tree level and for each existing zz x coordinate, we identify the symbols with the minimum zz y . This results in a subset of at most ffiffiffiffiffiffi ffi jOj p symbols having unique zz x s. Out of this subset, we calculate the PD of the symbol with the minimum zz x for which zz y ¼ 0, if such a symbol exists. In addition, we calculate and store in a buffer Q, of maximum size ffiffiffiffiffiffi ffi jOj p the PDs of the symbols with zz y 6 ¼ 0. We first visit (and remove from Q) the symbol with the smallest PD in Q. If ðzz 
Their union is the full SD tree. Nodes may appear among several subtrees. smallest PD in Q, i.e., ð3; 1Þ. Subsequently, we add ð3; 2Þ to Q and the process continues.
As verified in Section 5 by both our software and VLSI design evaluations, MultiSphere can preserve the ML optimality due to three reasons: (a) the "Seeds to subtrees" expansion is such that, independently of the seeds, the union of all parallel subtrees will include all the nodes of the initial (sequential) SD tree. As a result, no possible solution is excluded from the search. (b) While the mapping of nodes to subtrees is approximate, no node is excluded from the final search; some nodes may instead be mapped onto different subtrees. (c) The new enumeration approach ensures that each parallel tree search visits nodes in ascending PD order as in the Schnorr-Euchner enumeration, which also preserves the ML optimality.
Adjusting the Number of Allocated PEs
Depending on the channel condition and SNR, allocating more PEs to an SD tree search can possibly increase complexity without any further latency reduction. In order to achieve low latency without unnecessary PE utilization, MultiSphere can allocate to an SD search only K PEs, K being the minimum value for which
with M k being the kth smallest MoP. Subtrees expansion can then take place by using only K out of the N PE available PEs, and their corresponding K seeds. The sum in (4) approximates the probability that the correct solution lies among these K seeds. Therefore, when this probability reaches a predefined value b that we evaluate via simulations in Section 5.1 then no more PEs are employed. Fig. 8 for example shows that by setting b ¼ 0:5 at 16 dB SNR, allows utilizing 62 PEs to achieve the same average latency as a 128-PE MultiSphere at half of the latter's complexity.
PE Scheduling for MIMO Multicarrier Systems
In MIMO multicarrier systems, we can inherently parallelize the workload by allocating one sequential SD per subcarrier. However, and as we show in Section 5.1, such a method is inefficient when targeting the exact ML solution since the latency of the multi-carrier frame is determined by the "slowest" SD. As shown in Fig. 9 , if we instead use MultiSphere with 8-PEs to parallelize the detection of each subcarrier and process each subcarrier sequentially, we reduce latency by a factor of three compared to allocating one sequential SD per subcarrier (52 in total). However, as described in Section 3.2.2, when a large number of PEs is available to MultiSphere, allocating all PEs to a subcarrier may unnecessarily waste processing power. To efficiently use the available PEs, we hereby propose a PE scheduling approach according to which MultiSphere processes the several subcarriers sequentially, starting from the one demanding the most PEs. If Eq. (4) is unfulfilled for K PEs with K < N PE we process the subcarrier using all available PEs. If, on the other hand, (4) is met, still leaving enough available PEs for more subcarriers to fulfill (4) for their corresponding MIMO channel, we can then process these subcarriers in parallel. As shown in Fig. 10 , this allows for efficient PE utilization when there can be many more PEs than subcarriers and reduces latency by several orders of magnitude compared to per-subcarrier parallelization strategies.
Approximate, Fixed-Complexity MultiSphere
MultiSphere can provide the exact ML solution at substantially reduced processing latency compared to sequential SDs. This latency, though, can significantly vary with the SNR and channel condition. Solutions like the FSD [23] or breadth-first SDs [13] , [27] , [28] provide a fixed and predetermined processing latency by sacrificing ML optimality.
In the same manner, MultiSphere can be terminated at any time instant, after each SD finds its first candidate solution.
In such a case, MultiSphere's latency can be flexibly set to any value at runtime, in contrast to traditional, approximate breadth-first approaches which can only set latency at design time. Consequently, MultiSphere allows for efficient trade-offs between error-rate performance, latency and N PE for a given transmission scenario. From this family of approximate SDs, as discussed in Section 1, FSD enjoys akin parallelization properties with MultiSphere. However, while MultiSphere can use any number of PEs, the FSD requires N PE to be an integer power of jOj. In addition, the FSD determines the tree paths to run in parallel in a pre-defined manner and cannot adjust to the transmission conditions. In Section 5.1, we compare the FSD against an approximate, minimum latency version of MultiSphere (denoted as a-MultiSphere) which only visits the N PE most promising paths (i.e., seeds) and the SD's output is the seed with the minimum euclidean distance. This suboptimal approach requires neither MultiSphere's Seeds to Subtrees Expansion (Section 3.1.2) nor MultiSphere's Tree Traversal and Enumeration (Section 3.2.1) enabling high-throughput designs with a large number of information streams. In Section 5.1 we show that for a 16 Â 16 MIMO configuration, a-MultiSphere performs similar to the FSD with one eighth of the latter's PEs, while our VLSI post-synthesis evaluation in Section 5.2 shows that this advantage translates to almost an order of magnitude higher hardware efficiency.
In soft-output systems, calculating the soft information requires calculating multiple constrained ML problems [29] and therefore, MultiSphere is still applicable. However, the complexity of traditional soft-output SD approaches [29] , [30] , [31] becomes impractical when applied to large MIMO systems and their parallelization would thus require an impractically large number of PEs. To the best of our knowledge, the only soft detection solution that is applicable to large MIMO systems is the soft-output version of the FSD (hereafter referred to as SFSD). Therefore, for soft-output systems we focus on the approximate version of MultiSphere and compare its performance against the SFSD and the recent, partial marginalization-based approach of [32] . In Sections 5.1 and 5.2 we show that in a similar 16Â16 MIMO scenario and at 0.75 code rate, the soft-output version of a-MultiSphere achieves a performance advantage that is consistent with that of the hard-output version.
MULTISPHERE: VLSI ARCHITECTURE
The primary challenge in retaining MultiSphere's algorithmic advantage in practice lies in the design of a VLSI architecture which efficiently addresses MultiSphere's tree traversal and enumeration (Section 3.2.1). Of particular challenge is first, finding the subset of at most ffiffiffiffiffiffi ffi jOj p nodes which may be visited at a specific tree level without any taxing sorting operations, and second, selecting the next sibling node within the subtree T i by avoiding sequential or high-complexity comparisons. In the following, we present the design of a VLSI architecture which efficiently tackles both challenges at a minimal area and critical path overhead. We note that our architecture serves as a proof-of-concept of MultiSphere's principles when all PEs are allocated to a single subcarrier. As we will show in our evaluation, dynamic allocation of PEs to subcarriers is suitable when many PEs are available, necessitating algorithmic co-design and tradeoff study of efficient dispatchers and interconnection networks and is thus left for future work.
Parallel MultiSphere Detection Engines
MultiSphere's processing approach maps to the singleinstruction multiple data paradigm, i.e., tree-search is performed in parallel on multiple T i s. Moreover, MultiSphere's tree partitioning (Section 3.1) allows for detection with minimal dependencies and data exchange, as during detection only the radius is exchanged, only once and only in the exact ML case. To minimize processing latency and to allow evaluation with arbitrary PEs, our framework is based on distributed memory storage and de-centralized control units. Fig. 3 depicts the overview of MultiSphere's detector as well as the parallel engines' architecture, interconnecting an arbitrary number of detection engines with minimal overhead. The R matrices and the y values are broadcast to all engines, while T i s in the form of validity matrices vm are assigned to each engine. Exact engines signify having reached the first leaf on their T i s by a distinct signal which is also used as a self-disabling latched pulse. All PD values along with their partial vectors (i.e., keys and labels respectively) are processed by comparator networks arranged in a binary tree fashion, i.e., processor arrays denoted as MinTrees (MT eng ) which in essence serve as the engines' interconnection network. The network's output i.e., the final partial vector and partial distance is either broadcast to all detection engines (exact case) or forwarded to the output (approximate case). When all exact engines reach their first leaf, they store the output corresponding to the minimum PD, the self-disabling signal is reset and they resume on the next clock cycle. In the case of instantiating many detection engines, the critical path and fanout can be reduced through input storage replication. Despite MultiSphere's concurrency, exact-ML depth-first detection is stochastic and one cannot assume that the PEs will finish in a sequential order. Thus, employing the efficient reduction circuits in [33] , [34] would increase latency, particularly in the high SNR range. In the case of the softoutput a-MultiSphere, each engine's partial vector and partial distance pair are concatenated onto a single vector and forwarded to an LLR processor (Fig. 3) which calculates the Log-likelihood ratios and whose architecture we describe in the following section.
The MultiSphere Sphere Detector Engine
Exact Multisphere. In this section, we describe the internal design of the exact MultiSphere SD engine, which is based on a generalization of the folded one-node-per-cycle architecture first defined in [8] with additional node replacement and enumeration logic, following similar principles as [8] -ASIC-II. Despite the various SD approaches [8] , [13] , [18] , [28] , [35] , [36] , [37] , this is, to the best of our knowledge, the most efficient design for depth-first SDs that can find the exact ML solution while visiting one node per clock cycle. We employ integer arithmetic and present a VLSI architecture which enables MultiSphere's efficiency via bit-parallel operations also supporting traditional decoding and is also scalable to denser constellations as our evaluation shows.
Tree-Traversal Processor (TTP)
The TTP (Fig. 4-left) selects ffiffiffiffiffiffi ffi jOj p nodes to be visited on the current tree level l and computes their PDs. To avoid dividing by R ll , (Eq. (3)), we multiply all constellation point values by R ll . The processing datapath of a-MultiSphere's branch (Fig. 4, middle) is almost identical MultiSphere's TTP excluding the node collector described below.
Low-Complexity Multiple Constant Multipliers (MCMs). Due to the large number of multiplications required for detection, designing efficient constant coefficient multipliers is critical. Our designs employ a multiplier which stores only the positive integer constellation values in a ffiffiffiffiffi
! -bit lookup table (Fig. 4-right) . In order to map constellation points, we employ binary indices that also address the lookup table while the most significant binary index bit adjusts both the multiplication result and the final lookup address (for negative constellation values.) We also considered two additional multipliers: a) in which we store positive and negative integer constellation points in a ð ffiffiffiffiffiffi ffi jOj p Á ðlog 2 ð ffiffiffiffiffiffi ffi jOj p Þ þ 1Þ-bit lookup table (full-depth, denoted as MCM-F), and b) the flexible multiplierless approach of [20] for up to 64-QAM. Due to the exact ML nature of MultiSphere, we employ two's complement arithmetic (i.e., "2sC" in Fig. 4-right) instead of negation via NOT gates as in [20] . In Section 5.2 we present a thorough efficiency evaluation of multipliers via synthesis.
Slicer. Following the computation of y Á R ll , the engine determines s ðcÞ l via its slicer. To employ the aforementioned low-complexity multipliers, slicing relies on intermediate integer boundaries scaled by R ll . We then compare the received symbol with the scaled boundary and directly map it to a constellation point index. Additionally, we detect the received symbol's relative position (i.e., left or right) to the selected closest point for subsequent employment by our "Zigzag to index Mapper". We determine the relative position via two comparators and one "2sC" module for 16-QAM (six comparators and three "2sC" modules for 64-QAM).
Subtree Node Collector (SNC). One of MultiSphere's main architectural novelties, the SNC collects at most ffiffiffiffiffiffi ffi jOj p nodes which the detector will visit on a particular level, by efficiently avoiding complicated, sorting operations which would adversely affect the critical path. Our proposed approach is based on the MT ni comparison processors (Figs. 4 and 5 ) which operate in parallel to the y Á R ll calculation and the Slicer. To define and store the T i for each engine, we consider the nodes per level arranged in ascending order of their zigzag coordinates, first by zz x and then by zz y , i.e., ð0; 0Þ ð0; 1Þ . . . ð ffiffiffiffiffiffi ffi jOj p ; ffiffiffiffiffiffi ffi jOj p Þ. In this manner, we only need a single bit to denote a node's presence in T i and therefore defining the latter in a detection engine through a validity matrix (vm) requires n t Á jOj bits. For every unique zz x , we employ a single MT ni processor which outputs the minimum valid zz y . Note that the arrangement of the MT ni processors is a function of the constellation size and thus the actual zz x s need not be processed by the MT ni network. Thus, the MT ni comparators' width is just log 2 ffiffiffiffiffiffi ffi jOj p bits. Next, we generate a binary mask of ffiffiffiffiffiffi ffi jOj p bits which signifies (via zeroes) the nodes (among the ffiffiffiffiffiffi ffi jOj p chosen) for which zz y ¼ 0. In parallel, a single, more complex MT z processor processes both coordinates to denote the minimum zz x for which zz y ¼ 0 (Fig. 4) . We then decode the index into a ffiffiffiffiffiffi ffi jOj p -bit vector and OR the result with the mask to get the final valid nodes.
Zigzag to Index Mapper (ZIM). This mapper employs the closest point index, the single-bit relative position of the received symbol (left or right) and the zigzag coordinates from the SNC to generate the indices of the nodes selected at the current level. Mapping is based on an optimized lookup table to maintain a low area and critical path overhead.
Node Enumeration Processor (NEP)
The NEP determines the next sibling and stores the current search state per tree level (i.e., the node attributes). The state normally consists of the node's partial vector s ðlÞ , its PD, and a single bit flag which verifies validity. In the case of MultiSphere, the proposed 2D enumeration (Section 3.2.1) requires additional storage of zz x ; zz y per node. The current tree-search state also involves storing the euclidean distance dðs ðlþ1Þ Þ and y Á R ll , both of which the TTP has already computed. Each storage unit, organized as a register bank, stores ffiffiffiffiffiffi ffi jOj p elements of: a) log 2 ffiffiffiffiffiffi ffi jOj p bits for the partial vectors, b) parameterized width for the PDs, c) single bits for validity and d) 2 Á log 2 jOj 2 bits for zz x ; zz y . Additionally, we employ jOj bits per level to store the level's current validity matrix vmðlÞ.
Node Storage. Our proposed storage solution adopts a parallel load/store register file. To reduce switching activity, we employ fine-grained clock enabling on all SD registers. While MultiSphere traverses down the tree, its control unit ensures through a level write signal that only a single storage unit will be active. We notice (Fig. 2) that the order of the node indices per level depends on the received symbol (i.e., we cannot directly map the node's constellation index to its location in the buffer). We thus address our storage via zz x to avoid expensive permutations.
Node Replacement Processors. Depending on zz y , MultiSphere can replace each node with up to two siblings (Section 3.2.1). To efficiently find the replacements and maintain the one-node-per-cycle processing rate, we introduce two distinct modules of similar functionality, the Horizontal Node Replacement Processor and the Vertical Node Replacement Processor (HNRP and VNRP, Fig. 5 ). They consist of the Replacement Discovery Processor (RDP) that computes the attributes of a sibling and forwards these attributes to the PD / partial vector processor. The RDP's slicer and multiple constant multiplier (MCM) recompute the closest node's indices and their relative position to the received symbol. 1 As MultiSphere explores potentially a subset of the constellation at each T i level, incrementing zz y by one i.e., a simple vertical zigzag and then sequentially checking if each resulting node is part of T i (or the constellation), is indeed one solution albeit an inefficient one. We instead employ a ffiffiffiffiffiffi ffi jOj p -bit vector (valid bits from buffer in Fig. 5 ) corresponding to the current subconstellation state from the validity matrix vmðlÞ as selected via zz x . This stored part of the array is XOR-ed with the binary decoded zz y and the result is used as: a) a validity vector for the Replacement Discovery Processor's MT ni to select the minimum among all remaining valid zz y s and b) the new validity vector which will replace the corresponding part of vmðlÞ. Once all subconstellation nodes have been visited, the new validity vector consists of zeros and the visited all nodes signal is generated. The SD will then avoid storing the VNRP's result, invalidate the buffer location corresponding to the selected subconstellation and bypass PD calculation. Similarly, the HNRP's valid bits from buffer signal denotes instead the remaining valid nodes for which zz y ¼ 0. To avoid storage conflicts, only the VNRP invalidates buffer locations. Notice that now the VNRP and HNRP node attribute outputs have to be written to the storage unit's register file. The outputs' real indices are by definition different and this is exploited to establish conflictfree storage access. The node attribute vectors generated by each replacement unit are de-multiplexed into a specific location of the ffiffiffiffiffiffi ffi jOj p -element register file and the two de-multiplexed vectors are then merged into one by an OR operation. The merged vector is then employed for storage. When the replacement is invalid, decoding of the index is disabled and nothing gets stored.
Approximate Multisphere (a-Multisphere): a-MultiSphere's fixed processing complexity allows for a pipelined parallel VLSI architecture where we consider a processing element as the fully-instantiated logic required to process a single SD path across all tree levels. a-MultiSphere's processing datapath is almost identical to that of MultiSphere's (Fig. 4) TTP except for the Subtree Node Collector. Compared with the FSD, a-MultiSphere incurs a minor overhead, mainly involving the ZIM and the additional pipeline. In the n t ¼ 16 case, a-MultiSphere's initial latency is 219 þ log 2 ðN PE Þ clock cycles, corresponding to a pipelined MT eng array.
MinTree (MT) Processor Trees: MultiSphere's VLSI architecture relies on trees of MinTree processors (Fig. 5) : a) MT ni that output the minimum valid zz y per zz x , employed also inside the RDP, b) MT z that output the minimum zz x for which zz y ¼ 0, and c) MT eng interconnecting the engines and used within each detector. Fig. 5 -right shows the architecture of each MinTree processor. Each processor outputs a valid word containing the actual value (key), the corresponding rank (label) of each key and its validity bit, based on combinational (CMB) and comparison ( < > ¼) circuits. Notice that MT eng and MT ni are almost identical apart from the unused label in the latter. In Section 5.2 we assess the complexity of each processor, its contribution in the total cost of MultiSphere and in tree arrangements.
LLR Processor. For calculating the log-likelihood ratios, instead of the MT eng processor tree, the partial vectors and distances are initially processed by a streaming parallel bitonic sorting network (i.e., SN1 in [38] ), modified to wholly or partly employ each vector as the sorting key. We chose this particular architecture as it features the lowest latency while requiring the least amount of resources [38] . Apart from the sorted partial distances, the sorting network outputs N PE partial vectors of n t Âlog 2 jOj bits, i.e., equal to the number of LLRs that need to be calculated. The partial vectors are then regrouped into n t Âlog 2 jOj vectors of N PE bits each, where the first N PE -bit vector contains the leftmost bit from each "sorted" partial vector (i.e., corresponding to the sorted partial distances) and so on. Hence, the leftmost bits (hereafter referred to as the ML bits) of the regrouped vectors correspond to the minimum partial distance. Each of these ML bits along with each regrouped N PE -bit vector are input to a Flipped Bit Index Processor which computes the index of the first bit inside this vector that is non-equal to the ML bit. This selects one partial distance out of N PE , which, along with the minimum partial distance and 1 2s 2 are used to compute the final LLR after clipping. We note that the proposed LLR processor can be flexibly (e.g., sorting network size, number of index and clipping processors) instantiated to meet specific device requirements; for the purpose of exploration, our evaluation assumes a fully parallel sorting network of up to N PE keys and an expansion of up to 16Âlog 2 ð64Þ LLRs.
PERFORMANCE EVALUATION
Here we jointly assess MultiSphere's, exact and approximate, algorithmic and VLSI architecture performance. MultiSphere's algorithmic performance is evaluated via simulations in terms of processing latency 2 and overall complexity. 3 For our architectural comparisons, we implement the software and VLSI versions of the sequential approach for which we replace the exhaustive SE enumeration with the PAM-based enumeration in [12] (hereafter referred to as our Sequential PAM SD) since it a) scales better with dense QAM constellations and b) expands a subset of ffiffiffiffiffiffi ffi jOj p nodes on each level, similarly to MultiSphere's bound (Section 3.2.1) Unless specifically stated otherwise, all euclidean distance calculations employ the exact l 2 norm. For consistency and fairness, we compare a-MultiSphere against our own flexible FSD VLSI architecture for hard and soft information-based detection. All of our VLSI architectures follow the design principles of Section 4. For our evaluations the seeds have been identified via a K-Best SD, with K ¼ N PE .
Algorithmic Performance Evaluation
We first evaluate MultiSphere's exact version in an uncoded, 16-QAM modulated 10Â10 MIMO multi-carrier system assuming sorted QR decomposition (SQRD) as in [39] . We mathematically model each sub-channel between a 1. Note that the TTP has already computed these and they can be stored at the expense of increased area requirements in large antenna setups. This would slightly reduce node replacement delay only, as it lies outside the critical path (i.e., the TTP).
2. Via the number of visited nodes, assuming that one node is visited among those examined at every time instant [5] , [8] , [18] , [20] , [37] .
3. Via the number of partial distance calculations performed, depending on the SD algorithm [5] , [8] , [12] .
transmit-receive antenna pair as a 5 tap i.i.d. Rayleigh channel (in the time domain). We also evaluate MultiSphere via actual channel traces, measured 4 in indoor conditions. For our evaluations the channel is static over the transmission of a packet, i.e., it is assumed that one packet is transmitted per channel coherence time. As a result, preprocessing (i.e., channel estimation, QR decomposition, Seeds Identification) takes place once, at the beginning of each packet.
Single-Carrier Latency and Complexity. Fig. 6 verifies that MultiSphere's ML optimality in all cases. Fig. 7 depicts MultiSphere's latency and complexity for several N PE cases in comparison with the state-of-the-art Sequential PAM and Geosphere [5] SDs, for both MoPs (M and M s , Section 3.1.1). We note that our algorithmic evaluation does not consider the latency overhead of finding and distributing the minimum r 2 across the N PE s. Fig. 7 validates that MultiSphere can consistently decrease latency when N PE increases; for N PE ¼ 16, MultiSphere reduces latency by more than an order of magnitude compared to sequential SDs. Moreover, as Fig. 7 -bottom shows, MultiSphere reduces latency without substantially increasing complexity. In particular, the overall complexity can be even smaller than that of the highly-optimized, stateof-the-art sequential SDs examined. In addition, Fig. 7 -top shows that both of the the proposed MoPs i.e., M and M s attain a very similar latency reduction performance.
Moreover, Fig. 7 -top verifies that the latency advantage of MultiSphere is consistent for both mathematically modelled channels and actual channel traces. Fig. 7 also displays the latency and complexity for N PE ¼ 32, when, instead of using the proposed method (Section 3.1.1), we always include the most promising seed and randomly choose the rest. Compared to the sequential SD, this reduces latency only by approximately 20 percent though also increases complexity by 50 percent. For the same N PE , our proposed seeds identification method reduces latency by 29Â and has a lower complexity compared to the sequential SD.
Adjusting the Employed PEs. Fig. 8 highlights the efficiency of our method in Section 3.2.2 that adjusts the number of utilized PEs (K) and therefore complexity, as a function of b, without affecting the achievable latency. We note that for b ¼ 1 all available PEs are used (K 1 ¼ N PE ), which, as shown in Fig. 8 , results in excessive overall complexity in the high SNR region, whereas adopting a very small b leads to underutilization of the available PEs and a processing latency "floor". By setting b ¼ 0:5, a good trade-off between latency, complexity and number of utilized PEs is accomplished. Then, compared to allocating all PEs (b ¼ 1), we can reduce complexity by 50 percent without a noticeable latency increase, and with the ML solution still being guaranteed. Via extensive simulations we have validated that the approach is insensitive to the exact selection of b. Setting b to 0:5AE0:1 practically leaves latency and complexity unaffected.
Multi-Carrier Performance and Scheduling. Figs. 9 and 10 compare MultiSphere to a straightforward scheme adopting per-subcarrier parallelization according to which, one exact, Channel traces for single antenna users from 10-antennae Access Points (APs) were measured separately and combined for each 10 Â 10 channel realization. To emulate simple user selection strategies, the SNR of jointly scheduled users does not differ by more than 3 dB. sequential SD processes each subcarrier. Fig. 9 displays results using actual channel traces consisting of 52 active subcarriers for the purpose of more realistic comparisons. We see that conventional, per-subcarrier parallelization, fails to efficiently reduce latency despite the large number of employed PEs (52), since the subcarrier with the highest latency determines the multi-carrier block's latency as well. On the other hand, sequentially processing the subcarriers via an 8-PE MultiSphere, results in a 3Â latency reduction compared to per-subcarrier parallelization via sequential SDs (16 dB SNR). Moreover, in the high SNR range and with a 32-PE MultiSphere we reduce latency by more than an order of magnitude. Fig. 10 displays latency when targeting exact ML detection, for a varying number of subcarriers N SC , and multiple configurations and N PE values. We show that straightforward, per-subcarrier parallelization is incapable of efficiently reducing latency while preserving ML optimality. On the other hand, Fig. 10 shows that for the same degree of parallelism (N PE ¼ N SC ) MultiSphere combined with the PE scheduling of Section 3.2.3, and b ¼ 0:5, can provide a latency reduction of more than two orders of magnitude, compared to per-subcarrier parallelization (for N SC ¼ 512). In addition, we show that MultiSphere, in combination with the proposed scheduling, is the first method that can exploit any number of PEs (e.g., N SC Â 32) and consistently reduce processing latency, while still providing the exact ML solution. Without scheduling, even if the PEs are enough for MultiSphere to reach the minimum exact SD latency (i.e., 2n t À 1 nodes), due to the sequential subcarrier processing, the overall minimum processing latency will be of the order of Oð2 Á n t Á N SC Þ.
Approximate Detection. Fig. 11 compares a-MultiSphere (Section 3.2.4) with FSD and the K-best sphere decoder [13] . For fairness, a-MultiSphere, FSD and K-best decoder employ the sorted QR decomposition tailored to the FSD [23] . For the K-best detector we find the K-best siblings via a geometrically-based enumeration [28] , [41] . Fig. 11 shows that since aMultiSphere can focus its processing power on the most promising paths to include the correct solution, it consistently outperforms FSD for the same number of PEs, both for coded and uncoded systems, and with the gains increasing when smaller error-rates are targeted. Then, depending on the SNR, and for a 16 Â 16, 64-QAM modulated system, a-MultiSphere can provide a similar error-rate performance to the FSD by utilizing less than one tenth of the PEs (384 instead of 4,096). In addition, as discussed in Section 1, FSD requires N PE to be a multiple of the order of the transmitted constellation, which makes FSD inefficient for very-dense constellations. On the contrary, a-MultiSphere can efficiently utilize any number of available PEs. Notice that for SNRs of practical interest, the K-best SD can achieve a similar error-rate as a-MultiSphere, albeit at a much higher complexity premium i.e., requires 14,400 nodes, or equivalently, 900 a-MultiSphere paths (3,592 for K ¼ 64).
MultiSphere for Soft-Output Systems. In Fig. 12 we compare the soft version of a-MultiSphere (where soft information is approximated by only exploiting the N PE seeds), to the softoutput version of the FSD (i.e., SFSD) [42] , and to the partial marginalization-based AIR-PM detector [32] for its minimum complexity, where one layer is fully expanded, and 448 paths are visited. To the best of our knowledge, these are the only approaches that can practically apply to large MIMO systems. Fig. 12 shows that, similarly to the hard-output case, the number of required PEs for a-MultiSphere to achieve the same error-rate performance with SFSD can be nearly an order of magnitude fewer. Moreover, for the same number of visited paths, MultiSphere substantially outperforms AIR-PM in terms of error-rate. In contrast to MultiSphere, AIR-PM is not as flexible, hence in order to further improve the latter's errorrate performance, an impractical number of approximately 3 Á 10 4 nodes would need to be visited.
VLSI Architectures Evaluation
In order to explore scalability, we first assess the area and delay of the exact and a-MultiSphere PEs as well as the LLR processor's for jOj 2 f16; 64g-QAM modulation and n t 2 f4; 8; 16g. We initially employ 24-bits for dðs ðlÞ Þ, 16-bits for R and 18-bits for the noise variance, in order to assess the worst case impact of scaling n t and jOj (i.e., to reach ML performance at n t ¼ 16; jOj ¼ 64). Next, we jointly evaluate area requirements, performance and dynamic power consumption based on our algorithmic results (Section 5.1) using a 16-bit datapath for 16-QAM and retaining the 24-bit dðs ðlÞ Þ for a-MultiSphere at 64-QAM. The detectors' highly modular and parametric Verilog RTL code is synthesized using the Synopsys Design Compiler and TSMC 45 nm standard cell libraries at 25 C and 0.9V. We apply the actual channel traces to simulate the gate level netlist and generate the corresponding switching activity files. We then estimate the worst-case (i.e., the channel changes with every new subcarrier) average dynamic power consumption for the SNR values of Section 5.1 using Synopsys' Power Compiler and respectively evaluate hardware and energy efficiency (denoted as H eff and E eff ) via the bps/GE and Joules/bit (J/bit) figures of merit. 5 We also compare our post-synthesis results via technology scaling 6 with the exact, and the approximate l e 1 enumeration SDs in [43] , the approximate method in [44] (which supports only up to QPSK) and the high-throughput SDs in [45] , [46] . We note that our designs can instantiate arbitrary N PEs ; hitherto presented results are only indicative due to workstation memory limitations. 7 Single-Engine Scalability. Here we show that MultiSphere's relative architectural overhead can be kept at low levels and decreases in large antenna setups (which constitute this work's main focus). We compare area requirements and maximum achievable frequency of the MultiSphere, Sequential PAM SD, a-MultiSphere and FSD PEs. Post-synthesis results in Fig. 13 show that the exact and a-MultiSphere's respective gate count overhead is reasonable at 26 and 10 percent for n t ¼ 4, reducing to 13.9 and to 7.3 percent in the n t ¼ 16, 16-QAM case (45.5, 18.9 and 20, 11 percent respectively for 64-QAM). Increasing n t to 8 from 4 roughly doubles the exact architectures' area which then becomes 2.3Â larger for n t ¼ 16 (in the approximate case the average factor is 3.5Â). Frequency-wise, the Sequential PAM SD's advantage is less than 4 percent at 16-QAM for n t ¼ 8 (i.e., 389 v. 377 MHz) and less than 8 percent at 64-QAM (345 v. 322 MHz), while it is diminishing for larger n t s. Similarly, a-MultiSphere achieves 1.176 GHz (jOj ¼ 16) and 1 GHz (jOj ¼ 64) for n t ¼ 8 (the FSD respectively achieving 1.250 and 1.030 GHz). Fig. 13 -right shows that setting N PE ¼ jOj and for 16-QAM modulation the sorting network accounts for up to 46 percent of the LLR processor's area (n t ¼ 4). Increasing n t to 16, expands the sorting network's gate count by up to 2.14Â due to the increase in the partial vector width and also increases the size of the LLR sub-processing arrays (Fig. 3) , which account for up to 30.3 percent of the total LLR processor's gates (i.e., the flipped bit index processors). When jOj ¼ N PE ¼ 64, the sorting network dominates the total gate count (i.e., 63.5 up to 72.4 percent for n t ¼ 16 and n t ¼ 4 respectively). In all of the displayed sorting network cases and due to pipelining, N PE does not affect the critical path as much as n t does; thus the LLR processor achieves 1.25 GHz for n t ¼ 4 and up to 833 MHz for n t ¼ 16.
Multi-Engine Scalability and Detection Performance. For N PE 2f8; 16; 32g at 16-QAM, the average maximum (i.e., at 16 dB SNR) algorithmic speedup of exact detection based on the number of visited nodes is 4Â up to 36Â (Fig. 7-top) . MultiSphere's VLSI post-synthesis processing throughput speedup for N PE ¼ 32 is close to the algorithmic speedup, i.e., approximately 29Â against the Sequential PAM SD and 30Â against a single MultiSphere processing element. Table 1 displays the maximum energy efficiency (E eff ) involving dynamic power consumption and the area required per exact and approximate PE at the maximum achievable frequency. Based on the above, we can configure the parallel engines for N PE 2f8; 16; 32g to respectively operate at 96.15, 20.45 and 10.68 MHz for which MultiSphere's dynamic power consumption decreases up to 27.4Â (i.e., to 1.09, 0.26 and 0.11 mW per engine for N PE 2f8; 16; 32g respectively at 25 C and 0.9 V). Additional power consumption savings can be achieved through voltage scaling. Note that even though the exact SD in [43] has a lower area footprint due to a PSK-based enumeration, the proposed architectures achieve higher clock frequencies (Table 1) .
In multi-carrier detection (Figs. 9, 10 and 14-left), MultiSphere is significantly more efficient than conventional parallelization via sequential SDs, and its efficiency increases with N PE . When processing a frame with N SC ¼ 52, a single MultiSphere VLSI detector can achieve a speedup of 2.5Â ( SDs of [43] . Thus, MultiSphere's hardware efficiency is higher than all efficient sequential architectures: 6Â (N PE ¼ 8, 10 dB SNR) up to 25Â (N PE ¼ 32, 16 dB SNR) against our Sequential PAM and even 3Â up to 13Â against the very low complexity l e 1 SD of [43] . Figs. 14-left and 10 show that further increasing N SC Fig. 13 . Detection engine scalability and overhead with respect to n t and jOj: area (GE) breakdown at maximum frequency for MultiSphere, the Sequential PAM SD, a-MultiSphere and FSD (left plots). The right plots depict the LLR processor's scalability and overhead when setting N PE ¼ jOj. T . 7. We finally note that input to the detection engines is assumed to be managed externally in line with the literature [8] , [13] , [18] , [21] , [28] , [43] , [44] and is beyond the scope of this work. also increases total latency in all cases. Still, for N SC ¼ 512, a single MultiSphere detector with N PE ¼ 32 at 16 dB SNR maintains approximately constant efficiency, at 280 pJ/bit and 124 bps/GE, while all sequential SDs are up to two orders of magnitude less efficient. Even the very low area footprint, approximate l e 1 SD of [43] , achieves only 1.02 bps/GE in this case. 8 Note that aforementioned results do not take into account the additional logic which would be required in order to distribute/collect the multiple subcarriers to/from the sequential SDs and which would procure an even more favourable result for MultiSphere. Fig. 14 -right compares a-MultiSphere's and the FSD's energy and hardware efficiency at 500 MHz (well-below the frequency of Table 1 in order to allow performance projections for large N PE values) for N PE 2f1; . . . ; 128g in the case where the FSD expands two levels (i.e., 4,096 paths). Requiring just 512 paths to reach the same ML-approaching error rate (Fig. 11) , a-MultiSphere achieves 8.72Â higher energy efficiency and 9.63Â higher hardware efficiency when taking into account the pipelined minimum search unit. The soft-output a-MultiSphere can have up to 7.04Â higher area efficiency and 6.76Â higher energy efficiency (for 4,096 v. 512 paths as in Fig. 12) . A-MultiSphere's energy efficiency advantage can decidedly escalate when assuming a statically instantiated LLR processor capable to process the required number of paths in parallel (Fig. 14) . Against the recent state-of-the-art, such as the 16Â16 MIMO in [44] (only supporting up to QPSK), our 16-PE a-MultiSphere is an order of magnitude more hardware efficient and 6Â more energy efficient. A 12-PE a-MultiSphere architecture attains a similar error-rate performance to the K-best SD in [45] (K ¼ 10), yet achieves 4Â higher hardware and 12Â higher energy efficiency. Finally, a 24-PE a-MultiSphere is of similar error-rate to the non-constant (K 2½1; 16) K-best SD in [46] and slightly more hardware but less energy efficient. We note though that we target a flexible proof-of-concept using complex enumeration, not a specifically optimized case; we also assume that the channel changes with every subcarrier while [46] does not detail power estimation assumptions. Complexity Assessment-MCMs and MT Processors. To provide a clearer perspective to the reader, we conduct a complexity assessment of MultiSphere's main arithmetic modules i.e., the MCMs and the MT processors. Due to the folded design and its exact nature, MultiSphere requires fewer MCMs but more MT processors (Table 2) . a-MultiSphere on the other hand features more computationally intensive yet simpler operations. We also assess the efficiency of the proposed MCMs against those in [20] via the area-delay and energy-delay products assuming 16-bit input for all cases. Note that [20] defines a single flexible MCM for jOj 2 f16; 64g. For fairness, we employ distinct, optimized versions per modulation. By "2sC" and "neg" we respectively distinguish between two's complement and negation units (Fig. 4) . Results show that for hierarchy-preserving synthesis, [20] with two's complement is slightly more efficient at 16-QAM. At 64-QAM both of the proposed solutions are more efficient even against the simple negation of [20] . Notice that when the synthesis tool aggressively optimizes the design (retiming strategy), the proposed MCMs are more efficient in all cases. We chose the MCM-H due to the exploration scope of the paper, in line with hierarchy-preserving synthesis. Regarding the MT processors, MT z is the most complex, but intentionally also the one least employed. Notice (Section 4) that the tree size inside the detectors has Oð ffiffiffiffiffiffi ffi jOj p Þ complexity, while the interconnection where MT eng is employed has OðN PE Þ complexity. Utilization of the MT processors in trees displays a close to linear behavior while the area and power of the tree are almost negligible compared to that of the rest of the PE (i.e., 3.6 and 3.1 percent of total respective area and power at 100 percent utilization for a tree of 64 MT eng processors and N PE ¼ 1 for a-MultiSphere). We note that the MT results for Table 2 employ i/o registers in every processor and thus more closely reflect the a-MultiSphere case. MultiSphere's MTs exhibit a very similar relative cost, though a single processor can have up to 79 percent lower area compared to Table 2. MultiSphere's critical path lies within the TTP and even a 512 MT eng interconnection achieves below 3.26 ns delay. Moreover, the MT eng tree attributes a small fraction to MultiSphere's dynamic power consumption i.e., 0.85 mW for N PE ¼ 32 at 16 dB. Note that the LLR processor employs generic multipliers as the ones used for l 2 norm calculation; thus soft-a-MultiSphere retains the same MCM count. The additional processors are attributed to a) the sorting network (i.e., 
Calculated as
CONCLUSIONS
This work proposes MultiSphere, the first method to consistently and massively parallelize large sphere decoders, and consequently the fundamental ML detection problem, in a nearly-"embarrassingly" parallel manner, while accounting for the transmission channel. Joint algorithmic/VLSI evaluation shows that MultiSphere is the first approach able to substantially and consistently reduce latency at a small complexity overhead. Our efficient VLSI architecture performs close to the algorithmic bound and in multi-carrier detection is up to two orders of magnitude more efficient than conventional parallelization employing the most efficient SDs in the literature. Moreover, a-MultiSphere's algorithmic performance enables our flexible VLSI framework to be up to an order of magnitude more efficient than highly optimized, state-of-the-art approaches. Besides large MIMO systems, MultiSphere enables the practical realization of a plethora of theoretical concepts the implementation of which is considered impractical. Such concepts include aggressive nonorthogonal multiple access (NOMA) schemes [47] , [48] , as well as "Faster than Nyquist" transmissions, including the promising Spectrally Efficient Frequency Division Multiplexing (SE-FDM) scheme [49] , [50] .
APPENDIX MULTISPHERE'S METRICS OF PROMISE (MOPS)
Here we first calculate an MoP that approximates the probability of an SD path to constitute the correct solution s ðtÞ . Then, a simplified MoP is given that does not require the prior knowledge of the noise variance s 2 but, as shown in Section 5, it is equally efficient with the original MoP in terms of its ability to reduce sphere decoding processing latency.
As described in Section 3.1, each tree path can be described by its relative position vector (RPV) m, with elements m l . Denoting as x m the symbol vector related to the path m, and with its lth element being x m l , our target is to find an MoP that approximates the probability P x m ¼ s ðtÞ Â Ã . Using Bayes' chain rule, this probability can be expressed as
andP n t m n t Â Ã ¼ P x mn t ¼ s ðtÞ n t h i :
We first calculate the probabilityP n t m n t Â Ã for the highest SD tree level. This equals the probability that s ðtÞ n t is the symbol with the m n t th smallest PD, or equivalently it is the m n t th closest QAM constellation symbol to the equivalent received observable y n t (see Eq. (3)), y n t ¼ s ðtÞ n t þ w n t ;
where w n t represents the equivalent additive white Gaussian noise of variance s 2 n t ¼ s 2 = R n t n t 2 . CalculatingP n t m n t Â Ã is a non-trivial task that would require complicated integrations with no obvious closed-form solutions. In order to simplify this task, the corresponding probability is approximated by using the pre-calculated minimum distance values that have been used in Section 3. 
with a depending on the minimum distance between the QAM constellation symbols. For a minimum constellation distance of two (where each QAM symbol dimension can take the values AE1; AE3; . . .) a value of a ¼ 1:11 is chosen, which minimizes the mean-squared-difference between d ðsÞ min ðkÞ and D min ðkÞ for a 64-QAM constellation. Since, in contrast to d ðsÞ min ðkÞ, D min ðkÞ is a strictly increasing function of the parameter k, we can approximate the probabilityP n t m n t Â Ã as P n t m n t Â Ã % P D min ðm n t Þ w n t < D min ðm n t þ 1Þ Â Ã :
Since the norm of the noise is Rayleigh distributed, and by applying the approximation in (9), the above probability can be easily calculated as P n t m n t Â Ã % e Fig. 16 shows the simulated and the analytically approximatedP n t for an inner QAM constellation symbol and verifies the validity of the proposed approximation. After approximatingP n t m n t Â Ã in (7), the probabilityP l m l ½ in (5) needs to be calculated to get P x m ¼ s ðtÞ Â Ã . For all the SD tree layers l with l < n t , the received observable after the QR decomposition is e y l ¼ X n t j¼lþ1 R lj s ðtÞ j þ R ll s ðtÞ l þ w l :
Thus, under the assumption imposed by (5) that x mq ¼ s ðtÞ q for all tree levels higher than l (i.e, that the corresponding symbols belong to the correct vector solution), the received observable at any level l, for any path, can be expressed as 
asymptotically, for large O j j, tending to the value of one as it should for the total probability. We note that the parameter a can, in general, differ with l, since different QAM modulation can be used per transmit antenna. Since the logarithmic function is monotonic, finding the most promising paths is equivalent to finding the paths for which the logarithm of their probability P x m ¼ s 
with M m ð Þ % Àln P x m ¼ s
The MoP of (15) 
and therefore, an approximate MoP can be defined as 
From (18) it can be easily seen that finding the paths with the smallest MoPs, does not really require the knowledge of s 2 , or even of the parameter a when the same QAM constellation is used from all transmit antennae. Therefore, the following simplified MoP can be used instead
Georgios Georgis received the BSc degree in physics from the Aristotle University of Thessaloniki, and the MSc and PhD degrees in computing from the University of Athens, Greece. His research interests include the design of parallel algorithms and architectures for real-time multi-dimensional signal processing, artificial intelligence and expert systems. He is currently a research fellow in the 5G Innovation Centre, University of Surrey in Guildford, United Kingdom. He is a member of the IEEE.
Chathura Jayawardena received the BEng degree in electronic engineering and the MSc degree in mobile communications from the University of Surrey, Guildford, United Kingdom, in 2014 and 2015, respectively. He is currently working toward the PhD degree in electronic engineering at Institute for Communication Systems, University of Surrey. His research interests include signal processing for communications, with an emphasis on detection methods for non-orthogonal transmission schemes. He is a student member of the IEEE.
Daniil Chatzipanagiotis received the BEng degree in electronic engineering and the MSc degree in mobile communications from the University of Surrey, Guildford, United Kingdom, in 2015 and 2016, respectively. His research interests include signal processing for communications, with an emphasis on detection methods for non-orthogonal transmission schemes.
Rahim Tafazolli is the professor of Mobile and Satellite Communications since April 2000, director of ICS since January 2010 and the founder and director of 5G Innovation Centre, University of Surrey, United Kingdom. He has more than 25 years of experience in digital communications research and teaching. He has authored and coauthored more than 500 research publications. He is regularly invited to deliver keynote talks and distinguished lectures to International conferences and workshops. He is co-inventor on more than 30 granted patents, all in the field of digital communications. He is regularly invited by many governments for advice on 5G technologies. He was advisor to the Mayor of London in regard to the London Infrastructure Investment 2050 Plan during May and June 2014. He has given many interviews to International media in the form of television, radio interviews and articles in international press. In 2011, he was appointed as fellow of Wireless World Research Forum (WWRF) in recognition of his personal contributions to the wireless world as well as heading one of Europes leading research groups. He is a senior member of the IEEE.
