The Large Intelligent Surface (LIS) concept has recently emerged as a new paradigm for wireless communication, remote sensing and positioning. Despite its potential, it poses many implementation challenges, the most relevant being the interconnection data-rate and the computational complexity. Distributed processing techniques and hierarchical architectures are expected to play a vital role in addressing them. In this paper we perform algorithm-architecture codesign and analyze the hardware requirements and architecture trade-offs for a discrete LIS performing uplink detection. By doing so, we aim to provide concrete case studies and guidelines for the efficient implementation of LIS systems.
I. INTRODUCTION
The LIS concept has the potential to revolutionize wireless communication, wireless charging and remote sensing [1]-[4] through the use of electromagnetically active man-made surfaces. In Fig. 1 we show the concept of a LIS serving three users simultaneously. A LIS consists of a continuous radiating surface placed relatively close to the users. Each part of the surface is able to receive and transmit electromagnetic (EM) waves with a certain degree of control, so the EM waves can be focused in 3D space with high resolution, creating a new world of possibilities for power-efficient communication. As pointed out in [1], there is no practical difference between a continuous LIS and a grid of antennas (discrete LIS) as the surface area grows, provided that the antenna spacing is sufficiently dense. Based on this, and for practical reasons, we study a discrete version of a LIS throughout the rest of this paper.
There are important challenges from an implementation point of view. The large number of antennas in the LIS produces a huge baseband data-rate, which needs to be routed to the Central Digital Signal Processor (CDSP) through the backplane network. As an example, a 2 m × 20 m LIS contains ∼28,500 antennas in the 4 GHz band (assuming half-wavelength spacing), each with its corresponding radio frequency (RF) and analog-to-digital converter (ADC) blocks. If each ADC uses 8 bits for each of I and Q, the total baseband data-rate is 45.5 Tbps. This is orders of magnitude higher than in the massive MIMO counterpart, where this issue has been analyzed [5]-[9]. LIS is fundamentally different from massive MIMO due to the potentially very large physical size of the surface and the amount of data to be handled, which requires specific processing, resources and performance analysis. Preliminary works [10], [11] address the distributed processing issue with high-level architecture and performance analysis, but they do not evaluate the required cost. To the best of our knowledge, no publication analyzes the processing distribution, the performance and the corresponding cost together for LIS.
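The figures in this example can be reproduced with a short calculation. The sketch below is only illustrative: the text does not state the ADC sampling rate, so the value of 100 MS/s is an assumption chosen because it yields the quoted data-rate, and the function names are hypothetical.

```python
C = 3.0e8  # speed of light in m/s (rounded)

def antenna_count(height_m, width_m, carrier_hz):
    """Antennas on a half-wavelength grid, approximated as an area ratio."""
    spacing = C / carrier_hz / 2.0  # half wavelength in metres
    return (height_m / spacing) * (width_m / spacing)

def baseband_rate_bps(n_antennas, bits_per_component, sample_rate_hz):
    """Aggregate ADC output: one I and one Q word per antenna per sample."""
    return n_antennas * 2 * bits_per_component * sample_rate_hz

n_ant = antenna_count(2.0, 20.0, 4.0e9)
# ASSUMPTION: 100 MS/s per ADC; this value is not given in the text.
rate = baseband_rate_bps(n_ant, 8, 100e6)
print(f"antennas ~ {n_ant:.0f}, rate ~ {rate / 1e12:.1f} Tbps")
```

Under these assumptions the area-ratio count is slightly below the rounded ∼28,500 quoted above, and the resulting rate is consistent with 45.5 Tbps.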
In this paper, we propose to tackle those challenges by leveraging algorithm and architecture co-design. At the algorithm level, we exploit the unique features of LIS (e.g., its very large aperture) to develop uplink detection algorithms that allow the processing to be performed locally and distributed over the surface. This significantly relaxes the interconnection bandwidth requirement. At the hardware architecture level, we propose to panelize the LIS to simplify manufacturing and installation. A hierarchical interconnection topology is developed accordingly to provide efficient and flexible data exchange between panels. Based on the proposed algorithms and architecture, we perform an extensive analysis of the trade-offs between system capacity, interconnection bandwidth, computational complexity, and processing latency. This provides high-level design guidelines for the real implementation of LIS systems.
II. LARGE INTELLIGENT SURFACES
In this article we consider a LIS for communication purposes only. Due to the large aperture of the LIS, the users are generally located in the near field. As a consequence, the LIS can harvest up to 50% of a user's transmitted power. This is one of the fundamental differences with respect to current 5G massive MIMO. One implication is that the transmitted power in uplink and downlink is much lower than in traditional systems, opening the door to extensive use of low-cost, low-power analog components. Another important characteristic of LIS is that users are not seen by the entire surface, as shown in Fig. 1. This can be exploited through localized digital signal processing, which calls for a uniform distribution of computational resources and reduces the interconnection bandwidth without significantly sacrificing system capacity.
A. System Model
We consider the transmission from $K$ single-antenna users to a LIS with total area $A$ containing $M$ antenna elements. We assume the antennas are evenly distributed with half-wavelength spacing. The $M \times 1$ received vector at the LIS is given by

$$\mathbf{y} = \sqrt{\rho}\,\mathbf{H}\mathbf{x} + \mathbf{n},$$

where $\mathbf{x}$ is the $K \times 1$ user data vector, $\mathbf{H}$ is the $M \times K$ normalized channel matrix such that $\|\mathbf{H}\|^2 = MK$, $\rho$ is the SNR and $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$ is an $M \times 1$ noise vector. User $k$ is located at $(x_k, y_k, z_k)$, and the LIS lies in the plane $z = 0$. The channel between this user and a LIS antenna at location $(x, y, 0)$ is given by the complex value [1]

$$h_k(x, y) = \frac{\sqrt{z_k}}{2\sqrt{\pi}\, d_k^{3/2}} \exp\!\left(-\frac{2\pi j\, d_k}{\lambda}\right),$$

where $d_k = \sqrt{z_k^2 + (x_k - x)^2 + (y_k - y)^2}$ is the distance between the user and the antenna, and Line of Sight (LOS) between them is assumed. $\lambda$ is the wavelength.
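As an illustration of the system model, the following sketch builds a small LOS channel matrix for one patch of antennas. It assumes the channel expression of [1] in the form $h_k(x, y) = \sqrt{z_k}\,\exp(-2\pi j d_k/\lambda) / (2\sqrt{\pi}\, d_k^{3/2})$ and interprets the normalization as a squared Frobenius norm equal to $MK$. The user positions, the grid size and all function names are hypothetical.

```python
import cmath
import math

def los_channel(user, antenna_xy, wavelength):
    """LOS channel between a user at (xk, yk, zk) and an antenna at (x, y, 0)."""
    xk, yk, zk = user
    x, y = antenna_xy
    d = math.sqrt(zk**2 + (xk - x)**2 + (yk - y)**2)
    amplitude = math.sqrt(zk) / (2.0 * math.sqrt(math.pi) * d**1.5)
    return amplitude * cmath.exp(-2j * math.pi * d / wavelength)

def antenna_grid(nx, ny, wavelength):
    """Antenna positions on a half-wavelength grid starting at the origin."""
    s = wavelength / 2.0
    return [(ix * s, iy * s) for iy in range(ny) for ix in range(nx)]

def normalize(H):
    """Scale H so that its squared Frobenius norm equals M*K (assumed norm)."""
    m, k = len(H), len(H[0])
    energy = sum(abs(h) ** 2 for row in H for h in row)
    scale = math.sqrt(m * k / energy)
    return [[scale * h for h in row] for row in H]

wavelength = 0.075  # metres, i.e. a 4 GHz carrier
antennas = antenna_grid(4, 4, wavelength)   # M = 16 (hypothetical patch)
users = [(0.1, 0.1, 1.0), (0.1, 0.1, 2.0)]  # K = 2 (hypothetical positions)
H = [[los_channel(u, a, wavelength) for u in users] for a in antennas]
H_norm = normalize(H)
```

Each row of `H` corresponds to one antenna and each column to one user, matching the $M \times K$ convention of the text.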
B. Panelized Implementation of LIS
An overview of the processing distribution and interconnection in a LIS is shown in Fig. 2. As can be seen, we propose to divide the LIS into units connected by backplane interconnections. We use the term panel to refer to each of these units. Each panel contains a certain number of antennas (and transceiver chains). A processing unit, named the Local Digital Signal Processor (LDSP), is in charge of the baseband signal processing of a panel. LDSPs are connected via the backplane interconnection network to a Central DSP (CDSP), which is linked to the backbone network. In the backplane network, Processing Switching Units (PSUs) perform data aggregation, distribution, and processing at different levels.
Based on this general LIS implementation framework, the number of panels $P$, the panel area $A_p$, the number of antennas per panel $M_p$, the algorithms to be executed in the LDSP and CDSP, and the backplane topology are the important design parameters we investigate in this paper.
III. UPLINK DETECTION ALGORITHMS
The LIS performs a linear filtering

$$\hat{\mathbf{x}} = \mathbf{W}\mathbf{y}$$

of the incoming signal to the panels, where $\mathbf{W}$ is the $K \times M$ equalization-filter matrix and $\hat{\mathbf{x}}$ is the estimated value of $\mathbf{x}$.
A. Reduced Matched Filter (RMF)
The Reduced Matched Filter [11] is a reduced-complexity version of the full MF in which the $i$-th panel builds its filtering matrix from the $N_p$ users it receives most strongly, according to their respective CSI, that is

$$\mathbf{W}_{\mathrm{RMF},i} = \left[\mathbf{h}_{k_1}, \mathbf{h}_{k_2}, \dots, \mathbf{h}_{k_{N_p}}\right]^H,$$

where $\mathbf{W}_{\mathrm{RMF},i}$ is the $N_p \times M_p$ filtering matrix of the $i$-th panel, $\mathbf{h}_n$ is the $M_p \times 1$ channel vector for the $n$-th user, and $\{k_i\}$ represents the set of indices of the $N_p$ strongest users.
The corresponding strength of user $n$ is defined as $\|\mathbf{h}_n\|^2$.
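The user selection and filtering performed by one panel can be sketched as follows. This is a minimal illustration of the RMF description above, not the authors' implementation: the function names and the toy channel values are hypothetical, and the filter rows are taken as the conjugate transposes of the selected channel vectors.

```python
def rmf_filter(H_i, n_p):
    """Build the N_p x M_p RMF matrix of one panel from its M_p x K channel."""
    m_p, k = len(H_i), len(H_i[0])
    # Strength of user n on this panel: squared norm of its channel column.
    strength = [sum(abs(H_i[m][n]) ** 2 for m in range(m_p)) for n in range(k)]
    strongest = sorted(range(k), key=lambda n: strength[n], reverse=True)[:n_p]
    # Each row is the conjugate transpose of one selected channel vector.
    W = [[H_i[m][n].conjugate() for m in range(m_p)] for n in strongest]
    return W, strongest

def apply_filter(W, y):
    """Reduce the M_p x 1 input y to an N_p x 1 output."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]

# Toy panel with M_p = 2 antennas and K = 3 users (hypothetical values).
H_i = [[1 + 0j, 0.1j, 2 + 0j],
       [1j, 0.2 + 0j, 2j]]
W, selected = rmf_filter(H_i, n_p=2)
y = [H_i[0][2], H_i[1][2]]  # noiseless signal received from user 2 only
out = apply_filter(W, y)
```

In this toy case user 2 is the strongest, followed by user 0, so the panel forwards two streams instead of one sample per antenna.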
B. Iterative Interference Cancellation (IIC)
IIC is an algorithm that allows panels to exchange information in order to cancel inter-user interference. A detailed description of the algorithm can be found in [11], and the pseudocode for the processing at the $i$-th panel is given in Algorithm 1, where $\mathbf{H}_i$ is the $M_p \times K$ local channel state information (CSI) and $\mathbf{W}_i$ the local filtering matrix. $\mathbf{U}_z$ and $\mathbf{\Sigma}_z$ are the left unitary matrix and the singular values of $\mathbf{Z}_{i-1}$, respectively. $\mathbf{U}_{eq}$ is the left unitary matrix of $\mathbf{H}_{eq}$, and $\mathbf{W}_i$ is formed from the eigenvectors associated with the $N_p$ strongest singular values. Each iteration of the algorithm is performed in a different panel. The matrix $\mathbf{Z}$ is passed from one panel to the next over dedicated links.
Algorithm 1: IIC algorithm steps for i-th panel
IV. LOCAL DSP AND HIERARCHICAL INTERCONNECTION
In this section, we describe the LDSP and backplane architecture that supports both the RMF and IIC algorithms. We assume the OFDM-based 5G New Radio (NR) frame structure and consider uplink detection only.
A. Local DSP in each Panel
The architecture of the LDSP is depicted in Fig. 3a. After the RF and ADC stages, FFT blocks perform the time-to-frequency domain transformation. The processing of the uplink signal is divided into two phases: formulation and filtering. During the formulation phase, the Channel Estimation (CE) block estimates a new $\mathbf{H}_i$ for each channel coherence interval. In this paper we assume perfect channel estimation. The Filter Coefficient calculation (FC) block receives $\mathbf{H}_i$ and computes the filtering matrix $\mathbf{W}_i$. FC performs a complex conjugate transpose in the case of RMF and executes Algorithm 1 in the case of IIC. $\mathbf{W}_i$ is then written to memory. During the filtering phase, the Filters block reads $\mathbf{W}_i$ and applies it to the incoming data. The Filters block reduces the $M_p \times 1$ input to an $N_p \times 1$ output ($N_p \ll M_p$), which is sent to the backplane for further processing.
B. Hierarchical Backplane Interconnection
To reduce the required interconnection bandwidth, a hierarchical backplane topology is developed to fully exploit the data locality of the proposed algorithms. As shown in Fig. 3a, the backplane is divided into local direct panel-to-panel links (marked in blue) and a global interconnection (marked in red and described in detail in the next subsection). The local link is dedicated to low-latency data exchange between two neighboring panels, e.g., the transfer of $\mathbf{Z}_{i-1}$ in the IIC algorithm. The global interconnection aggregates the $N_p \times 1$ filtering result from each panel and delivers it to the CDSP for the final decision.
C. Tree-based Global Interconnection and Processing
For the global interconnection, we propose a tree topology with distributed processing to minimize latency (the latency grows logarithmically with the number of panels), as shown in Fig. 3b. There are several levels of processing switching units (PSUs) in the tree to aggregate and/or combine the panel outputs. These hierarchical PSUs can reduce the overall bandwidth requirement of the backplane. Fig. 3b also shows the detailed block diagram of a PSU. It is flexible enough to support both RMF and IIC, and can be extended to other algorithms. The combination and bypass functionalities are used in RMF, while for IIC the streams are bypassed to the CDSP for the final decision.
V. IMPLEMENTATION COST AND SIMULATION RESULTS
In this section, we analyze the implementation cost of the proposed uplink detection algorithms on the corresponding architecture in terms of computational complexity, interconnection bandwidth, and processing latency. The trade-offs between system capacity and implementation cost are then presented to give high-level design guidelines. For convenience, we summarize the system parameters in Table I.
A. Computational Complexity
In Table II, we summarize the computational complexity required by the RMF and IIC algorithms. The complexity includes both the formulation phase and the filtering phase and is normalized to the panel area $A_p$. In the filtering phase, the operations are the same for RMF and IIC: a linear filter of size $N_p \times M_p$ is applied to the $M_p \times 1$ input vector.
The formulation phase of RMF includes the computation of $\|\mathbf{h}\|^2$ for each user. For the IIC algorithm, the steps required in the formulation phase are shown in Algorithm 1. Step 1, which consists of a singular value decomposition (SVD) of the $K \times K$ Gramian matrix $\mathbf{Z}_{i-1}$, has a complexity of $17K^3$ [12].
Step 2 has a complexity of $(M_p + 1)K^2$, step 3 requires $4M_p^2 K + 13K^3$, and steps 4 and 5 need $M_p K N_p + N_p K^2$. The resulting complexity is summarized in Table II.
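The per-step counts quoted above can be added up to get a feel for which step dominates. The sketch below simply evaluates those expressions for one panel; it does not apply the normalization to panel area used in Table II, and the function name and the example values of $M_p$ and $N_p$ are hypothetical ($K = 50$ matches the simulation scenario).

```python
def iic_formulation_complexity(k, m_p, n_p):
    """Per-panel operation counts of the IIC formulation phase, step by step."""
    steps = {
        "step1_svd_gramian": 17 * k**3,
        "step2": (m_p + 1) * k**2,
        "step3": 4 * m_p**2 * k + 13 * k**3,
        "steps4_5": m_p * k * n_p + n_p * k**2,
    }
    steps["total"] = sum(steps.values())
    return steps

# K = 50 users; M_p = 64 and N_p = 8 are illustrative values only.
counts = iic_formulation_complexity(k=50, m_p=64, n_p=8)
for name, value in counts.items():
    print(f"{name}: {value}")
```

For these example values the two SVD-related steps (1 and 3) account for most of the total, since both contain a term cubic in $K$.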
C. Processing Latency
The processing latency of the filtering phase can be formulated as $L_{\text{filtering}} = T_{\text{Filter}} + \log_4(P)\, T_{\text{PSU}}$, where $T_{\text{Filter}}$ is the time needed to perform the linear filtering and $T_{\text{PSU}}$ represents the PSU processing time together with the PSU-to-PSU communication time.
The latency of the formulation phase differs for RMF and IIC. For RMF, the formulation phase is carried out in parallel in all the panels. The corresponding latency $L_{\text{form,RMF}}$ depends on the computational complexity $C_{\text{form,RMF}}$, the clock frequency, and the available parallelism in the computation. The latency for IIC, on the other hand, includes both computation and panel-to-panel communication. The worst case is $L_{\text{form,IIC}} = P\, T_{\text{compute,IIC}} + (P - 1)\, T_{\text{panel-panel}}$, where $T_{\text{compute,IIC}}$ is the time for computing the filter coefficients and $T_{\text{panel-panel}}$ is the transmission latency between two consecutive panels.
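The two latency expressions above scale very differently with the number of panels, which a small numerical sketch makes visible. The function names and all timing values below are hypothetical placeholders, not measurements from the paper.

```python
import math

def filtering_latency(t_filter, t_psu, n_panels):
    """L_filtering = T_Filter + log4(P) * T_PSU."""
    tree_levels = math.log2(n_panels) / 2.0  # log4(P) = log2(P) / 2
    return t_filter + tree_levels * t_psu

def iic_formulation_latency(t_compute, t_link, n_panels):
    """Worst case: L_form,IIC = P * T_compute,IIC + (P - 1) * T_panel-panel."""
    return n_panels * t_compute + (n_panels - 1) * t_link

# Hypothetical timings in seconds, for illustration only.
P = 64
l_filt = filtering_latency(t_filter=1e-6, t_psu=0.5e-6, n_panels=P)
l_iic = iic_formulation_latency(t_compute=2e-6, t_link=0.1e-6, n_panels=P)
print(f"filtering: {l_filt * 1e6:.1f} us, IIC formulation: {l_iic * 1e6:.1f} us")
```

The filtering latency grows with the depth of the tree, while the worst-case IIC formulation latency grows linearly with $P$ because the panels are visited one after another.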
Fig. 5: Top view of the simulation scenario.
D. Results and Trade-offs
The simulation scenario is shown in Fig. 5. Fifty users are uniformly distributed in a 20 m × 40 m area in front of a 2.25 m × 22.5 m (height × width) LIS.
The average sum-rate capacity at the interface between the panels and the processing tree is shown in Fig. 4 for both algorithms. The figures show the trade-off between computational complexity ($C_{\text{filt}}$ on the vertical axis) and interconnection bandwidth ($R_{\text{global}}$ on the horizontal axis). Dashed lines connect points with constant panel size $A_p$, which is another design parameter for LIS implementation. To illustrate the trade-off, we have marked points A, B, and C in the figures, representing three different design choices for a target performance of 610 bps/Hz. Comparing the same points in both figures shows that IIC reduces both complexity and interconnection bandwidth relative to RMF. We can also observe that small panels (e.g., point C compared to point A) demand lower computational complexity at the expense of higher backplane bandwidth. Once $A_p$ is fixed, system capacity can be traded against implementation cost (computational complexity and interconnection data-rate) according to the application requirements.
VI. CONCLUSIONS
In this article we have presented distributed processing algorithms and the corresponding hardware architecture for the efficient implementation of large intelligent surfaces (LIS). The proposed processing structure consists of local panel processing units, which compress the incoming data without losing much information, and a hierarchical backplane network with distributed processing switching units, which supports flexible and efficient data aggregation. We have systematically analyzed the system capacity and implementation cost for different design parameters and provided design guidelines for the implementation of LIS.
