Ultra-Reliable Low Latency Communication (URLLC) applications have been proposed in recent years, targeting a round-trip end-to-end latency of less than 1 ms with high reliability. Therefore, order-of-magnitude improvements are needed in all layers of the wireless communication stack. This is a particular challenge for the physical layer, where a processing time on the order of microseconds is typically required for the computationally intensive demodulation and error correction processing, among other operations. Conventionally, the reception of signals, the demodulation processing and the error correction processing are performed consecutively at the receiver. However, this approach is associated with processing times on the order of hundreds of microseconds, preventing URLLC. Therefore, this paper proposes a novel processing architecture, which is capable of performing reception, Orthogonal Frequency Division Multiplexing (OFDM) demodulation and turbo decoding concurrently, rather than consecutively, hence significantly reducing the processing time. In order to achieve concurrent operation, the OFDM demodulation is performed using a novel cumulative Fast Fourier Transform (FFT), which produces successively more reliable estimates of all transmitted symbols in each successive clock cycle. At the same time, a Fully-Parallel Turbo Decoder (FPTD) is used to recover successively more reliable estimates of all bits in each successive clock cycle.
I. INTRODUCTION
IN ADDITION to significantly increased throughput and reliability, an Ultra-Reliable Low Latency Communication (URLLC) paradigm has been proposed for the next generation wireless communication systems, targeting a significantly reduced end-to-end latency of less than 1 ms [1], [2]. An even lower latency is required for communications beyond 5G [3]. Achieving an increased throughput and reliability as well as a significant reduction in latency represents a significant challenge [4]-[6], having no simple solutions. However, once realised, this URLLC paradigm will allow humans or machines to communicate with remote mobile devices and control them seamlessly, without suffering from the lag that prevents accurate control using wireless communication systems [7]. This will enable a wide variety of new applications in remote surgery, automated driving, and virtual reality, having significant economic and societal impact [8]-[12]. However, the end-to-end latency of a wireless communication system is fundamentally limited by its physical layer [13], [14], which performs demodulation and error correction, among other tasks. Different future applications of the URLLC mode will impose different demands on the physical layer latency. For example, machine-automated low-latency capital market trading relies on multi-hop wireless communication links, where financial institutions use algorithms running on their own computers for automatically buying and selling stocks, whenever they momentarily have different values on stock exchanges in different cities. In this application, each relay in these links has to have a sub-microsecond physical layer latency [15], not including the propagation delay associated with each hop. By contrast, the so-called Tactile Internet [13], [16] will allow humans to seamlessly control remote devices, provided that a physical layer latency of below 100 μs can be achieved.
However, state-of-the-art (SOTA) wireless communication systems have physical layer latencies that are significantly higher than these targets. For example, the world's fastest low-latency capital market trading links impose a physical layer latency of around 5 μs per hop, not including propagation latency. Meanwhile, SOTA implementations of the Long Term Evolution (LTE) cellular telephony standard have a physical layer latency that significantly exceeds the 100 μs target of the Tactile Internet [13], [17]. In order to achieve a high throughput and reliability compared to predecessor schemes, 3GPP LTE employs OFDM [18]-[20] for mitigating echoes in the wireless channel, as well as a turbo code for correcting any remaining transmission errors [21]-[26]. Recently, a URLLC mode of operation has been introduced in the LTE standard, which maintains the combination of OFDM and turbo coding, but aims to reduce the processing time available for these operations by a factor of 7 [27]. Motivated by this, we adopt the combination of OFDM and turbo coding in this paper, although the proposed approach can also be expected to have applicability to other FFT-based modulation schemes and other iteratively decoded channel codes, such as low density parity check (LDPC) codes. However, these techniques impose a high signal processing complexity upon the physical layer, particularly in the receiver. As shown in Fig. 1(a), the processing of the receiver's FFT [28], [29] cannot begin until the whole message block has been received, since each of its outputs is a function of the whole received block. Owing to this, the FFT produces all of its outputs simultaneously, preventing the turbo decoding process from beginning until after the FFT has been completed.
In practical LTE deployments, the transmission latency incurred while receiving, the processing latency incurred while performing the FFT and the processing latency incurred while performing turbo decoding are each around 70 μs [13] , [23] , allowing pipelining as shown in Fig. 1(a) . The sum of these latencies is 210 μs, which already exceeds the above-mentioned 100 μs target, even without considering the latency associated with propagation, channel estimation, Multiple-Input Multiple-Output (MIMO) detection and transmitter processing.
This motivates our new architecture, in which the physical layer receiver components are operated concurrently, rather than consecutively. This approach is exemplified by Fig. 1(b), in which the reception, FFT processing and turbo decoding of each block are performed concurrently, potentially facilitating sub-microsecond physical layer latencies in the case of low-latency capital market trading. In the case of URLLC [18], [20], [30], this approach can reduce the associated latency from 210 μs to 70 μs, which is within the above-mentioned 100 μs latency target of the Tactile Internet. This leaves 30 μs for propagation and for the remaining, lower-complexity physical layer components, including channel estimation, MIMO detection and transmitter processing. Indeed, it may be expected that the proposed technique can be extended to perform some of these operations concurrently with those of Fig. 1(b), within the same 70 μs. In addition, turbo codes offer a lower decoding complexity and a better error-correction performance than LDPC codes at the low coding rates that are favoured in mission-critical vehicular communications for the sake of ensuring a low bit error rate (BER) and ultra-high reliability [31], [32]. Our new contributions are as follows.
1) We propose a novel cumulative FFT, which is processed incrementally and concurrently with the FPTD of [33], throughout the process of receiving a single OFDM symbol. Since the information carried by each turbo encoded bit is spread throughout the duration of the OFDM symbol, the proposed cumulative FFT can obtain some information about each bit as soon as the reception of the OFDM symbol begins, allowing turbo decoding to start immediately. As more and more of the OFDM symbol is received with passing time, the cumulative FFT can obtain more and more information about the turbo encoded bits, which can be fed into the concurrent turbo decoding process.
2) We show that if the turbo decoder can complete a sufficient number of iterations within the duration of the OFDM symbol, then it can achieve the same error correction performance as if the turbo decoding process had only begun after the reception of the OFDM symbol had been completed.
The rest of this paper is structured as follows. Section II provides a brief overview of the techniques that are employed in our proposed architecture, including the FFT and the FPTD [33]. Following this, the proposed concurrent OFDM demodulation and turbo decoding architecture is detailed in Section III. The validation of this architecture is presented in Section IV, while its error correction performance and extensions to manage the trade-offs between latency, reliability and complexity are presented in Section V. Finally, we offer our conclusions and avenues for future work in Section VI.
II. BACKGROUND
This section provides an overview of the FFT and FPTD techniques employed in the proposed architecture, and defines the notation that is employed in the following sections. Section II-A introduces the concept of a novel cumulative FFT, which will be detailed in Section III-B. Meanwhile, Section II-B highlights the FPTD decoding process of [33].
A. Fast Fourier Transform
In OFDM, a bit stream is decomposed into several parallel bit streams, each of which has a proportionately reduced bit rate and is modulated onto a different subcarrier. In this way, rather than using a serial time-domain (TD) bit stream, OFDM uses many low-rate parallel frequency domain (FD) bit streams, which are less prone to dispersion. Typically, OFDM schemes are implemented using Discrete Fourier Transform (DFT) techniques [34]-[36]. To be more specific, the Inverse Discrete Fourier Transform (IDFT) is performed in the transmitter to generate a single TD OFDM symbol to represent the set of the FD bit streams, each of which typically carries a Quadrature Amplitude Modulation (QAM) symbol. Meanwhile, the corresponding DFT is performed at the receiver to recover the QAM symbols carried by the subcarriers of the OFDM symbol. In the receiver, an N-point DFT can be defined as
$$Y_z = \sum_{n=0}^{N-1} y_n W_N^{nz}, \quad z = 0, 1, \ldots, N-1, \qquad (1)$$
where $W_N = \exp(-j2\pi/N)$. Here, $y_n$ is the nth sample of the received TD OFDM symbol, where the set of N samples are received consecutively, spread over time. Meanwhile, $Y_z$ is the zth FD subcarrier's QAM symbol, where each of the N FD QAM symbols is dependent on all N TD samples of the TD OFDM symbol. Note that in practice, the demodulator's DFT is typically implemented using the FFT, which has a significantly reduced complexity if N is high [28], [29]. If N is a power of 2, a radix-2 FFT is achieved by recursively partitioning $Y_z$ into odd- and even-indexed terms, across $v = \log_2(N)$ stages. The odd and even terms in $Y_z$ can be expressed respectively as
Here, each of (2) and (3) can be considered to be a DFT in its own right. This allows each of (2) and (3) to be further decomposed into two DFTs, comprising the odd and even elements, respectively. This may be repeated recursively, until the DFT has been fully decomposed into individual radix-2 structures, completing the FFT. The block diagram for the example of an N = 16-point radix-2 FFT with v = 4 computation stages is depicted in Fig. 2. The inputs on the left-hand edge of Fig. 2 represent TD samples. These inputs are first interleaved into a bit-reversed ordering, separating the odd terms and even terms. The TD samples are passed through v = 4 stages of radix-2 butterflies, each of which performs a radix-2 FFT calculation, as shown in Fig. 3. Following the final stage of radix-2 butterflies, the N = 16 FD samples are obtained.
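The recursive odd/even partitioning described above can be illustrated by the following minimal Python sketch. It is a textbook decimation-in-time radix-2 FFT rather than the hardware structure of Fig. 2, and the function names `fft_radix2` and `dft_direct`, as well as the test input, are introduced here purely for illustration.

```python
import cmath

def fft_radix2(y):
    """Recursive radix-2 decimation-in-time FFT; len(y) must be a power of 2."""
    n = len(y)
    if n == 1:
        return [complex(y[0])]
    even = fft_radix2(y[0::2])   # N/2-point DFT of the even-indexed samples
    odd = fft_radix2(y[1::2])    # N/2-point DFT of the odd-indexed samples
    out = [0j] * n
    for z in range(n // 2):
        # Radix-2 butterfly: combine the half-length DFTs using W_N^z.
        t = cmath.exp(-2j * cmath.pi * z / n) * odd[z]
        out[z] = even[z] + t
        out[z + n // 2] = even[z] - t
    return out

def dft_direct(y):
    """Direct O(N^2) evaluation of the DFT, used only as a reference."""
    n = len(y)
    return [sum(y[k] * cmath.exp(-2j * cmath.pi * k * z / n) for k in range(n))
            for z in range(n)]

# Arbitrary N = 16 test input (illustrative values only).
samples = [complex(k % 5, -k % 3) for k in range(16)]
Y = fft_radix2(samples)
```

With N = 16, the recursion depth is v = 4, matching the four butterfly stages of the example considered above.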
In the conventional approach, reception and demodulation are performed serially, where the FFT operation does not begin until after all the transmitted TD samples are received. However, in our concurrent approach, the demodulation process is started promptly after a small number of samples have been received. Successively more TD samples are received in each of a series of successive clock cycles. In a naive approach, the full-length N-sample FFT may be repeated in each clock cycle, using all TD samples received so far and assuming zero values for all TD samples not yet received, as shown in Fig. 2. In order to significantly reduce the complexity of repeating the N-point FFT in each clock cycle, we propose an efficient cumulative FFT in Section III-B. This eliminates all redundant calculations associated with zero-valued samples and reuses calculations from one clock cycle to the next. More specifically, an incremental part of the FFT is calculated in each clock cycle and these are accumulated across the series of clock cycles.
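The naive approach described above may be sketched as follows, assuming NumPy and the toy parameters N = 16 and C = 4. The random TD samples are stand-ins rather than a real OFDM symbol, and the variable names are illustrative. The sketch repeats a full N-point FFT in every clock cycle with zeros in place of the samples not yet received, and records how far each estimate is from the conventional full-reception result.

```python
import numpy as np

N, C = 16, 4               # FFT size and number of clock cycles (toy values)
B = N // C                 # TD samples received per clock cycle
rng = np.random.default_rng(0)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # stand-in TD samples

Y_full = np.fft.fft(y)     # conventional FFT, available only after full reception
errors = []
for c in range(1, C + 1):
    partial = np.zeros(N, dtype=complex)
    partial[: c * B] = y[: c * B]        # samples received so far, the rest zero
    Y_est = np.fft.fft(partial)          # full N-point FFT repeated every cycle
    errors.append(np.linalg.norm(Y_est - Y_full))
```

By Parseval's theorem the recorded error depends only on the energy of the samples that are still missing, so it shrinks from one clock cycle to the next and vanishes once all N samples have arrived.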
B. Fully-Parallel Turbo Decoder
Turbo encoders typically comprise two component encoders, which are used for encoding a sequence of information bits and an interleaved version of the sequence, respectively. Conventionally, turbo decoding is achieved using the Log-BCJR algorithm [37], which relies on forward and backward recursions along a trellis representation of the upper and lower component decoders. The component decoders are operated alternately, over serial iterations, exchanging information via the interleaver. However, the forward and backward recursions impose data dependencies, which limit the achievable degree of parallel processing, resulting in high processing latency. Several approaches have been proposed to improve the throughput and latency of the Log-BCJR turbo decoder, most of which focus on increasing the parallelism of the conventional turbo decoder [33], [38]-[40]. To be more specific, we have previously proposed an FPTD algorithm [33], [41], which dramatically increases the parallelism of the decoding process and achieves significantly lower latency, by dispensing with the recursions of the Log-BCJR algorithm.
The FPTD operates on the basis of soft information in the form of Logarithmic Likelihood Ratios (LLRs). Here, we define the LLR $\bar{b}$ pertaining to a bit $b \in \{0, 1\}$ as
where Y is the received signal. As depicted in Fig. 4 provided to the upper decoder. Also note that for the sake of simplicity we omit the discussion of the LLRs pertaining to the twelve termination bits in the LTE standard [42] .
Besides the LLRs, the kth algorithmic block in each decoder is provided with a vector of A forward state metrics $\bar{\alpha}^u_{k-1}$ or $\bar{\alpha}^l_{k-1}$, as well as a vector of A backward state metrics $\bar{\beta}^u_k$ or $\bar{\beta}^l_k$, where we have A = 8 in the LTE standard. When the kth algorithmic block in each decoder is activated, these state metrics are combined with the a priori LLRs in order to generate a vector of A forward state metrics $\bar{\alpha}^u_k$ or $\bar{\alpha}^l_k$, as well as a vector of A backward state metrics $\bar{\beta}^u_{k-1}$ or $\bar{\beta}^l_{k-1}$, which are provided to the neighbouring algorithmic blocks on either side in the same decoder. Furthermore, the kth algorithmic block generates a message extrinsic LLR $\bar{b}^{u,e}_{1,k}$ or $\bar{b}^{l,e}_{1,k}$, which is passed by the interleaver to one of the algorithmic blocks in the other decoder, where it is used as a message a priori LLR $\bar{b}^{l,a}_{1,k}$ or $\bar{b}^{u,a}_{1,k}$, respectively.
Rather than employing the alternated activation of the upper and lower decoders as in the SOTA turbo decoder, the FPTD exploits the odd-even property of the LTE interleaver to enable an odd-even processing schedule [43] , [44] . To be more specific, the interleaver corresponding to each of the 188 legitimate frame lengths K supported in the LTE standard only connects algorithmic blocks of the upper decoder having an odd index to algorithmic blocks in the lower decoder that have an odd index as well. Likewise, even-index algorithmic blocks in the upper decoder are only connected to algorithmic blocks with even indices in the lower decoder. As a result, the algorithmic blocks can be grouped into two sets, where no two blocks in the same set have connections to each other. To be more explicit, the first set consists of all the odd-indexed blocks in the upper decoder, along with all even-indexed blocks in the lower decoder. Likewise, the second set comprises the remaining blocks, namely the even-indexed blocks in the upper decoder, together with the odd-indexed blocks in the lower decoder.
In the FPTD's odd-even processing schedule, it is the two sets of algorithmic blocks that iteratively exchange extrinsic LLRs and state metrics. In the first half-iteration of the FPTD operation, the odd-indexed algorithmic blocks in the upper decoder and the even-indexed blocks in the lower decoder operate simultaneously during a first clock cycle. The second half-iteration is performed during the second clock cycle and involves the operation of all other algorithmic blocks. This process is repeated during successive iterations, with each iteration comprising only two clock cycles. Although the FPTD requires more iterations to achieve the same BER as the SOTA Log-BCJR decoder, its benefit is that the total number of clock cycles required is reduced from hundreds or thousands to tens. A detailed comparison of the FPTD with the SOTA LTE turbo decoder in terms of its complexity per iteration, the number of decoding iterations required and the overall latency, is presented in [33].
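The odd-even grouping described above can be checked numerically with the following sketch. It assumes the quadratic permutation polynomial form π(i) = (f1·i + f2·i²) mod K of the LTE interleaver, with the coefficients f1 = 3 and f2 = 10 that we believe correspond to the shortest frame length K = 40. Neither the polynomial form nor these values are stated in the text above, so they should be treated as assumptions. Zero-based indices are used, which may swap the labels "odd" and "even" relative to a one-based convention, but does not affect the parity-preserving property.

```python
K = 40            # shortest LTE frame length (assumed example)
f1, f2 = 3, 10    # QPP coefficients assumed for K = 40

# Lower-decoder block i is connected to upper-decoder block pi[i].
pi = [(f1 * i + f2 * i * i) % K for i in range(K)]

# First set: odd-indexed upper blocks and even-indexed lower blocks.
set_one = {("upper", k) for k in range(K) if k % 2 == 1} | \
          {("lower", k) for k in range(K) if k % 2 == 0}
# Second set: all remaining blocks.
set_two = {("upper", k) for k in range(K) if k % 2 == 0} | \
          {("lower", k) for k in range(K) if k % 2 == 1}

def active_set(clock_cycle):
    """Blocks activated in a given clock cycle of the odd-even schedule."""
    return set_one if clock_cycle % 2 == 0 else set_two

# Every interleaver connection must link a block in one set to a block in the other.
connections = [(("upper", pi[i]), ("lower", i)) for i in range(K)]
crossing = all((u in set_one) != (l in set_one) for u, l in connections)
```

Since neighbouring blocks within the same decoder also have indices of opposite parity, every exchange of state metrics or extrinsic LLRs crosses between the two sets, which is what allows each set to be activated in alternate clock cycles.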
III. PROPOSED TURBO-CODED OFDM SCHEME
The proposed concurrent turbo-coded OFDM scheme is discussed in this section. Section III-A introduces our notation and details the proposed scheme's transmitter, which is the same as in a conventional turbo-coded OFDM scheme. Following this, the proposed concurrent detection, FFT and turbo decoding approach of the proposed receiver is detailed in Section III-B.
A. Transmitter
In the turbo-coded OFDM transmitter of Fig. 5, the K message bits $\mathbf{b}^u_1 = [b_{1,k}]_{k=0}^{K-1}$ are encoded by an LTE turbo encoder. To be more specific, the vector of message bits $\mathbf{b}^u_1$ is interleaved to obtain the vector of interleaved message bits $\mathbf{b}^l_1$. These two vectors are encoded by two identical convolutional encoders (CEs), referred to as the upper and lower encoders, respectively. The resultant parity bit vectors $\mathbf{b}^u_2$ and $\mathbf{b}^l_2$ are interleaved with the systematic bit vector $\mathbf{b}^u_3$, which is a replica of the message bit vector $\mathbf{b}^u_1$. The resultant bit vector is punctured or repeated depending on the code length and then output as the turbo-encoded bit vector $\mathbf{b}_4 = [b_{4,t}]_{t=0}^{T-1}$, which comprises T bits. The T turbo-encoded bits are then converted into N = , in order to avoid the inter-symbol interference associated with dispersive channels [19]. Finally, the resultant turbo-encoded, OFDM-modulated symbol is passed through a Parallel to Serial Converter (PSC) and transmitted using a Digital to Analogue Converter (DAC) and a Radio Frequency (RF) front end.
B. Receiver
The proposed receiver schematic is depicted in Fig. 6, where the signal is received using an RF front end, an Analogue to Digital Converter (ADC), a CP remover, and a Serial to Parallel Converter (SPC). These components have naturally low latencies compared to the rest of the schematic, which comprises a novel cumulative FFT, a bank of N novel QAM demappers and the FPTD [33]. The operations of the cumulative FFT, the N QAM demappers and the FPTD are spread over a total of C clock cycles. Fig. 6 illustrates a 'toy' example, in which C = 4 clock cycles are used to recover K = 8 bits from N = 16 samples of the 4QAM-modulated OFDM symbol.
After the removal of the CP, the received samples of the OFDM symbol can be expressed as
where $\mathbf{h} = [h_n]_{n=0}^{N-1}$ is the Channel Impulse Response (CIR) and $\mathbf{n} = [n_n]_{n=0}^{N-1}$ is the Additive White Gaussian Noise (AWGN), which has a zero mean and a variance of $\sigma^2 = 1/(2\gamma)$, where $\gamma$ denotes the Signal-to-Noise Ratio (SNR). The corresponding FD signal obtained after the FFT operation can be expressed as
Here, for the kth element in $\mathbf{Y}$, we have
$$Y_k = H_k X_k + N_k, \qquad (6)$$
where $H_k$ is the single-tap channel gain of the corresponding QAM symbol $X_k$. However, instead of waiting to receive all N samples of the OFDM symbol before performing the FFT operation, the novel approach of Fig. 6 performs the cumulative FFT during each clock cycle, while the TD samples are still being received. The N/C samples of the OFDM-modulated symbol received in each clock cycle comprise the fraction 1/C of the total number of samples N. Within the same clock cycle, these N/C TD samples are immediately forwarded to the cumulative FFT, which updates its output symbols $\mathbf{Y}$ by incorporating these samples. More specifically, the cumulative FFT effectively calculates an FFT of all samples received so far, while assuming zero values for the remaining samples in the OFDM symbol that have not yet been received. This is exemplified in Fig. 2, where only N/C = 4 of the received symbols are forwarded to the cumulative FFT in the first clock cycle, with the remaining symbols that have not yet been received being set to zero. After the interleaver, these N/C = 4 input symbols are evenly distributed among the inputs of the radix-2 FFT calculations, as demonstrated in Fig. 3, with the other inputs being zero. The operation of the cumulative FFT may be demonstrated by decomposing (1) in terms of the TD samples received in each clock cycle, as follows.
$$Y_z = \sum_{c=0}^{C-1} \sum_{m=0}^{N/C-1} y_{\frac{N}{C}c+m}\, e^{-j2\pi z\left(\frac{N}{C}c+m\right)/N} = \sum_{c=0}^{C-1} e^{-j2\pi zc/C} \sum_{m=0}^{N/C-1} y_{\frac{N}{C}c+m}\, e^{-j2\pi zm/N}, \qquad (7)$$
where $y_{\frac{N}{C}c+m}$ is the mth TD sample received in the cth clock cycle. This decomposition reveals that the operation of the FFT over all the N received symbols is equivalent to performing an N/C-point FFT over each set of N/C TD samples received in each clock cycle and performing a weighted sum of the results across the C clock cycles. Fig. 3 shows that if either of the two inputs of a radix-2 butterfly adopts a value of zero, then its two outputs adopt the value of the other input, but with different phase shifts applied. This observation is exploited in the proposed cumulative FFT, where the N/C TD samples received in each clock cycle are evenly distributed by the input interleaver to provide only one non-zero value among each set of C adjacent nodes. Therefore, the outputs of the first $\log_2 C$ stages would simply be replicas of the non-zero values, but with different phase shifts applied. By exploiting this, only $\log_2(N/C)$ layers of radix-2 butterflies are required in order to perform the N-point FFT operation, where multipliers are employed to shift the phase $e^{-j2\pi zc/N}$ of the N resultant symbols. Following this, adders and registers are used to accumulate the results obtained in each successive clock cycle. The proposed cumulative FFT behaves as an SPC, with each successive clock cycle providing N QAM-modulated symbols of progressively higher quality, containing a diminishing level of Inter-Carrier Interference (ICI).
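The accumulation across clock cycles can be sketched numerically as follows, assuming NumPy and the toy parameters N = 16 and C = 4. For brevity, the contribution of each block is computed using a zero-padded N-point FFT (`np.fft.fft(block, n=N)`), rather than the pruned structure comprising $\log_2(N/C)$ butterfly layers described above, so the sketch illustrates only the arithmetic of the decomposition and not the hardware saving. The phase term accounts for the time offset of the cth block of contiguous samples, and all variable names are illustrative.

```python
import numpy as np

N, C = 16, 4
B = N // C                                   # TD samples per clock cycle
rng = np.random.default_rng(1)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # stand-in TD samples
z = np.arange(N)                             # subcarrier indices

Y_acc = np.zeros(N, dtype=complex)           # accumulator registers
snapshots = []
for c in range(C):
    block = y[c * B:(c + 1) * B]             # samples received in cycle c
    # Inner sum over m: sum_m block[m] * exp(-j*2*pi*z*m/N) for every z.
    inner = np.fft.fft(block, n=N)
    # Phase rotation for the block's time offset of c*B samples.
    phase = np.exp(-2j * np.pi * z * c / C)
    Y_acc = Y_acc + phase * inner            # accumulate across clock cycles
    snapshots.append(Y_acc.copy())
```

After the final clock cycle the accumulator equals the conventional N-point FFT of the complete symbol, while each intermediate snapshot equals the FFT of the samples received so far, padded with zeros.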
Each of the N QAM demappers of Fig. 6 processes the corresponding M = 4QAM-modulated symbol by approximately modelling the ICI as additional Gaussian-distributed noise [45]. More specifically, the detection of symbols modulated using a set S of M constellation points begins by equalising the symbol that it is provided with in each clock cycle, converting it into soft LLRs. The a priori probability of the ith bit $b_{4,i}$ in symbol $X_k \in S$ being 0, given the kth received symbol $Y_k$, is
According to Bayes' rule, we have
For the case of the single tap channel characterised in (6), we have
where
, as will be demonstrated as follows. Here, D denotes the number of QAM symbols received in each clock cycle.
After performing the cumulative FFT on the received symbol in the cth clock cycle, we have (11), where $\mathbf{I}_D$ comprises the first D columns of a diagonal matrix $\mathbf{I}_N$ having a size of N × N and
Using the derivations of the Appendix, the ICI and noise can be separated in $Y_{z,c}$, which can be expressed as
where the derivation of (13b) is also provided in the Appendix. Then the signal to noise power ratio can be expressed as
whereas the ICI to signal power ratio can be expressed as
Now, we obtain the relationship between the signal power and the ICI power together with the noise, as
Therefore, given the unit signal power, the variance of the ICI and noise can be expressed as
The LLR of the ith bit in symbol X k can now be expressed as
where the symbol set $S_0$ comprises all constellation points in S that imply a value of 0 for the ith bit $b_{4,i}$, while $S_1$ comprises the constellation points that imply a value of 1 for the ith bit $b_{4,i}$.
The max * operator [46] may then be employed to simplify the calculation of (18), according to 
Therefore, the LLR of $b_{4,i}$ can be expressed as
Then, the resultant set of $N \log_2 M$ LLRs is distributed among the inputs of the FPTD by its interleaver of Fig. 6. As described in [33], the FPTD comprises two rows of N concurrently operated Processing Elements (PEs), allowing it to process all $N \log_2 M$ LLRs provided in each clock cycle, in contrast to a conventional turbo decoder. Each PE uses registers to iteratively exchange LLRs with its neighbouring PEs in the same row, as well as with a corresponding PE in the other row, via a second interleaver. The quality of the iteratively exchanged LLRs improves in each clock cycle, until final LLR decisions are obtained for the N bits, with K being the number of message bits, using additions in the final clock cycle.
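The soft demapping step described earlier in this section may be sketched as follows. The sketch assumes the usual Jacobian-logarithm form of the max* operator, a Gray-mapped unit-power 4QAM constellation, the sign convention LLR = ln[Pr(b = 1 | Y)/Pr(b = 0 | Y)] and a single combined variance `var` for the ICI plus noise. None of these details are fixed by the text above, so they are illustrative assumptions, as are the function names.

```python
import math
from functools import reduce

def max_star(a, b):
    """Jacobian logarithm: ln(e^a + e^b) = max(a, b) + ln(1 + e^{-|a - b|})."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Assumed Gray mapping with unit average power: bit 0 -> +1/sqrt(2), bit 1 -> -1/sqrt(2).
POINTS = {(b0, b1): complex(1 - 2 * b0, 1 - 2 * b1) / math.sqrt(2)
          for b0 in (0, 1) for b1 in (0, 1)}

def demap_4qam(y, h, var):
    """Return [LLR(b0), LLR(b1)] for one received FD symbol y.

    h is the single-tap channel gain and var is the assumed total variance
    of the ICI plus noise, which is modelled as Gaussian.
    """
    llrs = []
    for i in range(2):
        metrics = {0: [], 1: []}             # log-likelihoods for S_0 and S_1
        for bits, x in POINTS.items():
            metrics[bits[i]].append(-abs(y - h * x) ** 2 / var)
        llrs.append(reduce(max_star, metrics[1]) - reduce(max_star, metrics[0]))
    return llrs

# Noise-free example: the symbol carrying bits (0, 1) through a channel gain h.
h = 0.8 - 0.3j
example_llrs = demap_4qam(h * POINTS[(0, 1)], h, var=0.1)
```

In early clock cycles the ICI-dominated variance would be large, which shrinks the LLR magnitudes and hence the weight that the turbo decoder places on them. Adopting the opposite LLR sign convention would simply swap the roles of the two symbol sets.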
IV. VALIDATION
In this section, we validate the correctness of our cumulative FFT and ICI-aware soft QAM demapper by confirming that the resultant LLRs satisfy the consistency condition given in [47] . More specifically, two methods for measuring the Mutual Information (MI) of LLRs are proposed in [47] , referred to as the averaging method and the histogram-based method. While the histogram method computes the MI of LLRs by comparing them to the correct bit values, the averaging method does not consider the correct bit values. Instead, it assumes that the LLRs satisfy the consistency condition and that the MI can be correctly computed based on the magnitudes of the LLRs alone. If a vector of LLRs satisfies the consistency condition, then the averaging method will measure the same MI value as the histogram method, which does not assume consistency. Fig. 7 illustrates the employment of the averaging and histogram methods of calculating the MI to validate our proposed approach. The results of the comparisons are shown in Figs. 8 and 9 , where 4QAM and 16QAM are employed, respectively. As shown in Figs. 8 and 9 , both methods of measuring the MI give similar results across a variety of different E b /N 0 values, and as successively more symbols D are received. This confirms that the LLRs satisfy the consistency condition and validates the accuracy of the proposed approach. As expected, higher E b /N 0 values and more received symbols D result in higher LLR reliability, whereas higher-order QAM decreases the LLR reliability, at a given E b /N 0 value.
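The consistency check described above may be illustrated by the following sketch, which compares the two MI measurements on synthetic Gaussian-distributed LLRs that are consistent by construction. It is not a reproduction of the simulation of Fig. 7. The estimator forms, the LLR sign convention ln[Pr(b = 1)/Pr(b = 0)], the assumption of equiprobable bits and all names are assumptions made for illustration.

```python
import numpy as np

def mi_averaging(llrs):
    """MI estimate from LLR magnitudes only; valid if the LLRs are consistent."""
    p_err = 1.0 / (1.0 + np.exp(np.abs(llrs)))       # implied bit error probability
    p_err = np.clip(p_err, 1e-12, 0.5)
    entropy = -p_err * np.log2(p_err) - (1 - p_err) * np.log2(1 - p_err)
    return 1.0 - float(np.mean(entropy))

def mi_histogram(llrs, bits, n_bins=100):
    """MI estimate from histograms of the LLRs conditioned on the true bits."""
    edges = np.linspace(llrs.min(), llrs.max(), n_bins + 1)
    p0 = np.histogram(llrs[bits == 0], bins=edges)[0].astype(float)
    p1 = np.histogram(llrs[bits == 1], bins=edges)[0].astype(float)
    p0 /= p0.sum()
    p1 /= p1.sum()
    mi = 0.0
    for p in (p0, p1):                                # equiprobable bits assumed
        nz = p > 0
        mi += 0.5 * np.sum(p[nz] * np.log2(2 * p[nz] / (p0[nz] + p1[nz])))
    return float(mi)

rng = np.random.default_rng(0)
sigma = 2.0                                           # illustrative LLR spread
bits = rng.integers(0, 2, size=100_000)
# Gaussian LLRs with mean +/- sigma^2/2 and variance sigma^2 are consistent.
llrs = (2 * bits - 1) * sigma**2 / 2 + sigma * rng.standard_normal(bits.size)
mi_avg = mi_averaging(llrs)
mi_hist = mi_histogram(llrs, bits)
```

For consistent LLRs the two estimates should agree closely, which is the behaviour reported for the proposed demapper. If the LLRs were over-confident, for example scaled up by a constant, the averaging method would be expected to report a higher MI than the histogram method.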
V. PERFORMANCE ANALYSIS
In this section, we present and benchmark the performance of the proposed concurrent OFDM demodulation and turbo decoding architecture. Fig. 10 characterises the performance of the proposed scheme and that of a benchmarker for the case of a punctured LTE turbo code, Gray-coded QAM, OFDM and a quasi-static Extended Typical Urban model (ETU) Rayleigh fading channel [48] , where K = 1376, N = 2048 for 4QAM and N = 1024 for 16QAM. Here, the concurrent receive, FFT and turbo decode approach of Fig. 1(b) employs the architecture proposed in Fig. 6 and C ∈ {128, 256, 512, 1024} clock cycles. This is compared to a benchmarker employing the serial receive, FFT and turbo decode approach of Fig. 1(a) , when employing a conventional turbo decoder. As the number of clock cycles C is increased, the Bit Error Ratio (BER) performance of the proposed scheme can be seen to converge to that of the benchmarker, proving the concept of the concurrent receive, FFT and turbo decode approach. A similar result may be observed when employing the higher-order QAM, where higher bandwidth efficiency is obtained at the cost of degraded BER performance. However, for the 4QAM modulation scheme, the proposed architecture requires up to C = 1024 clock cycles in order to closely match the performance of the benchmarker, while even more clock cycles are required for the 16QAM scheme to approach the benchmarker. A significant reduction in processing energy consumption could be achieved upon employing only C = 128 clock cycles, although this is associated with a performance loss of up to 2.5 dB, compared to the benchmarker.
In order to mitigate this performance loss, the architecture of Fig. 6 may be further refined. In a first refinement, referred to as the staggered receive, FFT and turbo decode approach, the operation of the FPTD may be staggered relative to that of the cumulative FFT of Fig. 6. More specifically, the operation of the FPTD may be delayed until S ∈ [0, N] symbols have been received, facilitating a gradated trade-off between latency and processing energy consumption. Fig. 11 shows that upon adopting this approach, C = 128 clock cycles and a stagger of S = 3N/8 symbols are sufficient for closely matching the BER performance of the benchmarker, when employing both 4QAM and 16QAM.
In a second refinement, referred to as the scaled concurrent receive, FFT and turbo decode approach, the BER performance loss can also be mitigated by reducing the weighting of the ICI-dominated LLRs provided by the soft demappers during the early clock cycles. This can be achieved by applying a gradually increasing scaling factor to the LLRs in successive clock cycles, according to an exponential function y = exp x. Fig. 12 shows that decreasing the weighting of the LLRs provided in early cycles improves the overall BER performance by around 0.2 dB.
By designing the proposed architecture of Fig. 6 to implement the staggered and/or scaled receive, FFT and turbo decode approach using C = 128 and a clock frequency of at least 176 MHz, sub-microsecond physical layer latencies will be facilitated for applications such as low-latency capital market trading or robotic cars. In applications such as LTE, where a transmission latency of 70 μs is imposed, the operation of the cumulative FFT and the FPTD can be spread over this duration, in order to achieve a significantly improved hardware resource efficiency or processing energy consumption.
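One possible realisation of the scaled approach is sketched below. The text above specifies only that the scaling factor grows exponentially across the clock cycles, so the normalisation (reaching unity in the final cycle), the `steepness` parameter and the function name are hypothetical choices made purely for illustration.

```python
import math

def llr_scales(num_cycles, steepness=3.0):
    """Exponentially increasing LLR scaling factors, reaching 1 in the final cycle.

    The argument of exp() rises linearly from -steepness to 0 across the cycles;
    steepness is a hypothetical tuning parameter, not a value from the paper.
    """
    return [math.exp(steepness * (c / (num_cycles - 1) - 1.0))
            for c in range(num_cycles)]

scales = llr_scales(128)
# In cycle c, each demapper LLR would be multiplied by scales[c] before it is
# forwarded to the turbo decoder.
```

A larger `steepness` suppresses the early, ICI-dominated LLRs more strongly, at the cost of discarding more of the information that they carry.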
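The latency figures quoted above follow from dividing the number of clock cycles by the clock frequency, as the following arithmetic sketch shows. The function name is illustrative, and the LTE calculation assumes that the same C = 128 clock cycles are simply spread evenly over the 70 μs transmission duration.

```python
def processing_latency_s(num_cycles, clock_hz):
    """Processing latency in seconds for a given cycle count and clock frequency."""
    return num_cycles / clock_hz

# C = 128 clock cycles at 176 MHz gives roughly 0.73 microseconds.
latency = processing_latency_s(128, 176e6)

# Spreading the same 128 cycles over a 70 microsecond LTE transmission would
# require a clock frequency of only about 1.83 MHz (assumption, see lead-in).
min_clock_lte_hz = 128 / 70e-6
```

The roughly hundredfold gap between these two clock frequencies is the headroom that may be traded for hardware resource efficiency or energy consumption in the LTE case.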
VI. CONCLUSION
In this paper, we proposed a concurrent OFDM demodulation and turbo decoding architecture employing a novel cumulative FFT technique for significantly reducing the associated processing latency. Rather than completing the reception, FFT and turbo decoding operations one after another, these three processes are performed concurrently in our proposed approach. In this way, the overall receiving, demodulation and decoding latency is only a third of that of the conventional serial architecture, which makes it a promising candidate for URLLC applications. In order to allow the trade-off between latency and complexity to be adjusted and to improve the associated BER performance, we also proposed staggered and scaled refinements to the proposed approach.
Our future work will investigate the performance of the proposed architecture in the case of mobile terminals such as autonomous vehicles, where time-varying channels and Doppler shift must be considered. The extension of the proposed architecture to other transmission schemes, such as LDPC-coded modulation and diverse multicarrier communications schemes will also be explored. It may be expected that these advanced techniques will enable further enhancements to the proposed technique.
APPENDIX DERIVATION OF (13)
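$$
\begin{aligned}
Y_{z,c} &= H_z \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} X_n W_N^{nk} d_k W_N^{-kz} + N_z \\
&= H_z \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} X_n W_N^{(n-z)k} d_k + N_z \\
&= H_z \frac{1}{N} \sum_{n=0}^{N-1} X_n \sum_{k=0}^{N-1} W_N^{(n-z)k} d_k + N_z \\
&= H_z \frac{1}{N} X_z \sum_{k=0}^{N-1} W_N^{0} d_k + H_z \frac{1}{N} \sum_{n=0, n \neq z}^{N-1} X_n \sum_{k=0}^{N-1} W_N^{(n-z)k} d_k + N_z \\
&= \underbrace{H_z \frac{D}{N} X_z}_{\text{what we expect}} + \underbrace{H_z \frac{1}{N} \sum_{n=0, n \neq z}^{N-1} X_n \sum_{k=0}^{D-1} W_N^{(n-z)k}}_{\text{ICI}} + \underbrace{N_z}_{\text{noise}} \\
&= \frac{D}{N} \left( H_z X_z + \frac{H_z}{D} \sum_{n=0, n \neq z}^{N-1} \sum_{k=0}^{D-1} X_n W_N^{(n-z)k} + \frac{N}{D} N_z \right) \\
&= \frac{D}{N} \left( H_z X_z + \underbrace{\frac{H_z}{D} \sum_{n=0, n \neq z}^{N-1} \sum_{k=0}^{N-1} X_n W_N^{(n-z)k}}_{=0} - \frac{H_z}{D} \sum_{n=0, n \neq z}^{N-1} \sum_{k=D}^{N-1} X_n W_N^{(n-z)k} + \frac{N}{D} N_z \right) \\
&= \frac{D}{N} \left( H_z X_z - \frac{H_z}{D} \sum_{n=0, n \neq z}^{N-1} \sum_{k=D}^{N-1} X_n W_N^{(n-z)k} + \frac{N}{D} N_z \right)
\end{aligned}
$$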
