This paper presents a hardware design of a high-throughput, low-latency preamble detector for a 3GPP LTE physical random access channel (PRACH) receiver. The presented PRACH receiver uses a pipelined structure to improve the throughput of power delay profile (PDP) generation, which is executed multiple times during preamble detection. In addition, to reduce detection latency, we propose an instantaneous preamble detection method for both the restricted and unrestricted sets. The proposed method detects all existing preambles directly and instantaneously from the PDP output while conducting PDP combining for the restricted set. PDP combining enables the PRACH receiver to detect preambles robustly even when a severe Doppler effect or frequency error exists. Using the proposed method, the worst-case preamble detection latency is less than 1 ms with a 136 MHz clock, and the proposed PRACH receiver can be implemented with approximately 237k equivalent ASIC gates or by occupying 30.2% of an xc6vlx130t FPGA device.
tected UE through downlink connection and the UE communicates with BS after adjusting uplink timing [2] .
Several previous studies have evaluated the detection performance of different PRACH detection strategies [4]. The studies in [5]-[7] try to reduce the computational complexity of PRACH reception.
This research focuses on the preamble detection latency and on the preamble detection performance in the presence of Doppler effect or frequency error. Reducing preamble detection time helps improve network latency, and robust detection under Doppler effect or frequency error is crucial for high-speed cell performance. To this end, we present a hardware design of a low-latency PRACH receiver that detects preambles robustly in the presence of Doppler effect or frequency error.
This paper is organized as follows. Sect. 2 briefly introduces the PRACH of the LTE system, and Sect. 3 describes the proposed preamble detection method with an analytic formulation. Sect. 4 explains the hardware design of the proposed low-latency PRACH receiver, Sect. 5 presents the implementation details and results, and Sect. 6 concludes the paper.
LTE PRACH Overview
In this section, we briefly introduce the physical random access channel (PRACH) of the 3GPP LTE system.
The random access preambles in the LTE system are generated using Zadoff-Chu (ZC) sequences. The u-th root ZC sequence, x_u(n), is defined as follows [8]:
$$x_u(n) = \exp\left(-j\pi u n(n+1)/N_{ZC}\right), \quad 0 \le n \le N_{ZC}-1 \qquad (1)$$
where N_ZC=839 is the prime sequence length and u = 1, 2, ..., 838 is the root index of the ZC sequence. The ZC sequence is a constant amplitude zero autocorrelation (CAZAC) sequence with ideal cyclic autocorrelation properties [8]. Therefore, multiple orthogonal sequences can be generated from one ZC sequence by simple cyclic shifting, and these sequences are used as random access preambles. The random access preamble sequence is defined as follows:
$$x_{u,v}(n) = x_u\left((n + C_v) \bmod N_{ZC}\right) \qquad (2)$$
where v represents the preamble identification number (PID) and C_v denotes the cyclic shift of each preamble.
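The CAZAC property and the effect of the cyclic shift in (1) and (2) can be checked numerically. The following is a minimal sketch in plain Python; the function names (`zc_root`, `preamble`, `cyclic_corr`) are illustrative and not taken from the paper. It shows that a preamble shifted by C_v correlates with the root sequence only at lag (N_ZC − C_v) mod N_ZC.

```python
import cmath
import math

N_ZC = 839  # prime sequence length used by LTE PRACH


def zc_root(u, n_zc=N_ZC):
    """u-th root Zadoff-Chu sequence, x_u(n) = exp(-j*pi*u*n*(n+1)/N_ZC)."""
    seq = []
    for n in range(n_zc):
        # The phase is periodic in u*n*(n+1) with period 2*N_ZC, so reduce the
        # integer first to keep the floating-point argument small.
        k = (u * n * (n + 1)) % (2 * n_zc)
        seq.append(cmath.exp(-1j * math.pi * k / n_zc))
    return seq


def preamble(u, c_v, n_zc=N_ZC):
    """Cyclically shifted preamble, x_{u,v}(n) = x_u((n + C_v) mod N_ZC)."""
    base = zc_root(u, n_zc)
    return [base[(n + c_v) % n_zc] for n in range(n_zc)]


def cyclic_corr(a, b, lag):
    """Cyclic cross-correlation of a against b at a single lag."""
    n_len = len(a)
    return sum(a[(n + lag) % n_len] * b[n].conjugate() for n in range(n_len))
```

For example, `abs(cyclic_corr(preamble(129, 13), zc_root(129), N_ZC - 13))` evaluates to N_ZC (up to rounding), while the same correlation at any other lag is numerically zero.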
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
A cyclic shift region of length N_CS is assigned to each preamble (a cell has 64 preambles in total). The cyclic shift values (C_v) for each preamble are defined as
If all 64 C_v values cannot be generated from a single ZC sequence, the root index (u) of the ZC sequence is changed to the next value and C_v generation continues. In this case, multiple ZC sequences with different root indexes are needed for the 64 preambles. However, (3) applies when the Doppler effect or frequency error is not considered. The LTE standard calls these preambles the "unrestricted set". The other set of preambles in the LTE standard is called the "restricted set". The restricted set is designed for high-speed cells and takes the Doppler effect or frequency error into account. In the restricted set, some cyclic shift regions are prohibited to avoid ambiguity during preamble detection (the reason is explained in Sect. 3), and the cyclic shift values (C_v) are defined as follows [1].
where the d_u value is defined as follows and u^{-1} is the multiplicative inverse of u mod N_ZC.
After the preamble ID is selected (by deciding C_v), the preamble is transmitted using the SC-FDMA modulation procedure shown in Fig. 1.
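As a worked illustration of the d_u quantity used by the restricted set, the sketch below computes it from the multiplicative inverse of u. It assumes the definition in 3GPP TS 36.211 (d_u = p if p < N_ZC/2, otherwise N_ZC − p, with p = u^{-1} mod N_ZC) and the two restricted-set ranges of that specification, which are taken here to correspond to conditions (5) and (6) of this paper. The function names are illustrative.

```python
N_ZC = 839


def d_u(u, n_zc=N_ZC):
    """Cyclic distance of the fake signatures for root index u (assumed TS 36.211 form)."""
    p = pow(u, -1, n_zc)  # multiplicative inverse of u mod N_ZC (Python 3.8+)
    return p if p < n_zc / 2 else n_zc - p


def restricted_case(u, n_cs, n_zc=N_ZC):
    """Return 5 or 6 for the restricted-set condition that u satisfies, else None."""
    d = d_u(u, n_zc)
    if n_cs <= d < n_zc / 3:
        return 5
    if n_zc / 3 <= d <= (n_zc - n_cs) / 2:
        return 6
    return None  # root index not usable for the restricted set with this N_CS
```

With the two examples used later in Sect. 4, `d_u(218)` gives 127 and `d_u(707)` gives 375, matching the values quoted for Fig. 7(b) and Fig. 7(c).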
Analytic Modeling of Proposed PRACH Receiver
In this section, we explain the preamble detection flow of the proposed PRACH receiver using an analytic formulation.
Preamble detection is accomplished using the power delay profile (PDP); the preamble transmission and detection flow is shown in Fig. 1. First, assume that the UE transmits the preamble with C_v=0 (PID=0), and ignore the round trip delay (T_RTD=0) for now. Then the frequency domain received signal R_u(k) can be represented as follows.
Fig. 1 The preamble detection flow of the proposed PRACH receiver.
Fig. 2 The inter-carrier interference due to frequency error.
For brevity, assume h=1 and no noise, and consider only the dominant ICI from the two adjacent subcarriers. Then (8) can be represented as follows.
To calculate the delay profile, R_u(k) is multiplied by the conjugate of the original ZC sequence in the frequency domain.
Using the following property [9] and substituting (11) into (10),
and the power delay profile (PDP) is calculated as |τ_u(n)|. The preamble detector uses the PDP data (|τ_u(n)|) for preamble detection.
From (12), if ε=0, only the D(ε) term contributes to τ_u(n); therefore, a single impulse (signature) appears at the zero delay position of |τ_u(n)|. However, if ε≠0, two other impulses (fake signatures) also appear on both sides of the genuine signature, at a distance of u^{-1}, as shown in Fig. 3(a). Because of these fake signatures, the cyclic shift regions (length=N_CS) in which fake signatures appear are restricted for preamble transmission to avoid ambiguity during preamble detection. Due to this restriction, the C_v values for the restricted set are defined as shown in (4).
If we consider an arbitrary C_v value (arbitrary PID), the genuine signature moves to the S_v position on the delay profile output (|τ_u(n)|) because of the transmitter's cyclic shift C_v. If we also consider a practical multi-path channel and the uplink timing error due to the round trip delay (T_RTD) between the UE and the BS, |τ_u(n)| takes the shape of Fig. 3(b). Due to the multi-path channel, the signature is no longer an impulse but is spread in delay, and the round trip delay appears as a delay (T_RTD) on the delay profile output.
The amplitudes of the signatures (D(ε), I_{±1}(ε)) vary with the frequency error value (ε), as shown in Fig. 3 and Fig. 2. As ε increases, the peak of the genuine signature decreases while the fake signatures increase. Therefore, if only the one N_CS cyclic shift region of the genuine signature is checked, the detection performance is degraded by the reduced peak of the genuine signature. The preamble cannot be detected at all if ε≈1.0 in a line-of-sight (LOS) environment, because the genuine signature disappears completely and only the fake signatures remain on the delay profile output, as shown in Fig. 3(c). To overcome this problem, the proposed preamble detector uses the PDP combining method shown in Fig. 3(d). PDP combining is conducted as follows.
where n=0,...,N_ZC − 1 and v=0,...,63. After PDP combining for each v, the signature is searched for using a threshold value. Once a signature is found (the peak value is larger than the threshold), the v value is taken as the PID, and the round trip delay is measured from the peak location within the N_CS cyclic shift region.
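The combining and threshold steps described above can be sketched as follows. This is a simplified software model, not the hardware of Sect. 4: it assumes that the two fake-signature windows start at (S_v + d_u) mod N_ZC and (S_v − d_u) mod N_ZC, and the names `combine_pdp` and `detect` are hypothetical.

```python
def combine_pdp(pdp, s_v, d_u, n_cs):
    """Sum the genuine-signature window and its two assumed fake-signature windows."""
    n_zc = len(pdp)
    # Assumed window starts: genuine at S_v, fake at S_v + d_u and S_v - d_u (cyclic).
    starts = (s_v, (s_v + d_u) % n_zc, (s_v - d_u) % n_zc)
    return [sum(pdp[(s + n) % n_zc] for s in starts) for n in range(n_cs)]


def detect(pdp, s_v, d_u, n_cs, threshold):
    """Return the peak offset inside the N_CS window, or None if no peak exceeds threshold."""
    combined = combine_pdp(pdp, s_v, d_u, n_cs)
    peak = max(combined)
    if peak <= threshold:
        return None
    # The offset of the peak inside the window is the round trip delay estimate.
    return combined.index(peak)
```

In a synthetic profile where the genuine signature has vanished and only the two fake signatures remain (the situation of Fig. 3(c)), the genuine window alone stays below the threshold, while `detect` still returns the delay offset because the combined window contains the fake-signature energy.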
Hardware Design of Proposed PRACH Receiver
The block diagram of the proposed PRACH receiver is shown in Fig. 4. First, a time domain frequency shifter moves the PRACH frequency resources to the f=0 Hz position so that the anti-aliasing (low pass) filter can extract the PRACH resources (bandwidth=1.08 MHz) before decimation. Decimation is essential to avoid a huge 24576-pt FFT; with a decimation factor of M = 12, the FFT is reduced to 2048 points. After OFDM demodulation through the FFT, the received frequency domain preamble (R_u(k)) is multiplied by the conjugate of the original sequence (X*_u(k)), and the resulting sequence goes to the inverse FFT (IFFT) block to obtain the power delay profile (PDP). The PDP output data are fed to the proposed preamble detector as shown in Fig. 4, and the detailed block diagram of the proposed preamble detector is shown in Fig. 5.
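The effect of the decimation factor on the FFT size can be verified with a few lines of arithmetic. The sketch below uses the 30.72 MHz sample rate and the 1.25 kHz PRACH subcarrier spacing quoted elsewhere in this paper; the variable names are illustrative.

```python
FS_HZ = 30.72e6   # input sample rate (CLK in Sect. 5)
N_FULL = 24576    # FFT size without decimation
M = 12            # decimation factor
N_ZC = 839        # number of occupied PRACH subcarriers

n_fft = N_FULL // M               # FFT size after decimation
fs_dec = FS_HZ / M                # sample rate after decimation
bin_spacing = fs_dec / n_fft      # FFT bin spacing, equal to the PRACH subcarrier spacing
occupied_bw = N_ZC * bin_spacing  # bandwidth occupied by the ZC sequence
```

This gives a 2048-point FFT at 2.56 MHz with 1.25 kHz bins, and the 839 occupied bins span about 1.049 MHz, which fits inside the 1.08 MHz PRACH bandwidth passed by the anti-aliasing filter.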
In Fig. 5, the PDP data (IFFT output) are accompanied by their index n=0,1,2,...,N_FFT − 1. However, the PRACH related values (C_v, N_CS, d_u) are defined over the range i=0,1,2,...,N_ZC − 1. Therefore, the n index is converted to the i index using the following equation.
In the proposed method, the S_v values for each preamble are prepared before the PDP output data arrive. The S_v values represent the starting i index of the cyclic shift region on the PDP output that corresponds to each C_v value of the transmitter. The S_v values can be generated using (3), (4) and the following equation.
However, we propose a more efficient method to generate the S_v values. The proposed method uses simple arithmetic operations instead of the divider, multiplier and nested calculation of (4). It can be implemented with a simple finite-state machine (FSM) and is efficient for hardware implementation.
The S_v values are generated by the "Sequential Sv Gen" block of Fig. 5. The algorithm and a graphical example are shown in Fig. 6 and Fig. 7, respectively. In Fig. 6, variables are initialized differently for the unrestricted set and for the (5) and (6) cases of the restricted set. The S_v values are then generated sequentially.
For the unrestricted set case of Fig. 7(a), the cyclic shift (N_CS) regions of the preambles are located consecutively, except for the "unused" region between the first and last N_CS regions. The "unused" region is the region remaining after all 64 N_CS regions are assigned, or a short region that is not large enough for one N_CS region.
For the restricted set, we define a "segment" as a PDP region in which S_v values are located consecutively, as shown in Figs. 7(b), (c); the variable segLen is the length of a segment. Figure 7(b) shows an example of the restricted set (u=218, N_CS=38, d_u=127) that corresponds to case (5). Segments A and B are the regions for genuine signatures, and segments A+, A−, B+, B− are for the fake signatures of A and B caused by Doppler effect or frequency error. For case (5), we set the segment length to segLen=d_u=127. The first N_CS region, for PID=0 (S_{v=0}), is located at i=0, and the next N_CS region (S_{v=1}) is located consecutively to the left (cyclically), so that it appears at the rightmost position of the PDP output. In segment A, up to three N_CS regions (S_{v=0,1,2}) can be located because the A− region already occupies the next position. To avoid a conflict, the N_CS region for S_{v=3} is located at a distance of dist2Nxt=2·d_u from S_{v=2} so that it avoids the A− and B+ segment regions. Figure 7(c) shows the S_v locations for u=707, d_u=375, N_CS=26, which fulfills condition (6). In this case the segment length is set to segLen=N_ZC−2·d_u, and d_u is relatively larger than in case (b).
The variable endIdx is the lower boundary of valid S_v values on the PDP output, and dist2End is the minimum distance from S_v to endIdx for one more N_CS region to fit. ulimit and llimit are the upper and lower limits of a segment. After S_v generation ends, the variable nS_v contains the number of preambles contained in a PDP output.
There are 838 values of u and 16 values of N_CS. The S_v values are generated differently for each combination of u and N_CS. Therefore, the preamble detector should generate the S_v values whenever the u value is changed during preamble detection.
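For checking an implementation of the sequential generator, a direct reference model can be useful. The sketch below is not the FSM of Fig. 6. It evaluates the restricted-set cyclic shift formulas as given in 3GPP TS 36.211 (the divider-based calculation that the proposed method avoids) and then assumes that the signature of a preamble with shift C_v starts at S_v = (N_ZC − C_v) mod N_ZC. Both the formulas as transcribed here and that mapping are assumptions; the function names are illustrative.

```python
N_ZC = 839


def restricted_cyclic_shifts(u, n_cs, n_zc=N_ZC):
    """Cyclic shifts C_v of the restricted set for one root index (assumed TS 36.211 form)."""
    p = pow(u, -1, n_zc)  # multiplicative inverse of u mod N_ZC (Python 3.8+)
    d_u = p if p < n_zc / 2 else n_zc - p
    if n_cs <= d_u < n_zc / 3:                    # taken as condition (5)
        n_shift = d_u // n_cs
        d_start = 2 * d_u + n_shift * n_cs
        n_group = n_zc // d_start
        n_extra = max((n_zc - 2 * d_u - n_group * d_start) // n_cs, 0)
    elif n_zc / 3 <= d_u <= (n_zc - n_cs) / 2:    # taken as condition (6)
        n_shift = (n_zc - 2 * d_u) // n_cs
        d_start = n_zc - 2 * d_u + n_shift * n_cs
        n_group = d_u // d_start
        n_extra = min(max((d_u - n_group * d_start) // n_cs, 0), n_shift)
    else:
        return []                                 # root index not usable
    count = n_shift * n_group + n_extra
    return [d_start * (v // n_shift) + (v % n_shift) * n_cs for v in range(count)]


def signature_starts(u, n_cs, n_zc=N_ZC):
    """Assumed S_v positions on the PDP output: S_v = (N_ZC - C_v) mod N_ZC."""
    return [(n_zc - c) % n_zc for c in restricted_cyclic_shifts(u, n_cs, n_zc)]
```

For the example of Fig. 7(b) (u=218, N_CS=38), this model yields six preambles in two groups of three, with S_1 at the rightmost end of the PDP output and S_3 placed so that its region ends 2·d_u below the start of S_2, in line with the description above.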
The generated S_v values are saved to the "SvReg" block of Fig. 5. Only 18 registers (not 64) are needed to save the S_v values because the maximum value of nS_v is only 18 in the restricted set, as shown in Fig. 8. Figure 8 shows the nS_v distribution over all root indexes for each N_CS value of the restricted set. For example, for N_CS=15, a total of 796 root indexes are available (fulfilling (5) or (6)), and from 9 to 18 S_v values can be generated from a root index. Therefore, 18 registers are enough for the "SvReg" block to save all generated S_v values. For the unrestricted set, nS_v = N_ZC/N_CS and the maximum value of nS_v is 64. However, in the proposed method, only two S_v values (the first and the last) are saved to "SvReg" for the unrestricted set. From the first and last S_v values, the "unused" region is found, and the other N_CS regions can simply be found by counting the PDP output, as shown in Fig. 7(a).
The "SvReg" block outputs the S_v values as well as the starting positions of their fake signature regions, S_v^+ and S_v^-. All of these values are compared simultaneously with the i index value at the "Matching" block to find the starting position of an N_CS region. Therefore, the "Matching" block consists of a total of 3 · 18 = 54 constant comparators. Registers in the "SvReg" block are initialized with a value greater than N_ZC, and the generated S_v values are overwritten to the registers. Therefore, among the outputs of "SvReg", the S_v, S_v^+ and S_v^- values that were not overwritten will never be matched with i, because their initial value is always greater than i. The "Matching" block thus effectively compares only the generated and saved S_v, S_v^+ and S_v^- values.
If any one of the three S_v, S_v^+, S_v^- values of a v value is matched with i, the "Matching" block outputs the matched v value to the "opCode" block as v_matched, as shown in Fig. 7(c). The "opCode" block has 18 opcode registers that hold the "write", "accum" or "output" operating code for the combining operation, as follows.
opcode [v] = {"write", "accum", "output"} v = 0, 1, 2, ..., 17
All opcode[v] are initialized to "write", and whenever v_matched is asserted, the corresponding opcode (opcode[v_matched]) is changed sequentially to "accum" and then "output", as shown in Fig. 7 ('w'≡write, 'a'≡accum, 'o'≡output). The opcode[v_matched] goes to the PDP combining logic as "cmd" and controls the PDP combining operation. The PDP combining logic sums the N_CS region of the genuine signature and its two fake signature regions. When the first of the three N_CS regions (starting from S_v, S_v^+, S_v^-) arrives from the IFFT, "cmd" is set to "write" and the PDP data are written directly to the "combMem" block. For the second N_CS region, "cmd" is set to "accum", so the PDP data are accumulated with the "combMem" output (the first N_CS region) and written back to "combMem" in place. When the third N_CS region arrives, "cmd" is set to "output"; the third region is accumulated with the "combMem" output and fed to the "peakSearch" block. To accomplish this combining operation, the address (addr) of the "combMem" block is generated as follows.
where p is a count value synchronized to the n index with a range from 0 to N_CS,N_FFT − 1. With this address generation, the three N_CS regions that belong to the same v value are mapped to the same memory area. Therefore, the PDP combining operation can be accomplished instantaneously.
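The write/accum/output sequencing can be modelled in software as a small per-preamble state machine. The sketch below is a behavioural illustration only: it uses a dictionary keyed by v in place of the "combMem" address calculation, resets the opcode to "write" after each output (an assumption about behaviour between PDP outputs), and the name `combine_stream` is hypothetical.

```python
def combine_stream(regions, n_cs):
    """Combine three N_CS regions per v as they arrive.

    regions: iterable of (v, samples) pairs in arrival order, where samples holds
    the PDP values of one N_CS region. Yields (v, combined) after the third region.
    """
    next_op = {}   # per-v opcode; a missing entry means "write"
    comb_mem = {}  # per-v accumulation area, standing in for "combMem"
    for v, samples in regions:
        op = next_op.get(v, "write")
        if op == "write":
            # First region: store the PDP data directly.
            comb_mem[v] = list(samples[:n_cs])
            next_op[v] = "accum"
        elif op == "accum":
            # Second region: accumulate in place.
            comb_mem[v] = [a + b for a, b in zip(comb_mem[v], samples)]
            next_op[v] = "output"
        else:
            # Third region: accumulate and pass the sum on to the peak search.
            yield v, [a + b for a, b in zip(comb_mem.pop(v), samples)]
            next_op[v] = "write"
```

Because the regions of different v values may interleave in arrival order, keeping one state and one memory area per v lets each combined window be emitted as soon as its third region has passed, without re-reading the PDP output.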
The minimum memory depth required for "combMem" can be derived as follows.
The maximum value of N_CS·nS_v = 276 occurs at N_CS = 46 and nS_v = 6; therefore, the minimum required memory depth is depth_min = 674. For the unrestricted set, only two S_v values (the first, S_{v=0}, and the last, S_{v=nS_v−1}) are saved into the "SvReg" block. After the second match (S_{v=nS_v−1}), the "Matching" block counts the transitions of the i value and repeatedly generates v_matched at every N_CS transitions of the i value. In the unrestricted set, the opcode is always set to "output", PDP combining is not used, and the PDP data are fed through directly without accumulation.
The "peakSearch" block checks the N_CS regions when "cmd"="output" and finds a peak that is larger than the predefined threshold value, as shown in Fig. 5. If a valid peak is found, the round trip delay between the UE and the BS is measured from the peak position within the N_CS region. The PID value is decided by
where nPid is the total number of PIDs already processed before the current PDP output, as shown in Fig. 7(c). With the proposed method, the whole preamble detection procedure is completed immediately after the PDP output data arrive. Therefore, the proposed method minimizes the preamble detection latency of the PRACH receiver.
Implementation and Performance of PRACH Receiver
The proposed PRACH receiver is implemented in hardware to investigate its feasibility and hardware resource usage. The structure of the PRACH receiver is shown in Fig. 4. The implemented PRACH receiver consists of 4 pipelined stages, as shown in Fig. 4 and Fig. 9. Stage0 is executed only once for each PRACH subframe and consists of the time domain frequency shifter, the decimation filter and the FFT processor. A 12-bit ADC is used, and a 35-tap FIR filter is used as the decimation filter. The filter is implemented efficiently using the sub-expression elimination method [11]. The FFT/IFFT processor is a 2048-pt pipelined FFT/IFFT processor, so it can handle back-to-back continuous input data. As a result of stage0, the received frequency domain preamble sequence is stored in a buffer, as shown in Fig. 4.
In stage1, the next valid root index (u) is searched for before IFFT input data generation. Finding a valid root index means checking conditions (5) and (6) while changing the logical root index as defined in the LTE standard [1].
The IFFT input data generation in stage2 consists of frequency domain ZC sequence generation (length=N_ZC), buffer control to read out the received ZC sequence, zero-padding (zero length=N_FFT − N_ZC) and complex multiplication. In stage2, the frequency domain ZC sequence generation block is implemented efficiently using [9].
Stage3 consists of the pipelined IFFT processor and the preamble detector. The preamble detector detects preambles from the PDP data without any delay; therefore, it enables the IFFT processor to generate multiple PDP data back-to-back, which maximizes PDP generation throughput.
Stage1, 2 and 3 may be repeated multiple times to accomplish preamble detection for each PRACH subframe.
Fig. 9 The proposed PRACH receiver pipeline timing diagram.
To increase throughput and reduce the detection latency, the implemented PRACH receiver uses two different clock domains. Stage0 uses the sample rate clock CLK = 30.72 MHz, and stage1, 2 and 3 use a higher clock frequency, CLKX. The detection latency (N_latency) of the implemented PRACH receiver can be calculated as follows.
where N_latency is the number of clock cycles from the end of the PRACH input subframe to the end of preamble detection, as shown in Fig. 9, and nRoot denotes the total number of root indexes necessary to generate all 64 preambles. As the above equation shows, the latency depends on nRoot, and the worst-case nRoot value is 64, as shown in Fig. 10.
To compare detection latency, we consider a digital signal processor (DSP) based preamble detection method. In the DSP based method, we assume that the DSP controls a hardware accelerator that generates the PDP data of Fig. 4 and that the PDP data are stored in a buffer memory. The DSP then accesses the buffer memory and performs preamble detection. For simplicity, we assume that the DSP operates at a much higher clock frequency than the buffer memory and that the DSP internal execution time is negligible, so we take the buffer memory access time plus the PDP generation time as the preamble detection latency. The detection latency of the DSP based method can be calculated as follows.
Fig. 10 The nRoot distribution for each N_CS of restricted set.
where N_PDPcycle=3 for the restricted set, representing the access time of 3 samples for PDP combining, and N_PDPcycle=1 for the unrestricted set. The detection latencies of the proposed method and the DSP based method are calculated and shown in Fig. 11. For each N_CS value, the detection latency of the proposed method is approximately 2.7-3.8 times shorter than that of the DSP based method. The worst-case latency of the proposed method occurs when nRoot=64, where stage1, 2 and 3 are repeated for each root index. With a constant term of 4240 cycles and N_FFT=2048, the maximum latency is N_latency.max = 4240+2048·64 = 135312 cycles. A subframe in the LTE system has a duration of 1 ms and 30720 samples (T_S=1/30.72 MHz). Therefore, if stage1, 2 and 3 use a CLKX=136 MHz clock (≥135312/30720·30.72 MHz = 135.312 MHz), preamble detection can be completed within a subframe. The worst-case detection latency for several clock frequencies is summarized in Table 1. The hardware resource utilization of the implemented PRACH receiver is summarized in Table 2 and Table 3.
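The worst-case figures above can be reproduced with a short calculation. The sketch assumes, based on the expression N_latency.max = 4240+2048·64, that the latency is a constant term plus N_FFT cycles per root index; the function and parameter names are illustrative.

```python
def worst_case_latency_cycles(n_root, n_fixed=4240, n_fft=2048):
    """Latency in CLKX cycles: constant term plus one N_FFT-cycle PDP per root index."""
    return n_fixed + n_fft * n_root


def latency_ms(cycles, clkx_hz):
    """Convert a cycle count to milliseconds at the given CLKX frequency."""
    return cycles / clkx_hz * 1e3


def min_clkx_hz(cycles, budget_s=1e-3):
    """Lowest CLKX frequency that finishes the given cycles within the time budget."""
    return cycles / budget_s
```

For nRoot=64 this gives 135312 cycles, a minimum clock of about 135.3 MHz for a 1 ms budget, and a latency of roughly 0.995 ms at CLKX=136 MHz.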
Fig. 12 The miss detection ratio (MDR) simulation of implemented PRACH receiver using bit-accurate C-model.
In addition, we simulate the miss detection ratio (MDR) using a bit-accurate C-model of the implemented PRACH receiver; the results are shown in Fig. 12. In the MDR simulation, the following cases are defined as miss detection [12]: 1) detecting a different preamble, 2) not detecting the transmitted preamble, and 3) detecting the preamble with a timing estimation error larger than 1.04 μs in the AWGN channel or 2.08 μs in the ETU70 channel. The MDR simulation is performed on the AWGN channel and the ETU70 channel with several frequency error values. The ETU70 channel is a multi-path (9-path) fading channel with a maximum Doppler frequency of 70 Hz due to the UE's mobility [13]. First, consider the simulation results for the AWGN channel in Fig. 12. If there is no frequency error (f_e=0 Hz), the MDR performance of the restricted set and the unrestricted set is the same and PDP combining brings no benefit.
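The three miss detection criteria used in the MDR simulation can be written as a small classification function. This is a sketch under one interpretation of the criteria (any detected PID other than the transmitted one counts as case 1); the data layout and the name `is_miss` are hypothetical and not taken from the C-model.

```python
# Timing error limits in microseconds for each simulated channel.
TIMING_LIMIT_US = {"AWGN": 1.04, "ETU70": 2.08}


def is_miss(tx_pid, detected, channel):
    """Classify one trial as a miss detection.

    detected: dict mapping each detected PID to its timing estimation error in us.
    """
    if tx_pid not in detected:
        return True  # case 2: the transmitted preamble was not detected
    if any(pid != tx_pid for pid in detected):
        return True  # case 1: a different preamble was detected
    # case 3: detected, but the timing estimate is outside the allowed error
    return abs(detected[tx_pid]) > TIMING_LIMIT_US[channel]
```

The MDR is then the fraction of trials for which this function returns True at a given SNR and frequency error.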
For f_e=625 Hz (ε=0.5), both genuine and fake signatures appear on the PDP output, as shown in Fig. 3(b). Because the peak of the genuine signature is reduced by the frequency error (as mentioned in Sect. 3), the detection performance is degraded, as shown in Fig. 12. With PDP combining, however, detection performance improves dramatically and preambles can be detected robustly even with a high Doppler effect or frequency error.
If f_e=1340 Hz (ε=1.07), the genuine signature disappears and only the fake signatures exist (as shown in Fig. 3(c)). This phenomenon occurs when the frequency error is almost the same as the subcarrier spacing (f_e = Δf_sc = 1.25 kHz) or under a severe Doppler effect in a line-of-sight (LOS) environment. As shown in Fig. 12, the preamble cannot be detected without PDP combining.
For the multi-path fading channel (ETU70) with frequency error, PDP combining also helps improve preamble detection performance, as shown in Fig. 12.
The LTE standard requires a detection probability of P_d > 99% at the SNR levels listed in Table 4 [12], which gives the minimum SNR required to achieve 99% detection probability. These minimum SNR values are also drawn in Fig. 12. In Fig. 12, all MDR results of the implemented PRACH receiver with PDP combining fulfill these requirements in the AWGN channel and the ETU70 multi-path fading channel.
Conclusions
We propose a random access preamble detection method for the 3GPP LTE uplink and implement a PRACH receiver that incorporates the proposed method. The implemented PRACH receiver maximizes PDP generation throughput by generating PDPs back-to-back and reduces preamble detection latency by using the instantaneous preamble detection method. The proposed preamble detector detects all existing preambles directly and instantaneously from the IFFT output while conducting PDP combining. PDP combining is very effective for robust preamble detection when frequency error (or Doppler effect) exists. The implemented PRACH receiver has a worst-case detection latency of less than 1 ms with a 136 MHz clock. It can be implemented by occupying 30.2% of the SLICEs of an XC6VLX130T FPGA device or with 237k equivalent gates in 0.13 μm ASIC technology.
