Abstract-Rapidly acquiring the code phase of the spreading sequence in an ultra-wideband system is a very difficult problem. In this paper, we present a new iterative algorithm and its hardware architecture in detail. Our algorithm is based on running iterative message passing algorithms on a standard graphical model augmented with multiple redundant models. Simulation results show that our new algorithm operates at lower signal to noise ratio than earlier works using iterative message passing algorithms. We also demonstrate an efficient hardware architecture for implementing the new algorithm. Specifically, the redundant models can be combined together so that memory usage can be reduced substantially. Our prototype achieves a combination of performance and complexity not possible with traditional approaches.
I. INTRODUCTION
In an ultra-wideband (UWB) system, the source signal energy is spread over a bandwidth many times larger than its original bandwidth during transmission. As a result, the transmitted signal has a very low signal to noise ratio (SNR) and is completely buried in noise. Though this is a desirable property which minimizes interference to other users and makes it very difficult for an unintended receiver to detect and intercept the signal, it also presents the receiver designers with the very challenging problem of detecting and acquiring the signal at very low SNR.
Pseudo-random or pseudo noise (PN) sequences play an important role in a UWB system. They are periodic sequences with long period in practical systems. In a direct sequence ultra-wideband (DS/UWB) system, the transmitted signal is a train of very narrow pulses with polarities determined by the product of a PN binary sequence and the incoming binary source data sequence. For security reasons, it is often desirable to have PN sequences of very long period, so that to an unintended receiver over a short time interval, the sequences appear to be aperiodic and completely random [1] , [2] .
For a UWB receiver, the first step of demodulation is to de-spread the signal. In a DS/UWB system, this is achieved by multiplying the incoming samples by a local replica of the PN sequence. Therefore, the receiver must determine the unknown PN code phase embedded in the transmitted signal by analyzing the data collected from a short (compared to the PN code period) observation window so that it can synchronize the local replica. This is termed PN acquisition and will be the focus of this paper. Once the code phase is acquired, the receiver maintains the PN code synchronization through code tracking.
Traditionally, PN acquisition is achieved by searching explicitly over possible code phases. Reference signals corresponding to different code phases are correlated with the received signal and the one with the largest correlation is selected. For a DS/UWB system, the receiver estimates the arrival time of the pulses (i.e., the frame epoch), samples the incoming noisy signal and performs PN acquisition on these samples. The above process is repeated until acquisition is declared. Letting $T_f$ be the pulse repetition period (frame time) and $T_p$ be the pulse width, there are $T_f/T_p$ possible frame epochs. Typical values of $T_f/T_p$ range from 100 to 1000, thus the receiver has to perform up to 100-1000 PN acquisitions to locate the correct frame epoch. This number may be reduced if multi-path delay spread is exploited [3]. As explained in [4], [2], for a UWB system, the PN acquisition has to be completed quickly. If it is too slow, the correct code phase may never be acquired because the frame epoch may change due to timing drift before the receiver finishes evaluating the current frame epoch estimate. In this paper, we focus on fast PN acquisition; frame acquisition is not considered.
At one extreme of the traditional approaches to PN acquisition, all correlations are completed in one observation window. This is the full parallel search approach and it offers the best performance. At the other extreme, only one correlation is formed every observation window and acquisition is declared if a certain threshold is exceeded. This is the serial search approach. Hybrid search (i.e., correlating against only a subset of PN code phases in every observation window) is a compromise between the two extreme cases. Practically, parallel search is too expensive to implement for reasonably long sequences. As a result, serial and hybrid search are the only available options. As an example, the PN acquisition module of the UWB prototype in [5] was built using a hybrid approach to acquire a short PN sequence of period 128. However, for long PN sequences, serial search is often too slow and hybrid search, at best, provides only a linear tradeoff between speed and cost [6], [1].
Recently, iterative message passing algorithms (iMPAs) similar to the decoding algorithms for low density parity check codes (LDPC) and turbo codes were proposed in [4] , [2] for fast PN acquisition in both direct sequence spread spectrum (DS/SS) and DS/UWB systems. Similar approaches have also been proposed in [7] , [8] , [9] . Our exposition most closely follows that of [2] . These iterative algorithms offer the speed of parallel search and acquisition performance similar to that of serial search at short block lengths. Unlike parallel search, it is practical to implement the proposed algorithm in hardware to acquire PN sequences with long period. There are two drawbacks of the algorithm proposed in [4] , [2] . First, the algorithm converges slowly at low SNR. Second, the performance of the algorithm does not scale well with observation length. Specifically, doubling the observation window length lowers the operating SNR of the traditional approaches by 3 dB but only by 1-2 dB for the proposed algorithm.
In this paper, we present a new improved iterative message passing algorithm and its hardware architecture based on the algorithm proposed in [2]. Specifically, we introduce multiple redundant models to mitigate the aforementioned drawbacks of the algorithm in [4], [2]. The new algorithm converges faster and operates at lower SNR without increasing the hardware complexity. We will also demonstrate how to aggregate these multiple models into a single model to reduce memory usage. In our hardware prototype, the spreading sequence is of period $2^{22} - 1$. Rapidly acquiring such a long sequence is impractical by both serial and parallel search, but the logic design based on our architecture can easily fit into a small field programmable gate array (FPGA).
The remainder of this paper is organized as follows. In Section II, we introduce the theory of operation. We then proceed to discuss various architectures for the main components of our module in Section III. Section IV gives a detailed account of our hardware implementation as well as various techniques we used in the optimization. Section V concludes the paper and gives directions for future work.
II. THEORY OF OPERATION

A. Maximal-length Sequences
A maximal-length sequence or m-Sequence is a linear feedback shift register (LFSR) sequence which has the maximum possible period for an $r$-stage shift register [10]. An m-Sequence $x_k$ can be generated by an $r$-stage linear feedback shift register structure as shown in Fig. 1. When the registers are loaded with any non-zero values, the generated sequence will cycle through all $2^r - 1$ possible non-zero states before repeating (i.e., its period is $2^r - 1$). Mathematically, the sequence structure can be expressed as
$$x_k = g_1 x_{k-1} \oplus g_2 x_{k-2} \oplus \cdots \oplus g_r x_{k-r} \qquad (1)$$
where $g_0 = g_r = 1$, $g_k \in \{0, 1\}$ for $0 < k < r$ and $\oplus$ is the modulo-2 addition. The generator polynomial is
$$g(D) = g_0 + g_1 D + g_2 D^2 + \cdots + g_r D^r$$
where $D$ is the unit delay operator [10]. Given $r$, there is only a very limited set of $g_k$ values that generates an m-Sequence. Because of their excellent auto-correlation and cross-correlation properties, m-Sequences are widely used as spreading sequences in spread spectrum systems [10], [1].
Fig. 1. Linear feedback shift register (LFSR) structure for m-Sequence generation.
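As a minimal software sketch of the LFSR structure in Fig. 1, the following generator assumes the common recursion in which each new chip is the modulo-2 sum of earlier chips at the tapped delays. The function names `lfsr_sequence` and `period`, and the representation of the polynomial as a tuple of delays, are illustrative choices and not part of the prototype.

```python
def lfsr_sequence(taps, init, length):
    """Generate `length` chips with x_k = XOR of x_{k-i} over the delays i in `taps`.

    taps: delays i (i >= 1) whose coefficient g_i is 1, e.g. (1, 22) for D^22 + D + 1.
    init: the first r chips x_0..x_{r-1}, where r = max(taps); must not be all zero.
    """
    r = max(taps)
    if len(init) != r or not any(init):
        raise ValueError("init must hold r chips and must not be all zero")
    x = list(init)
    while len(x) < length:
        k = len(x)
        bit = 0
        for i in taps:
            bit ^= x[k - i]
        x.append(bit)
    return x[:length]


def period(taps, init):
    """Smallest P > 0 for which the r-chip register contents repeat after P steps."""
    r = max(taps)
    x = lfsr_sequence(taps, init, 2 ** r - 1 + r)
    start = tuple(x[:r])
    for p in range(1, 2 ** r):
        if tuple(x[p:p + r]) == start:
            return p
    return None


if __name__ == "__main__":
    # Toy example with r = 4 (taps at delays 1 and 4), small enough to inspect by hand.
    print(lfsr_sequence((1, 4), [1, 0, 0, 0], 15))
    print(period((1, 4), [1, 0, 0, 0]))
```

The toy polynomial with $r = 4$ is used only because its full period is short. The same function accepts `taps=(1, 22)` for the sequence considered later in the paper, although measuring that period by brute force in this way would be slow.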
B. Signal Model
For a DS/UWB system, a standard model for acquisition characterization is [2], [1]
$$z_k = \sqrt{E_c}\,(-1)^{x_k} + n_k \qquad (2)$$
where $z_k$, $0 \le k \le M-1$, is the noisy sample received by the acquisition module, $x_k$, $0 \le k \le M-1$, is the spreading m-Sequence, $E_c$ is the transmitted energy per pulse and $n_k$ is additive white Gaussian noise (AWGN) with variance $N_0/2$. We also assume that $x_k$ is generated by an $r$-stage LFSR and $r \ll M \ll 2^r - 1$. This is a much simplified model which assumes no data modulation and does not include the effect of jamming, oversampling, etc., but it is widely used in the literature to benchmark the performance of PN acquisition algorithms. The goal of the acquisition module is to estimate $x_k$ based on $z_k$, $0 \le k \le M-1$, for a given frame epoch estimate and decide whether the frame epoch estimate is correct. In our design, we obtain the estimate of $x_k$, denoted by $\hat{x}_k$, by running an iterative message passing algorithm. Because $\hat{x}_k$ has to be consistent with (1), once $r$ consecutive $\hat{x}_k$ are obtained, the rest of the sequence is determined by extrapolating the estimate by (1). As the last step, $z_k$ is correlated with $\hat{x}_k$, $0 \le k \le M-1$, to check whether the correlation threshold is reached.
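The signal model can be simulated with a few lines of code. The sketch below assumes an antipodal mapping in which chip value 0 is sent as $+\sqrt{E_c}$ and chip value 1 as $-\sqrt{E_c}$; this mapping, the function names and the use of a seeded `random.Random` instance are assumptions of the sketch and are not taken from the prototype.

```python
import math
import random


def n0_from_db(ec, ecn0_db):
    """Noise density N0 that gives the requested Ec/N0 (in dB) for pulse energy ec."""
    return ec / (10.0 ** (ecn0_db / 10.0))


def received_samples(x, ec, n0, rng):
    """Noisy samples: +sqrt(ec) for chip 0, -sqrt(ec) for chip 1, plus N(0, n0/2) noise."""
    sigma = math.sqrt(n0 / 2.0)
    amp = math.sqrt(ec)
    return [amp * (1.0 if b == 0 else -1.0) + rng.gauss(0.0, sigma) for b in x]


def hard_decisions(z):
    """Chip-by-chip decisions that ignore the code structure: 1 if z_k < 0, else 0."""
    return [1 if v < 0 else 0 for v in z]


if __name__ == "__main__":
    rng = random.Random(1)
    chips = [0, 1, 1, 0, 1, 0, 0, 1]
    z = received_samples(chips, ec=1.0, n0=n0_from_db(1.0, -10.0), rng=rng)
    errors = sum(a != b for a, b in zip(chips, hard_decisions(z)))
    print(errors, "chip errors out of", len(chips))
```

At the low $E_c/N_0$ values of interest, chip-by-chip decisions of this kind are unreliable, which is why the estimate $\hat{x}_k$ has to exploit the LFSR structure.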
C. Iterative Message Passing Algorithm (iMPA) for Fast PN Acquisition
In traditional PN acquisition approaches, the received sequence $z_k$ is correlated with up to $2^r - 1$ PN sequences generated by different $x_0, x_1, \ldots, x_{r-1}$ combinations for the whole observation window and the algorithm chooses the phase corresponding to the highest correlation. In terms of computation complexity, the main difference between parallel search and serial/hybrid search is whether all correlations are completed every observation window. Since each correlation requires $M - 1$ additions, the computation complexity is $O(M \cdot 2^r)$ for all of these traditional approaches. For parallel search, all valid configurations of $x_k$ are correlated; we can therefore interpret it as maximum likelihood (ML) decoding of $x_k$ from (2), and serial/hybrid search as approximations to ML decoding. Based on this observation, we formulate the PN acquisition problem as a decoding problem and apply an iterative message passing algorithm similar to turbo code [11], [12] or LDPC decoding [13], [14], [15].
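The cost of exhaustive correlation can be made concrete with a toy search over every non-zero initial register state. The sketch below is illustrative only: the names `lfsr_sequence` and `full_search`, the toy taps and the antipodal chip mapping (0 as $+1$, 1 as $-1$) are assumptions, and the addition count simply follows the $M - 1$ additions per correlation stated above.

```python
from itertools import product


def lfsr_sequence(taps, init, length):
    """x_k = XOR of x_{k-i} over the delays i in `taps`, seeded with `init`."""
    x = list(init)
    while len(x) < length:
        k = len(x)
        bit = 0
        for i in taps:
            bit ^= x[k - i]
        x.append(bit)
    return x[:length]


def full_search(z, taps):
    """Correlate z with the sequence of every non-zero seed; return (seed, corr, additions)."""
    r = max(taps)
    m = len(z)
    best_seed, best_corr = None, float("-inf")
    additions = 0
    for seed in product((0, 1), repeat=r):
        if not any(seed):
            continue  # the all-zero state never occurs in an m-Sequence
        x = lfsr_sequence(taps, seed, m)
        corr = sum(v if b == 0 else -v for v, b in zip(z, x))
        additions += m - 1
        if corr > best_corr:
            best_seed, best_corr = seed, corr
    return best_seed, best_corr, additions


if __name__ == "__main__":
    taps = (1, 4)  # toy example with r = 4
    truth = lfsr_sequence(taps, (1, 0, 1, 1), 12)
    z = [1.0 if b == 0 else -1.0 for b in truth]  # noise-free samples
    print(full_search(z, taps))
```

For $r = 22$ and $M = 1024$, the same loop would perform $(2^{22} - 1) \cdot 1023 \approx 4.3 \times 10^9$ additions per observation window, which is the scaling that motivates the decoding formulation.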
Since the inception of turbo codes, iterative message passing algorithms have been widely studied. They can be easily derived by constructing the corresponding graphical models for the system and applying a standard set of rules. It is well understood that if the graphical model has no cycles, the algorithm is equivalent to maximum likelihood decoding. Otherwise, the algorithm is heuristic and sub-optimal [16] , [17] , [18] , [19] , [20] , [21] . However, it is often a good approximation to maximum likelihood decoding and offers near-optimal performance as in the case of turbo code and LDPC decoding.
In practical applications, cyclic graphical models are chosen for low complexity decoding. The graphical models chosen have a significant impact on the performance of the algorithm.
Heuristically, a good model should have neither short nor regular cycles [16], [22], [23]. Several graphs corresponding to the same generator polynomial are shown in [2, Fig. 2] and each implies a different decoding algorithm. For a binary variable $X$, the message passed (i.e., soft information) in a cyclic graph is an approximation of the negative log-likelihood ratio $-\log \frac{\Pr(X=1)}{\Pr(X=0)}$ [21]. In our case, in each iteration, the algorithm successively updates messages and decisions are made by comparing a decision message $M_{dec}$ to 0, where $M_{dec}$ is an approximation of the negative log-likelihood ratio of $x_k$.
The absolute value of $M_{dec}$ can be interpreted as the confidence of the decision. If the algorithm converges, $M_{dec}$ will stabilize after a certain number of iterations, indicating some level of confidence in the decisions.
A detailed discussion of iterative message passing algorithms is beyond the scope of this paper. In the remaining sections, we consider acquiring the m-Sequence with generator polynomial $g(D) = D^{22} + D^1 + D^0$ and only the details relevant to our example are presented. Interested readers can refer to [21], [24], [16], [11], [17], [25] for further details.
Similar to the polynomial considered in [2], our polynomial can also be represented by several graphical models, two of which are shown in Fig. 2(a) and Fig. 2(b). Decoding algorithms based on both models offer performance similar to that in [2] and suffer from the same problems. The slow convergence experienced by the algorithms is similar to that of LDPC decoding and can be attributed to the weak constraints and the flooding activation schedule. The SNR scaling problem is attributed to the existence of regular cycle structures in the graphs in [2]. Qualitatively speaking, this is a "bad" graphical model on which to apply the standard iterative message passing algorithm.
The problem is tackled in [2] by inverting the signs of the set of messages corresponding to the least reliable decisions and rerunning the algorithm if acquisition fails. This approach does improve sensitivity, but it still requires many iterations. This motivates us to find a better graphical model to which we can apply the standard iterative message passing algorithm and which is more amenable to hardware implementation.
D. Graphical Models with Redundancy
To improve the performance of the iMPA, we introduce a new decoding graph constructed using multiple graphical models, each of which fully captures the PN code structure. In this sense, the model has redundancy. This is equivalent to adding redundant parity checks to the standard parity check matrix. The technique is also applied in soft decoding of some of the classical codes [26], [27], [28]. Fig. 3 shows the special case of using two models. Each of the subgraphs is based on a different generator polynomial for the same m-Sequence. Mathematically, we introduce different non-primitive polynomials to generate the same sequence. For example, let $x_k$ be the sequence generated by
we have the following equations:
Adding (3), (4) and (5) together, we have
also generates the same sequence. The argument can be easily extended to show that th order auxiliary models as the $n$th order model. Our decoding graph for an $n$th order model is formed by constraining the outputs of the primary model and each of the $i$th order auxiliary models, $1 \le i \le n$, to be equal. As an example, the graph of the 2nd order model is shown in Fig. 3. The performance improvement obtained by combining multiple models is shown in Fig. 4. Even though each individual auxiliary model produces very unreliable decoding decisions, combining them improves the convergence behaviour dramatically. We gain around 1 dB for each additional auxiliary model introduced. Only 10 iterations are required for practical convergence for a 5th order model. Our multiple model algorithm also works for other m-Sequences. As a comparison, Fig. 5 shows the performance for
where the curve for algorithm in [2] is also included.
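The claim that a second, non-primitive polynomial generates the same sequence can be checked numerically. The sketch below assumes that the auxiliary checks are obtained by repeatedly squaring the primary polynomial over GF(2), which turns $D^{22} + D + 1$ into $D^{22 \cdot 2^i} + D^{2^i} + 1$; this is consistent with adding the primary parity check at indices $k$, $k-1$ and $k-22$, but the exact form of the auxiliary polynomials is an assumption of the sketch, as are the helper names.

```python
def lfsr_sequence(taps, init, length):
    """x_k = XOR of x_{k-i} over the delays i in `taps`, seeded with `init`."""
    x = list(init)
    while len(x) < length:
        k = len(x)
        bit = 0
        for i in taps:
            bit ^= x[k - i]
        x.append(bit)
    return x[:length]


def satisfies_check(x, lags):
    """True if the XOR of x[k - d] over d in `lags` is 0 at every valid index k."""
    top = max(lags)
    for k in range(top, len(x)):
        acc = 0
        for d in lags:
            acc ^= x[k - d]
        if acc:
            return False
    return True


def auxiliary_lags(i, r=22):
    """Lags of the assumed i-th check, i.e. the primary polynomial squared i times."""
    return (0, 2 ** i, r * 2 ** i)


if __name__ == "__main__":
    x = lfsr_sequence((1, 22), [1, 0, 1] + [0] * 19, 800)
    for i in range(5):
        print(auxiliary_lags(i), satisfies_check(x, auxiliary_lags(i)))
```

Each of these checks constrains the same chips through a different set of lags, which is the sense in which the combined graph carries redundant parity checks.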
Our baseline algorithm is summarized in Algorithm 1. The complexity of both the decoding and correlation operations is of O(M ), therefore our algorithm is also of O(M ) complexity. There is substantial complexity reduction compared to the traditional approaches. Also, our new algorithm offers better performance with no additional complexity compared to the approach in [2] since we reduce the number of iterations dramatically.
III. HIGH LEVEL DESIGN FOR THE ITERATIVE DECODER
In this section, we present the hardware architecture of the basic building blocks in our iMPA algorithm. Assuming an $n$th order model is used, the pulses are decoded by $n$ different models during each iteration. The hardware module that performs the iterative message passing algorithm for each auxiliary model is an iterative decoder.
As a reference, the acquisition performance of the algorithm in [2] is marked as "[2], 100 iter.".
Algorithm 1 (excerpt): run the iterative message passing algorithm to get $M_{dec}$; the algorithm can be based on different graphical models such as Fig. 3 and Fig. 10(c).
A. Forward Backward Algorithm Based Iterative Decoder
The basic building block in our algorithm is an iterative decoder that decodes the sequence generated by $g(D) = D^{22} + D^1 + D^0$. We have two hardware architecture candidates: one based on Fig. 2(a) and another based on Fig. 2(b). Simulation shows that both architectures perform similarly (the difference in sensitivity is less than 0.3 dB).
If we choose the Tanner graph representation as shown in Fig. 2(a) , the number of messages needed to be saved in each iteration is the number of edges in the graph. Therefore, the minimum storage requirement is 3M messages.
Alternatively, if we base our decoder on a Tanner-Wiberg graph [16] with hidden variables introduced as shown in Fig. 2(b), we have a more memory efficient hardware architecture. Equations similar to [2, (23)] to [2, (29)] can readily be obtained from this graph. This graph is an explicit index diagram [25], [21]. For readers not familiar with Tanner-Wiberg graphs, the update equations may be more easily explained by considering Fig. 6(a), which decomposes the sequence generating LFSR structure into three parts: one 2-state $g(D) = D + 1$ finite state machine (FSM), one delay block ($D^{21}$) and one broadcaster (i.e., an equality constraint). Applying the standard iterative message passing rules [25], [21], we derive the decoding graph (Fig. 6(b)) by replacing each component by a soft-in soft-out (SISO) module which performs a-posteriori probability (APP) decoding [29].
The relationship between Fig. 6(b) and Fig. 2(b) is that Fig.  2(b) is an explicit index diagram and Fig. 6(b) is an implicit index diagram where the time index is hidden in the graphical representation [21] , [25] . The associated iterative processing is the same in both cases.
Let MI[i] and MO[i] be the input and output messages with ports defined in Fig. 6(b). The update equations for each iteration are:
From the above equations, we can see that the FSM SISO requires two types of memory. The first type is for storing the $2M$ messages passed between the $g(D) = D + 1$ SISO and the broadcaster SISO. Their values are updated based on the results from the previous iteration. The second type is for storing the FSM state metrics $F_k$ and $B_k$, which are recalculated during every iteration. In other words, the FSM state metric memory can be reused once operations in the current iteration are finished. Therefore, we do not need to store $B_k$ if MO[·] are updated immediately once both $F_k$ and $B_{k+1}$ become available. In Section IV, we will show that the state metric memory can be reduced substantially by updating the state metrics segment by segment to reuse the memory within the current iteration. If the segment size is $M/8$, the total memory requirement becomes $M/8$ state metrics + $2M$ messages, which is substantially less than the $3M$ messages required by the architecture based on Fig. 2(a). For low data rate applications, the transistor count for our circuit is dominated by memory instead of logic. Therefore, the architecture shown in Fig. 6(b) is preferred because of its lower memory usage.
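As a worked instance of this comparison, assuming the observation window $M = 1024$ used for the prototype in Section IV and counting entries rather than bits:

```latex
\begin{align*}
\text{Fig. 2(a):}\quad & 3M = 3 \times 1024 = 3072 \ \text{messages},\\
\text{Fig. 6(b):}\quad & 2M + \tfrac{M}{8} = 2048 \ \text{messages} + 128 \ \text{state metrics}.
\end{align*}
```

The comparison in bits also depends on the widths chosen for messages and state metrics, which are discussed separately.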
In Fig. 6(b), we show one type of activation schedule: the 2-state FSM SISO completes the message update and sends the messages to the broadcaster, then the broadcaster updates and returns the messages. This completes one iteration.
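The forward backward recursion of a 2-state FSM SISO can be sketched in software. The code below is a generic min-sum formulation under stated assumptions, not the update equations of the prototype: messages are treated as the cost of a bit being 1 relative to 0 (the negative log-likelihood ratio convention), the FSM is modelled as $s_k = s_{k-1} \oplus u_k$ with an unknown initial state, and the names `fba_siso_accumulator`, `mi_u`, `mi_s`, `mo_u` and `mo_s` are hypothetical. Only a rolling backward metric is kept, in line with the observation that $B_k$ need not be stored when the outputs are updated immediately.

```python
def fba_siso_accumulator(mi_u, mi_s):
    """Min-sum forward-backward SISO for the 2-state FSM s_k = s_{k-1} XOR u_k.

    mi_u[k], mi_s[k]: input messages for u_k and s_k (cost of 1 minus cost of 0).
    Returns (mo_u, mo_s): extrinsic output messages with the same convention.
    """
    m = len(mi_u)
    # fwd[k] holds the forward metrics of s_{k-1}; fwd[0] is the unknown initial state.
    fwd = [[0.0, 0.0]]
    for k in range(m):
        prev = fwd[-1]
        cur = []
        for t in (0, 1):
            best = min(prev[s] + mi_u[k] * (s ^ t) for s in (0, 1))
            cur.append(best + mi_s[k] * t)
        fwd.append(cur)

    mo_u = [0.0] * m
    mo_s = [0.0] * m
    bwd = [0.0, 0.0]  # backward metrics of s_k, starting at k = m - 1
    for k in range(m - 1, -1, -1):
        prev = fwd[k]
        # Extrinsic cost of u_k = u: leave out mi_u[k] itself.
        cost_u = [min(prev[s] + mi_s[k] * (s ^ u) + bwd[s ^ u] for s in (0, 1))
                  for u in (0, 1)]
        # Extrinsic cost of s_k = t: leave out mi_s[k] itself.
        cost_s = [min(prev[s] + mi_u[k] * (s ^ t) for s in (0, 1)) + bwd[t]
                  for t in (0, 1)]
        mo_u[k] = cost_u[1] - cost_u[0]
        mo_s[k] = cost_s[1] - cost_s[0]
        # Step the backward recursion from s_k to s_{k-1}.
        bwd = [min(mi_u[k] * (s ^ t) + mi_s[k] * t + bwd[t] for t in (0, 1))
               for s in (0, 1)]
    return mo_u, mo_s


if __name__ == "__main__":
    print(fba_siso_accumulator([0.5, -1.25, 2.0], [-1.0, 0.25, 1.5]))
```

Because the trellis of a single FSM has no cycles, these outputs are exact min-sum marginals for the FSM in isolation; the approximation in the overall decoder comes from iterating between this module and the broadcaster on a cyclic graph.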
B. Forming an n th Order Decoder
Once we have all the auxiliary model decoders ready, forming an n th order model decoder is straightforward. We only need to form an additional broadcaster (equality) constraint and the decoding architecture follows directly by applying the standard iterative message passing rules as shown in Fig. 7 .
If we only consider a 2nd order model and choose the SISO structure to be of the type in Fig. 2(a), then Fig. 7 is equivalent to Fig. 3. The memory requirement equals $6M$ messages, which is the sum of the memory requirements of the individual SISOs.
C. Simplification of an Auxiliary Model Decoder Using Index Partitioning
An advantage of using auxiliary models as defined in (6) is that all auxiliary decoders can be constructed using the primary model
decoder. This is achieved by index partitioning on the output of the higher order model.
IV. HARDWARE ARCHITECTURE
In this section, we consider the case of decoding the PN sequence $g(D) = D^{22} + D^1 + D^0$ over an observation window of $M = 1024$ using the 2nd order model architecture. The block diagram of our acquisition module is shown in Fig. 9.
A. 4-State FSM Decoder
As shown in Fig. 9 , instead of using 
, we combine the two models together using a single 4-state FSM as shown in Fig. 10(a) . The new FSM captures all the information of the original FSMs and lowers the memory requirements from 4M messages plus state metrics to approximately 3M messages plus state metrics as demonstrated below. Moreover, by using a single FSM, we save routing resources by lowering the bandwidth requirement for the channel metrics (M ch [k] = z k ) memory since it is now accessed only by one FSM-SISO instead of three FSM SISOs. Using the 4-state FSM does require more logic in the FSM SISO implementation, but this increase is justified by the the additional savings in memory and routing.
TABLE I: STATE TRANSITION TABLE OF THE 4-STATE FSM
$S_{k-1}$ | $S_k$ | $x_k$ | $x_k \oplus x_{k-1}$ | $x_k \oplus x_{k-2}$
00 (0) | 00 (0) | 0 | 0 | 0
00 (0) | 01 (1) | 1 | 1 | 1
01 (1) | 10 (2) | 0 | 1 | 0
01 (1) | 11 (3) | 1 | 0 | 1
10 (2) | 00 (0) | 0 | 0 | 1
10 (2) | 01 (1) | 1 | 1 | 0
11 (3) | 10 (2) | 0 | 1 | 1
11 (3) | 11 (3) | 1 | 0 | 0
Our 4-state FSM decoder is also based on the forward backward algorithm. We define the state as $S_k = \{x_{k-1}, x_k\}$ and the corresponding decoder is shown in Fig. 10(b). Again, this is an implicit index diagram. The explicit index diagram (i.e., the Tanner-Wiberg graph) is shown in Fig. 10(c). The state transition table is shown in Table I and the messages passed are shown in detail in Fig. 10(d).
The update equations are obtained by applying the standard message passing rules [21] on either Fig. 10(b) or Fig. 10(c) . They are listed from (14) to (30) in the appendix.
Simulation results show that the 4-state FSM decoder implementation improves the performance by 0.2 dB in $E_c/N_0$ as compared to the three 2-state FSM implementation.
We can continue to combine multiple auxiliary models to form a single FSM. For example, we can implement a 3 rd order model using a 16-state FSM. However, the exponential growth in state metric memory may outweigh any savings in the message memory for larger n.
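The state transition table of the combined FSM in Table I can be generated mechanically from the state definition $S_k = \{x_{k-1}, x_k\}$. The sketch below assumes that the three bits attached to each transition are the new chip $x_k$, the sum $x_k \oplus x_{k-1}$ and the sum $x_k \oplus x_{k-2}$; this labelling is an interpretation used for illustration, and the function name is hypothetical.

```python
def four_state_transitions():
    """Rows (state, next_state, x_k, x_k ^ x_{k-1}, x_k ^ x_{k-2}).

    A state is encoded as 2 * x_{k-2} + x_{k-1} before the transition and as
    2 * x_{k-1} + x_k after it.
    """
    rows = []
    for state in range(4):
        x_km2, x_km1 = state >> 1, state & 1
        for x_k in (0, 1):
            nxt = (x_km1 << 1) | x_k
            rows.append((state, nxt, x_k, x_k ^ x_km1, x_k ^ x_km2))
    return rows


if __name__ == "__main__":
    for row in four_state_transitions():
        print(row)
```

Extending the same construction to a state that remembers four past chips gives 16 states, which illustrates how quickly the state metric memory grows as further models are merged into one FSM.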
B. Forward Backward Algorithm Architecture for Multiple Index Segments
To reduce the internal FSM state metric memory, we divide the observation window into multiple segments and run the forward backward algorithm (FBA) segment by segment. This is a standard approach for implementing the Viterbi and turbo decoders [30] , [31] .
In our prototype system, we divide the observation window (1024) into 8 segments of 128 indices each.
Fig. 10(b). Corresponding iterative decoder for the 4-state FSM encoder. The circled number is the activation order. This is an implicit index diagram.
Fig. 10(c). Tanner-Wiberg graph for the 4-state FSM, showing the 4-state trellis constraint, the hidden variable $s_k = \{x_{k-1}, x_k\}$ and the equality constraint (variable node). This is an explicit index diagram.
The problem of initializing the backward recursion at a segment boundary is solved in [30] and [31] by running the backward unit for an additional "warm-up" period. The approach is motivated by the fact that the backward state metric at the segment boundary can be well approximated by starting a backward state recursion just several constraint lengths away. Excluding the warm-up (i.e., setting $B_{128}[i] = 0$) incurs a loss of around 0.25 dB in $E_c/N_0$. To run a design using the warm-up approach at full speed, an additional backward unit is required so that one unit warms up while the other is doing the update [30], [31]. The additional unit can be saved if we do not use the warm-up approach but instead copy the $B_{128}[i]$ values from the previous iteration. This is feasible because the warm-up period is only required if we are trying to approximate an FBA-SISO in isolation. For an iterative system, starting the backward recursions based on earlier iteration values is equivalent to a change in the activation schedule for the iMPA on the cyclic graph, and as such does not significantly affect the performance. This is a known architecture for implementing iterative decoders with forward-backward based SISO decoders (e.g., see [32], [33], [34]). Once both the forward and backward state metrics become available, $LI0_k$, $LI1_k$, $RI_k$ and $M_{dec}[k]$ are computed and the FSM state metric memory is released immediately. The processing pipeline is shown in Fig. 11, which shows the update sequence as well as the corresponding memory access.
C. Bit Width
The bit widths in our system are determined by simulations in two steps. First, we fix $LI0_k$, $LI1_k$ and $RI_k$ to be 16 bits wide and determine that 4 bits of ADC output is sufficient. Compared to floating point, there is a performance loss of only 0.2 dB.
The performance for various bit width combinations is shown in Fig. 12. For each ADC bit width, we have optimized the scale $q$ that sets the ADC dynamic range ($ADC_{out} = \mathrm{quantize}(q \cdot z_k)$) for performance. For a 4-bit ADC, $q_{opt}$ is found to be 1.65 by simulation. As a reference, we also show the performance for the standard mid-point loading $q = 3.5$ when the ADC is of 4 bits.
The second step is to determine the bit width for the messages LI 0, LI 1 and RI. This is necessary since their values may grow as the decoder iterates. To avoid using excessive bits for storage, we have to clip them after each (FBA/=) SISO activation. As shown in Fig. 12 , 5 bits are sufficient for our application when the ADC bit width is 4.
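The two fixed-point operations described here, scaling and quantizing the ADC input and clipping the messages after each SISO activation, can be sketched as follows. The rounding rule and the signed saturation ranges are assumptions of this sketch, since the exact quantizer of the prototype is not specified here; the function names are hypothetical.

```python
def quantize_adc(z, q, bits=4):
    """Scale by q, round to the nearest integer and saturate to a signed `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, int(round(q * z))))


def clip_message(value, bits=5):
    """Saturate an integer message to a signed `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


if __name__ == "__main__":
    print([quantize_adc(z, 1.65) for z in (-10.0, -1.0, 0.2, 1.0, 10.0)])
    print([clip_message(v) for v in (-40, -3, 3, 20)])
```

With a 4-bit ADC the quantized channel metric lies in $[-8, 7]$, and with 5-bit messages each stored message lies in $[-16, 15]$, which is what bounds the growth of the state metrics.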
To determine the bit width for the state metric, we rely on the fact that for a given $k$, we are only interested in the differences between the $F_k[i]$, $0 \le i \le 3$, not their values. Therefore, we only need the bit width to be big enough for the differences. If we subtract a common reference from each $F_k[i]$, the differences (i.e., the normalized $F_k[i]$) can be shown to be bounded between -128 and 127 for 5-bit messages by an argument similar to the ones in [35] and [21]. As a result, they can be represented by 8 bits. This additional 1-bit approach is commonly used in Viterbi decoders and is proven to be correct with two's complement arithmetic [35], [21].
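One way to exploit bounded metric differences is the wrap-around (modulo) arithmetic familiar from Viterbi decoders, in which metrics are allowed to overflow and only their differences are interpreted. The sketch below assumes that this is the kind of scheme being referred to; the helper names `wrap` and `metric_diff` are hypothetical.

```python
def wrap(value, bits=8):
    """Reduce an integer to a signed two's complement value with `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def metric_diff(a_wrapped, b_wrapped, bits=8):
    """Recover a - b from wrapped metrics; valid when |a - b| < 2 ** (bits - 1)."""
    return wrap(a_wrapped - b_wrapped, bits)


if __name__ == "__main__":
    a, b = 1000, 950  # unwrapped metrics that no longer fit in 8 bits
    print(wrap(a), wrap(b), metric_diff(wrap(a), wrap(b)))
```

As long as the true difference stays within the signed range of the chosen width, comparisons made on the wrapped values agree with comparisons on the unbounded metrics, so no explicit rescaling step is needed.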
D. Partitioning the Memory into Banks
In our prototype design, we have several modules concurrently accessing memory. By carefully partitioning the memory, contention can be avoided without the use of multiport memory. The access pipeline for the above memories is shown in Fig. 11 .
For the message memories (LI 0, LI 1 and RI), we divide them into two banks of 512 entries. One bank is for the odd FBA segment and the other for the even FBA segment. By this arrangement, there are at most 2 concurrent accesses and we can implement LI 0, LI 1 and RI using 2-port memories.
For the FSM state metric, the forward unit writes to the memory while both the backward unit and the LI0, LI1 and RI update unit read the same data from the memory. As a result, we only need a single bank of 2-port memories.
The channel metric memory is divided into two banks, each comprising 1024 entries. The ADC and the acquisition module always work on different banks. By subdividing each bank into two sub-banks, one for the FBA odd segment and the other for the FBA even segment, there are at most two simultaneous accesses to the same segment. Therefore, the channel metrics can also be stored using 2-port memories. In order to reuse the state metric memory once the backward metric is computed, the FSM state metrics are stored in the physical memory in reverse order for even segments. For example, the even segment $F_{128}$ to $F_{255}$ is stored in the state metric memory [127:0] while the odd segment $F_0$ to $F_{127}$ is stored in the state metric memory [0:127]. The details are also shown in Fig. 11.
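The reversed storage order can be expressed as a simple address mapping. The sketch below assumes a segment length of 128, as in the $F_0$ to $F_{127}$ and $F_{128}$ to $F_{255}$ example, and uses a 0-based segment index, so the segment called "odd" in the text is index 0 and the one called "even" is index 1; the function name is hypothetical.

```python
SEG = 128  # segment length taken from the F_0..F_127 / F_128..F_255 example


def state_metric_address(k, seg=SEG):
    """Physical address of F_k: ascending in segments 0, 2, ...; descending in 1, 3, ...."""
    segment, offset = divmod(k, seg)
    return offset if segment % 2 == 0 else seg - 1 - offset


if __name__ == "__main__":
    for k in (0, 1, 127, 128, 129, 255, 256):
        print(k, state_metric_address(k))
```

One reading of this layout is that the backward pass over the first segment releases addresses 127 down to 0 in the same order in which the forward pass over the next segment needs to write them, so a single segment-sized memory can serve both.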
For design simplicity, we used 2-port memories in our prototype implementation since they are available at no extra cost in our target device (Xilinx Virtex II FPGA). However, the design can be easily ported to an architecture that uses only single port memories by doubling the bus width and time division multiplexing the access.
E. Verification Unit
Our verification unit, shown in Fig. 13, consists of two parts, a PN sequence extrapolation unit and a correlator unit. The extrapolation unit extends the 22-bit PN estimate it receives to the whole observation window. The correlator unit then correlates this sequence with the channel metric. To improve efficiency, the correlator output is checked every $M/4$ pulses and it must exceed the check point threshold before continuing. If the final correlation value exceeds the final threshold, acquisition is declared. The final threshold is chosen to be $0.65 \cdot q \cdot 1024$, which was found by simulation. This yields good acquisition performance as shown in Fig. 12 and the frequency of false alarm is 0 in 5000 trials when the signal is absent.
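The behaviour of the verification unit can be sketched in software as an extrapolation followed by a correlation with early termination. In the sketch below the check point thresholds are hypothetical placeholders, because only the final threshold is given in the text; the function names, the toy taps and the chip mapping (0 as $+1$, 1 as $-1$) are likewise assumptions.

```python
def extrapolate(seed, taps, length):
    """Extend an r-chip estimate to `length` chips using the LFSR recursion."""
    x = list(seed)
    while len(x) < length:
        k = len(x)
        bit = 0
        for i in taps:
            bit ^= x[k - i]
        x.append(bit)
    return x[:length]


def verify(z, seed, taps, checkpoint_thresholds, final_threshold):
    """Return (acquired, correlation), stopping early at a failed check point.

    checkpoint_thresholds: three hypothetical thresholds applied after each of
    the first three quarters of the observation window.
    """
    m = len(z)
    if m == 0 or m % 4 != 0:
        raise ValueError("window length must be a positive multiple of 4")
    x = extrapolate(seed, taps, m)
    quarter = m // 4
    corr = 0.0
    for k in range(m):
        corr += z[k] if x[k] == 0 else -z[k]
        done = k + 1
        if done % quarter == 0 and done < m:
            if corr <= checkpoint_thresholds[done // quarter - 1]:
                return False, corr  # abandon this candidate early
    return corr > final_threshold, corr


if __name__ == "__main__":
    taps = (1, 4)  # toy example; the prototype uses a 22-stage LFSR
    truth = extrapolate([1, 0, 0, 1], taps, 16)
    z = [1.0 if b == 0 else -1.0 for b in truth]
    print(verify(z, [1, 0, 0, 1], taps, [2.0, 4.0, 6.0], 0.65 * 16))
    print(verify(z, [0, 1, 0, 0], taps, [2.0, 4.0, 6.0], 0.65 * 16))
```

A wrong estimate is usually rejected at the first check point, so most failed candidates cost only a quarter of a full correlation.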
F. Hardware Implementation
We implemented the architecture using Verilog HDL. The code is synthesized by Synplicity, then mapped by Xilinx Foundation to a Xilinx Virtex 2 device (XC2v250-6). The number of bits implemented in block RAM is 28160, the number of 4-input LUTs used is 1621 and the number of slices used is 1039. The design can run at 73 MHz. These figures show that memory is the main component of the circuit and justify our decision to trade off logic for memory reduction.
Our baseline design can decode $f_{clk}/15$
pulses per second. Assuming a 60 MHz clock, our prototype generates a PN code phase decision every $\frac{15}{60\,\text{MHz}} \cdot 1024 = 256$ µs. The decode process has to be repeated for each frame epoch estimate until the correct frame epoch is found. Assuming the frame time $T_f = 250$ ns (i.e., pulse rate = 4 Mpulses/s) and pulse width $T_p = 1.6$ ns, the approximate average acquisition time of our prototype system is $T_{acq} = 256$ µs · As a comparison, to achieve the same average $T_{acq}$, hardware based on parallel search and running at the same frequency requires approximately $5.6 \times 10^5$ correlators, 5.6×10
To further lower $T_{acq}$, we can use parallel FBA architectures (i.e., instantiating multiple forward and backward units to process multiple data segments in parallel). We expect that the increase in logic will be approximately linear when the speed-up factor does not exceed 8 because we already divide the observation window into 8 segments in our iterative decoder and each of them can be run in parallel. For lower speed applications, our design can be further simplified by using single port memory and running the update sequentially. Such a design can save adders and reduce the routing resources. Therefore, we expect the logic gate count to scale linearly as the target pulse rate varies from 500 kpulses/s to 32 Mpulses/s.
Our design can also be directly extended to operate at even lower SNR. This requires adding auxiliary model decoders as well as memories for saving the messages from the additional decoders. Since a 6 th order model is approximately three times more complex than a 2 nd order model, we estimate that the operating E c /N 0 can be lowered to -13 dB by tripling the gate count or alternatively, increasing the acquisition time by 3 times and tripling the message memory.
V. CONCLUSION
In this paper, we present a new hardware architecture for fast PN acquisition in UWB systems based on iterative message passing on a graphical model with redundancies. Our new algorithm improves sensitivity significantly via the introduction of multiple redundant models. Hardware based on the algorithm is economical to implement and can rapidly acquire very long PN sequences. There is no known way to accomplish this with traditional approaches using similar hardware resources.
We examined in detail the design trade-offs in choosing an appropriate architecture for the main component: a forward backward algorithm based decoder. We then demonstrated how to combine multiple redundant models into a single model to reduce memory usage. Finally, we gave a detailed account of our hardware implementation and discussed various implementation techniques. Our design can fit into a small FPGA, while full parallel search is impractical to implement and serial search is five orders of magnitude slower than our design.
Future work will be focused on designing hardware for a more realistic system model which incorporates oversampling, interference and multi-path channel distortions.
APPENDIX UPDATE EQUATIONS FOR THE 4-STATE FSM DECODER
The update equations for the FSM are a direct application of the standard message passing algorithms [21] to Fig. 10(c). The variables $F_k$, $B_k$, $LI0_k$, $LI1_k$ and $RI_k$ are defined in Fig. 10(d). Alternatively, they can be derived from Fig. 10(b) by applying standard SISO update rules to each SISO. We list the equations based on Fig. 10(c)
