Implementation Aspects of a Transmitted-Reference UWB Receiver by Casu, Mario Roberto & Durisi, Giuseppe
WIRELESS COMMUNICATIONS AND MOBILE COMPUTING
Wirel. Commun. Mob. Comput. 2005; 5:537–549
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/wcm.309
Implementation aspects of a transmitted-reference
UWB receiver
Mario R. Casu¹,*,† and Giuseppe Durisi²
¹CERCOM-Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
²Istituto Superiore Mario Boella, Via Pier Carlo Boggio 61, I-10138 Torino, Italy
Summary
In this paper, we discuss the design issues of an ultra wide band (UWB) receiver targeting a single-chip CMOS
implementation for low data-rate applications like ad hoc wireless sensor networks. A non-coherent transmitted-
reference (TR) receiver is chosen because of its small complexity compared to other architectures. After a brief
recapitulation of the UWB fundamentals and a short discussion on the major differences between coherent and
non-coherent receivers, we discuss issues, challenges and possible design solutions. Several simulation results
obtained by means of a behavioral model are presented, together with an analysis of the trade-off between
performance and complexity in an integrated circuit implementation. Copyright © 2005 John Wiley & Sons, Ltd.
1. Introduction
Ultra wide band (UWB) systems using short pulse
modulation are nowadays considered a viable solution
for short-range mobile communications. This technology
has several advantages that make it particularly
suitable for such systems, especially in indoor
environments: multipath resistance, high time
resolution, which helps limit interference problems,
and the ability to combine low emitted power,
high data rate, and reuse of frequency
bands already assigned to other kinds of services.
Generally speaking, UWB systems are characterized
by the emission of periodically repeated short dura-
tion pulses that replicate a basic waveform. The
period between two pulses may be constant or vari-
able, depending on the adopted modulation scheme
and on the multiple access technique the system
employs. Typical modulation techniques are pulse
amplitude modulation (2-PAM), pulse position mod-
ulation (PPM), and on-off keying (OOK). Another
characteristic of all UWB devices is the short duration
of the transmitted waveform. Due to this short time
span, the spectrum of a UWB signal may extend over
several gigahertz, thus overlapping part of the bands
used by narrowband systems. The high time resolution
due to the transmission of narrow pulses, on the order
of a few nanoseconds, enables accurate localization
applications (few centimeters of resolution). This
makes UWB appealing for low data-rate applications
like location-aware wireless sensor networks [1].
Another important feature of UWB technology
using baseband pulse modulation is the drastic sim-
pliﬁcation of the analog front-end of the receivers.
After low noise ampliﬁcation, the baseband analog
signal can be directly converted into a digital format,
enabling the exploitation of the ﬂexibility (program-
mable/reconﬁgurable hardware), the noise robustness,
and the low power consumption of digital CMOS
integrated circuits.
The above-mentioned advantages might enable the
integration into a single CMOS chip of all digital,
analog, and radio-frequency functions. However, to
the best of our knowledge, no fully integrated solu-
tions are commercially available. Only few recent
*Correspondence to: Mario R. Casu, CERCOM-Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy.
†E-mail: mario.casu@polito.it
Contract/grant sponsor: MIUR (Italian Ministry of Education and Research).
publications have the complete integration of a UWB
transceiver in a CMOS technology as a ﬁnal goal
[2–5]. However, these works are limited in scope by
the assumptions on the channel model. In some cases
an AWGN channel is assumed, leading to receiver
concepts based on matched ﬁlters, which are expected
to perform poorly on a multipath channel like the
UWB indoor one. Other approaches are based on
Rake receiver architectures, which show adequate
performance only when equipped with a large number
of ﬁngers. As a result, the power consumption may be
too high for low power applications. Research in the
ﬁeld of low-complexity receivers (e.g., working in
absence of channel state information [6–9]) is limited
today to the system-level analysis, and publications
describing detailed implementation of the overall
receiver are still lacking. Some aspects of the receiver
are discussed in Reference [10] (low noise ampliﬁer)
and References [2,11] (wideband A/D converters with
reduced output dynamic). Finally, Reference [12]
contains an interesting survey on UWB transceiver
challenges, especially as far as the A/D converter is
concerned.
The main contributions of this paper are the
following:
• A fully digital implementation of a UWB non-coherent
receiver based on the transmitted-reference
principle [6] is discussed. The receiver is
assumed to operate on a UWB indoor channel,
modeled as in Reference [13].
• A limited complexity blind synchronization algorithm
is presented and its performance analyzed,
assuming a dense multipath channel and no multiple
access interference.
• The hardware implementation of both the A/D and
the baseband part of the TR receiver is discussed. In
particular, we present a wide set of implementation
possibilities for both the synchronization and the
demodulation operations and we rank them according
to the number of resources required for the
implementation.
The paper is organized as follows: in Section 2, we
analyze the principal structures proposed so far in the
literature for UWB receiver, brieﬂy discussing their
complexity from an implementation viewpoint.
Furthermore, we present the principal characteristics
of a TR receiver at the system level. In Section 3, we
compare digital and analog implementation solutions
for the TR receiver and in Section 4, the synchroniza-
tion and tracking algorithms are presented, together
with an analysis of the precision requirements of the
A/D. The hardware integration challenges and solu-
tions are discussed in Section 5. Finally, Section 6
summarizes the achievements of this work.
2. Coherent versus Non-Coherent
Receivers
As already mentioned, the fine delay resolution of
UWB signals provides high robustness in dense
multipath environments [14]. In case of perfect
channel estimation and synchronization (coherent
reception), and in the absence of intersymbol and multiuser
interference, it is well known [15] that a Rake receiver
is the optimal detection scheme, in the sense that it
minimizes the probability of error measured at the
receiver end. However, if the knowledge of the channel
is not ideal but is acquired through a suitable
estimation algorithm, this structure reduces to a heuristic
approximation of the optimal detection scheme.
Adopting the same terminology as in Reference [16],
we will refer to this class of receivers as
pseudo-coherent Rake receivers.
To fully exploit the channel diversity, a pseudo-
coherent Rake receiver must be able to capture and
track the energy associated with a high number of
multipath replicas. In Reference [13], it is shown that
the number of paths needed to capture 85%
of the overall energy can sometimes exceed 100. In
addition, the radiation and propagation processes can
act on the transmitted pulse as a ﬁlter whose char-
acteristics vary from path to path. Therefore, the
received signal can be seen as a train of distorted
waveforms that often show little resemblance with the
transmitted pulse [17,18].
Due to complexity constraints, only a small subset
of the received replicas is expected to be selected and
combined, a fact that justiﬁes the performance loss
illustrated in References [14,19,20] and Reference [6]
for various selection combining methods. Further-
more, the presence of pulse distortion increases the
complexity of the channel estimation algorithm
[21,22], a topic that has not been fully analyzed in
the literature yet. In general, it can be expected that
the presence of complexity constraints, for example
low power consumption, will impose sub-optimal
solutions for the channel estimation process, with a
consequent further performance loss.
A different approach to overcome all the above-
mentioned disadvantages is based on the use of
techniques that do not explicitly try to estimate the
channel and, possibly, require a less power-consuming
receiver architecture. This class of receivers is de-
noted in the literature as ‘non-coherent’ [23–27],
meaning that the demodulation is performed in ab-
sence of channel state information at the receiver.
These receivers can capture a large amount of
the transmitted energy, despite the distortion and
multipath propagation experienced by the signal
during transmission over the UWB channel.
They represent, however, a sub-optimal solution, if
compared to coherent receivers because of the adop-
tion of a noisy signal as a reference waveform for the
demodulation process. Among these receivers, it is
worth mentioning, for their inherent architectural
simplicity, the differential [9], the TR [6], and energy
detector [25]. For a survey on non-coherent receivers
for UWB applications and a comparison in terms of
performance with pseudo-coherent Rake architec-
tures, the interested reader is referred to Reference
[28] and references therein.
In this work, we focus on the TR receiver; however,
all the results presented in the next sections can be
easily extended to the differential case due to the
similarities between the two schemes. We consider the
TR modulation format described in Reference [7].
According to this model, the transmitted UWB signals
are grouped in blocks of length N. Each block consists
of Nr reference pulses and Nd amplitude modulated
ones. The channel is assumed to be static during one
block and to change independently from block to
block (block fading assumption). This assumption is
expected to be valid for UWB systems operating in
indoor environments, as the indoor channel is characterized
by a rather long coherence time. The received signal $r(t)$
corresponding to one transmitted block is given by

$$ r(t) = \sum_{j=0}^{N_r-1} s(t - jT_f) + \sum_{j=0}^{N_d-1} b_j\, s\bigl(t - (j + N_r)T_f\bigr) + n(t) \qquad (1) $$

where $\{b_j\}_{j=-\infty}^{+\infty}$ is the sequence of information bits,
$s(t)$ is the received pulse, obtained by convolving the
transmitted pulse with the channel impulse response,
$n(t)$ is AWGN with two-sided noise
spectral density $N_0/2$, and $T_f$ is the frame time, chosen larger
than the delay spread of the channel to avoid intersymbol
interference. At the receiver, each modulated
pulse is correlated with an internal template waveform,
obtained by averaging over the reference
pulses. Finally, a hard decision is made.
It is clear that the performance of a TR receiver
improves as $N_r$ increases because of the improved
quality of the internal reference (the average filters out
the white noise); at the same time, however,
the data rate decreases. While a substantial improvement
is observed passing from $N_r = 1$ to $N_r = 2$, a
further increase of complexity does not lead to a
proportional bit error rate (BER) reduction. Thus,
we chose for our benchmark implementation a TR
receiver whose reference sequences consist of a few
symbols (typically less than four).
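The trade-off between reference quality and data rate can be made explicit with a short back-of-the-envelope derivation. It is only a sketch under stated assumptions: the noise samples $n_i[m]$ affecting different reference frames are taken to be independent with the same variance $\sigma^2$ (a symbol introduced here, not in the original notation), the reference pulses are unmodulated, and the block length is $N = N_r + N_d$.

```latex
% Template obtained by averaging N_r noisy copies of the received pulse s[m]
t[m] = \frac{1}{N_r}\sum_{i=1}^{N_r}\bigl(s[m] + n_i[m]\bigr)
     = s[m] + \bar{n}[m],
\qquad
\operatorname{Var}\bigl(\bar{n}[m]\bigr) = \frac{\sigma^2}{N_r}

% Fraction of the transmitted pulses that actually carry information
\eta = \frac{N_d}{N} = \frac{N_d}{N_r + N_d}
```

Under these assumptions the noise power in the template falls only as $1/N_r$, so the step from $N_r = 1$ to $N_r = 2$ halves it, while each further reference pulse brings a smaller relative gain and still lowers $\eta$.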
From the BER perspective, if a packet transmission
is assumed, the differential receiver (DR) can be thought of as a TR
receiver with $N_r = 1$. However, the latter requires the channel to be
constant over the whole packet, while for the differential
receiver this assumption is relaxed to two adjacent
symbols.
3. Analog and Digital Domain Processing
The choice of the TR receiver still leaves room for a
number of alternatives concerning the partitioning be-
tween analog and digital circuits. Two extreme cases are
shown in Figure 1. The fully analog receiver correlates
the incoming data with an internal analog template
and samples the correlation result. A hard decision
then reduces to a simple 1-bit A/D conversion. The fully
digital scheme converts the RF signal immediately after
the LNA ampliﬁcation and then performs a digital
correlation and decision. The analog correlator is
critical as it handles a very high frequency and large
bandwidth signal, while the sampling is done at the
pulse repetition rate that for low power low data-rate
Fig 1. (a) Fully analog receiver; (b) Fully digital receiver.
applications means 0.1–1.0 μs. The digital scheme is
relatively simple from the signal processing viewpoint;
on the other hand the A/D converter requires a very high
sampling rate in order to meet the Nyquist criterion.
Nonetheless, the analog template generation is even
more difﬁcult. In particular, analog delay lines are
needed to align the template signal and the received
pulse. The delay can be as long as the repetition period
and, with the currently available technologies, it is not
feasible to provide such kind of delays on-chip. There-
fore, provided that a sufﬁciently fast A/D is available,
the digital alternative is the most suitable for a single
chip CMOS implementation. We will therefore refer to
this fully digital scheme in the following. The impact of
this choice on system parameters such as cost and power
consumption may be relevant, especially in light of
the power consumption of fast A/D converters. Depending on the
application constraints, such as battery lifetime in wireless
sensor networks, other choices might be more suitable.
4. Behavioral Model
In order to compare the hardware design with a
reference model, we implemented a behavioral model
of the receiver TR architecture as a C program. Within
the C framework, synchronization and demodulation
with early-gate ﬁne tracking are fully characterized.
4.1. Bandwidth
According to FCC regulation, UWB devices for indoor
environments are allowed to operate in the
band below 960 MHz and from 3.1 to
10.6 GHz. The first band is reserved for imaging
systems, the latter for communications and
measurement devices. In this paper, we will ignore
this provision and assume that the UWB
pulse is transmitted in the band below 960 MHz. A similar
assumption was made in References [2] and [3]. More
precisely, we will assume that the transmitted pulse
has a bandwidth of 500 MHz and that the A/D converter
samples at 1 GHz. However, the model and the hardware
implementation alternatives that we present in
the following sections are completely parametric and
can be adapted to other bandwidths.
4.2. Channel Model
The test vectors of the channel response are built
using the IEEE 802.15.3a model [13], which is based
on a modification of the Saleh-Valenzuela model [32]. This
model takes into account the clustering phenomena
observed in several UWB channel measurements [33].
According to Reference [13], the channel impulse
response can be modeled as

$$ h(t) = \sum_{l=0}^{L} \sum_{h=0}^{H} \alpha_{l,h}\, \delta(t - T_l - \tau_{l,h}) \qquad (2) $$

where $\{\alpha_{l,h}\}$ are the multipath gain coefficients, and
$\{T_l\}$ and $\{\tau_{l,h}\}$ represent the delay of the $l$th cluster
and of the $h$th multipath ray relative to the $l$th cluster
arrival time, respectively. The distribution of cluster and ray
interarrival times is exponential. The average power
delay profile shows a double exponential decay (for
the cluster average power and for the ray average power
in each cluster), and the fading statistics are
lognormal. Finally, the sign of each multipath
replica is either positive or negative, with the same
probability.
The maximum delay spread of the channel is equal
to 100 ns (or 100 samples). A white Gaussian noise is
added to the samples with variable SNR as well as a
narrowband interferer (optional). A single user sce-
nario is considered, assuming that some form of MAC
protocol (TDMA or CSMA) is employed to limit
multiple access interference.
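The structure of such a channel realization can be illustrated with a short Python sketch. It is a simplified stand-in, not the IEEE 802.15.3a reference generator: the function name `sv_channel` and all parameter values (arrival rates, decay constants, fading spread) are illustrative assumptions, the lognormal fading is reduced to a single Gaussian term in dB, and time is measured in nanoseconds with one sample per nanosecond, as implied by the 1 GHz sampling rate.

```python
import math
import random


def sv_channel(n_taps=100, cluster_rate=0.0233, ray_rate=2.5,
               cluster_decay=7.1, ray_decay=4.3, sigma_db=3.4, rng=None):
    """Return one sampled, energy-normalized cluster/ray channel realization.

    All default values are illustrative assumptions (rates in 1/ns, decays in ns).
    """
    rng = rng or random.Random(0)
    h = [0.0] * n_taps
    t_cluster = 0.0                      # first cluster arrives at t = 0
    while t_cluster < n_taps:
        tau = 0.0                        # first ray coincides with the cluster
        while t_cluster + tau < n_taps:
            # Double exponential decay of the average power
            mean_power = (math.exp(-t_cluster / cluster_decay)
                          * math.exp(-tau / ray_decay))
            # Simplified lognormal fading: Gaussian in dB
            fading = 10 ** (rng.gauss(0.0, sigma_db) / 20)
            # Equiprobable sign of each multipath replica
            sign = rng.choice((-1.0, 1.0))
            h[int(t_cluster + tau)] += sign * math.sqrt(mean_power) * fading
            tau += rng.expovariate(ray_rate)          # exponential ray spacing
        t_cluster += rng.expovariate(cluster_rate)    # exponential cluster spacing
    energy = sum(x * x for x in h)
    return [x / math.sqrt(energy) for x in h]         # unit-energy realization


h = sv_channel()
print(len(h), sum(x * x for x in h))
```

The truncation to 100 taps mirrors the 100 ns maximum delay spread stated above; noise and the optional narrowband interferer would be added to the received samples in a separate step.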
4.3. Synchronization
4.3.1. Introduction
The purpose of the synchronization algorithm is to
recover, at the receiver, the timing information re-
quired to perform the demodulation and the detection
of the transmitted information. In our case, timing
recovery may be conveniently viewed as a two-part
process, the ﬁrst consisting of estimating the time
instant representing the starting point of the pulse
(symbol-level synchronization) and the second aiming
at identifying the position of the ﬁrst reference pulse
in each transmitted frame (block-level synchroniza-
tion). Therefore, our aim is to accomplish these two
tasks with an algorithm which offers the best com-
promise between complexity and performance.
Interestingly, if one approaches the problem of
timing recovery from a theoretical point of view,
with the objective of deriving the best strategy according
to some criterion (e.g., the minimization of the
Euclidean distance between transmitted and estimated
signal), it turns out that the synchronization problem
is strictly related to the channel estimation problem. This is
shown for example in Reference [31], where a least
square algorithm for the joint estimation of what we
called symbol-level and block-level synchronization
is presented in a slightly different context. The afore-
mentioned algorithm assumes that training sequences
are sent and achieves synchronization through an
optimization over a space of dimension 2. It is worth
noting that the algorithm produces as a by-product
an estimation of the impulse response of the channel.
The strategy described in Reference [31], albeit
characterized by good performance, is of scarce utility
to our purposes because of its high computational
complexity.
In the next section we will, instead, propose an
algorithm, suboptimal with respect to the one pre-
sented in Reference [31], which performs symbol
synchronization based on heuristic considerations.
Its performance is expected to be rather poor, if
compared to more sophisticated approaches. How-
ever, its limited complexity makes it an interesting
solution for our particular context.
4.3.2. Symbol-level synchronization
We assume that each frame contains $N_f$ samples,
where $N_f$ is obtained by dividing the frame length $T_f$
by the sampling time, and that the number of significant
samples per pulse is $N_w$. Our algorithm
is based on the sliding correlation principle. Given the
vector of received samples $r = (r[1], r[2], \ldots, r[M])$,
the $N_f$ correlation values $S_{\mathrm{bit}}(k)$, $k \in [1, N_f]$,
and the starting sample $k_{\max}$ are computed as follows:

$$ S_{\mathrm{bit}}(k) = \sum_{j=1}^{N_p} \left| \sum_{i=k}^{k+N_w-1} r[i]\, r[i + jN_f] \right|, \qquad k = 1, 2, \ldots, N_f \qquad (3) $$

$$ k_{\max} = \arg\max_{k \in [1, N_f]} S_{\mathrm{bit}}(k) \qquad (4) $$

where $N_p$ indicates the number of pulses to be correlated
with the first one. In words, for each sample of
the first analyzed period, the algorithm computes an
estimate of its likelihood of being a potential starting
sample. Then the sample with maximum likelihood is
chosen. A threshold can be set to discard frames
which contain only noise. The example in
Figure 2 demonstrates this simple algorithm. Let us
consider the discrete signal in Figure 2 and the stream
of bits $\{\ldots, +1, -1, -1, +1, \ldots\}$. Since $N_f = 5$, the
algorithm calculates five estimates. Suppose we start
at sample number 1. Its likelihood of being the starting
sample is calculated by correlating the first $N_w = 3$
samples at positions 1, 2, 3 with the next three samples
at positions 6, 7, 8 (that is, $N_f = 5$ samples later), and
taking the absolute value:

$$ (0 \cdot 0) + (-1 \cdot +1) + (-2 \cdot 2) = -5 \;\Rightarrow\; s_1 = |-5| = 5 $$

In the same way, the estimate for sample 2 is
calculated by correlating the three samples at positions
2, 3, 4 with samples 7, 8, 9:

$$ (-1 \cdot +1) + (-2 \cdot 2) + (-1 \cdot +1) = -6 \;\Rightarrow\; s_2 = |-6| = 6 $$

By repeating this procedure for the first five samples,
the computed estimates are:

$$ s = (5, 6, 5, 1, 1) $$

Since the maximum is $s_2$, sample 2 is correctly chosen
as the starting sample. The signal has been correctly
acquired.
This simple algorithm works optimally for a
noise-free signal. On noisy channels, its performance
can be improved by not limiting the correlation
to the adjacent pulse but considering also the
following $N_p - 1$ pulses ($N_p$ indicates the number
of pulses to be correlated with the first one). The
trivial example above corresponds to the case
$N_p = 1$. Of course, increasing $N_p$ comes at the cost
of increased acquisition time and/or hardware
complexity, as shown later on. For example, when
$N_p = 2$, the estimate $s_2$ becomes

$$ s_2 = |(-1 \cdot +1) + (-2 \cdot 2) + (-1 \cdot +1)| + |(-1 \cdot -1) + (-2 \cdot -2) + (-1 \cdot -1)| = |-6| + |6| = 12 $$

Fig. 2. Example of bit stream for symbol-level synchronization.
Using the IEEE 802.15.3a model, we computed the
performance of this simple algorithm for varying $E_b/N_0$
and channel realizations. We consider a signal acquired
if the starting sample $k_{\max}$ is at most 50 samples
(i.e., half the pulse duration $N_w = 100$) early or late
with respect to the ideal value, that is
$|k_{\max} - k_{\mathrm{ideal}}| \le N_w/2$.
The rationale behind this simple criterion is that a
large part of the energy still has to be contained in
the $N_w$ samples following the starting sample $k_{\max}$.
Figure 3 reports the simulation results in terms of
acquisition probability as a function of $E_b/N_0$ and $N_p$.
The curves show a clear improvement in acquisition
probability for $N_p > 1$, but the small difference
between $N_p = 3$ and $N_p = 5$ suggests that the increase
of complexity for $N_p > 3$ is not justified.
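The sliding correlation of Equations (3) and (4) can be sketched in a few lines of Python. The function names and the test signal are assumptions made for illustration: the signal places the pulse (1, 2, 1) inside a 5-sample frame and modulates four frames with the bits −1, +1, −1, +1, a choice that reproduces the estimates of the worked example. Indices are 0-based, so the paper's sample 2 corresponds to index 1.

```python
def sliding_correlation(r, n_f, n_w, n_p):
    """Return S_bit(k) for k = 0 .. n_f - 1 (0-based form of Equation (3))."""
    s_bit = []
    for k in range(n_f):
        total = 0
        for j in range(1, n_p + 1):
            # Correlate the window starting at k with the same window j frames later
            inner = sum(r[i] * r[i + j * n_f] for i in range(k, k + n_w))
            total += abs(inner)
        s_bit.append(total)
    return s_bit


def starting_sample(s_bit):
    """Return the 0-based index of the largest correlation value (Equation (4))."""
    return max(range(len(s_bit)), key=lambda k: s_bit[k])


# Assumed noise-free test signal (not taken from Figure 2 itself)
frame = [0, 1, 2, 1, 0]
bits = [-1, +1, -1, +1]
r = [b * x for b in bits for x in frame]

s_np1 = sliding_correlation(r, n_f=5, n_w=3, n_p=1)
s_np2 = sliding_correlation(r, n_f=5, n_w=3, n_p=2)
k_max = starting_sample(s_np1)
print(s_np1, s_np2, k_max + 1)
```

The received vector must contain at least $(N_p + 1) N_f + N_w - 1$ samples, which is the memory requirement that appears later in the complexity tables.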
4.3.3. Block-level synchronization
Once the signal timing at the symbol-level has been
acquired, a block-level synchronization step, which
reveals the position of the reference bits in the trans-
mitted frames, is needed. This step can be performed
in a blind way, employing an algorithm similar to the
one described for the symbol-level case, or through
the transmission of a training sequence.
4.4. Early-Late Demodulation and Tracking
After synchronization is achieved, the receiver demodulates
the $N_d$ data symbols contained in each block.
A template vector $\mathbf{t}$ is constructed by averaging over the
reference sequence $\mathbf{r} = [\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_{N_r}]$, where $\mathbf{r}$
denotes the vector containing the discrete samples
associated with the reference part of the transmitted
block. The reference sequence can be multiplied by
a reference pattern $\mathbf{p} = [p_1, p_2, \ldots, p_{N_r}]$, $p_i \in \{-1, 1\}$,
if a pattern known to the receiver is employed to
modulate the reference pulses. In formulas:

$$ \mathbf{t} = \frac{1}{N_r} \sum_{i=1}^{N_r} p_i \mathbf{r}_i $$

The demodulated data $d_j \in \{-1, 1\}$ are then
obtained by taking

$$ d_j = \mathrm{sign}\bigl(\mathbf{t} \cdot \mathbf{d}_j^T\bigr), \qquad j = 1, 2, \ldots, N_d $$

with $\mathbf{d} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_{N_d}]$ containing the discrete
samples of the data part of the transmitted block.
The fine tracking is implemented by correlating the
data $\mathbf{d}_j$ with the on-time version of the reference and
with the early and late versions, that is, anticipated and
delayed by one sample, respectively. Then the demodulated
data are retrieved from the sign of the correlation
that has the maximum absolute value among

$$ c_{\mathrm{early}} = \mathbf{t}_{\mathrm{early}} \cdot \mathbf{d}_j^T, \qquad c_{\mathrm{on\text{-}time}} = \mathbf{t}_{\mathrm{on\text{-}time}} \cdot \mathbf{d}_j^T, \qquad c_{\mathrm{late}} = \mathbf{t}_{\mathrm{late}} \cdot \mathbf{d}_j^T $$

When the early (late) correlation prevails, the
starting sample for the next symbol is anticipated
(delayed) by one position.
Figure 4 reports the error probability evaluated by
averaging over $10^6$ realizations of the input signal
and channel. Two curves are reported: the 'No Synch'
case, where the timing is ideally acquired, and the
'Synch' case, where the timing is given by a previous
run of the symbol-level synchronization algorithm.
We suppose perfect block-level synchronization in
both cases. It can be noticed that the second curve
has a slightly larger error probability, but it tracks the
first one well.
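The template construction and the early-late decision can be sketched as follows. This is a minimal illustration, not the behavioral C model: the function names, the toy noise-free signal, and the choice of shifting the data window by one sample (used here as an equivalent of shifting the template) are assumptions. Ties are resolved in favor of the on-time correlation, which is also an assumption.

```python
def dot(a, b):
    """Inner product of two equally long sample vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_template(refs, pattern):
    """Average the reference windows after removing the known pattern p_i."""
    n_r = len(refs)
    return [sum(p * ref[m] for p, ref in zip(pattern, refs)) / n_r
            for m in range(len(refs[0]))]


def demodulate_block(samples, start, n_f, n_w, n_r, n_d, pattern):
    """Demodulate one block; `start` is the estimated first sample of the block."""
    refs = [samples[start + i * n_f: start + i * n_f + n_w] for i in range(n_r)]
    t = build_template(refs, pattern)
    bits, offset = [], 0
    for j in range(n_d):
        base = start + (n_r + j) * n_f + offset
        # On-time first, so that it wins ties; -1 is early, +1 is late
        corr = {shift: dot(t, samples[base + shift: base + shift + n_w])
                for shift in (0, -1, +1)}
        best = max(corr, key=lambda s: abs(corr[s]))
        bits.append(1 if corr[best] >= 0 else -1)
        offset += best          # anticipate or delay the next starting sample
    return bits, offset


# Toy noise-free block: pulse (1, 2, 1) at position 2 of an 8-sample frame
frame = [0, 0, 1, 2, 1, 0, 0, 0]
pattern = [1, -1]               # assumed reference pattern, N_r = 2
data = [1, -1, -1, 1]           # N_d = 4 information bits
samples = [b * x for b in pattern + data for x in frame]

decoded, drift = demodulate_block(samples, start=2, n_f=8, n_w=3,
                                  n_r=2, n_d=4, pattern=pattern)
print(decoded, drift)
```

With a noisy input, the same routine would let `offset` wander by one sample per symbol, which is the fine tracking behavior described above.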
4.5. A/D Precision Requirements
We estimated the requirements of the A/D in terms
of precision by means of simulations. At first, we
evaluated the probability density function (pdf) of the signal
amplitude using the IEEE 802.15.3a model, in order
to select the best quantization strategy. The result is
reported in Figure 5. All channel realizations in our
model are normalized such that their energy is equal
to one. This is equivalent to assuming that the receiver
employs an automatic gain control (AGC) circuit,
which guarantees a constant input power to the A/D.
This normalization removes from the analysis the
effect of, for instance, different TX-RX distances
which, in turn, would imply a different use of the
A/D input range and a consequently different resolution
in terms of quantization. We noticed that 99% of the
area of the pdf is contained in $[-0.5, 0.5]$. Consequently,
we set the saturation thresholds to $s_{\max} = 0.5$.
Fig. 3. Symbol-level acquisition probability. The results are
averaged over $10^6$ realizations of the input signal and
channel.
We tested two quantization mappings, as shown in
Figure 6. The first mapping fully uses the $2^n$ levels for
$n$ bits but does not have a unique code for the small
values around zero. The second one has one code for
values around zero but, on the other hand, has only $2^n - 1$
different levels, thus not fully exploiting the binary
code, because two codes represent 0. The first is
suited for the offset-binary representation of the
numbers, while the second one is suited for sign-magnitude
coding, as shown in the figure. The first one is the
only choice in case of $n = 1$. The second one suppresses
the so-called 'idle noise' that leads to useless
computation in the digital back-end. Thus, mapping 2
is more suited for a low power digital implementation.
An example of this phenomenon is shown in Figure 7.
From a performance perspective, the two methods are
equivalent. In Figure 8 the results of the simulations
using mapping 1 are reported. The ﬂoating point case
is compared to several quantization cases from 1 to 4
bits. Similar results, not reported for brevity, have
been obtained using mapping 2. Therefore, we can
Fig. 4. Error probability: ideal bit-level acquisition com-
pared to the algorithm described in Equation (3).
Fig. 5. Amplitude statistical distribution for the IEEE
802.15.3a channel model.
Fig. 6. Two possible quantization mappings.
Fig. 7. Effect of the idle noise and its suppression using a
proper coding.
conclude that neither of them is preferable in terms of
performance. Other parameters, like ease of implementation
or power consumption, can be used for
further discriminating between the two. Analyzing
the results, we notice that the 1-bit and 2-bit quantizations
have poor performance. The 4-bit curve is close
to the floating point one and can be considered
sufficient for implementation purposes. The fact that
a small number of bits is sufficient is relevant from
the implementation perspective: the actual precision,
in terms of 'effective number of bits,' reduces significantly
for ultra-fast, high-bandwidth A/D converters [29] like
the ones needed for the UWB system simulated here.
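The difference between the two quantization mappings, and the idle-noise effect in particular, can be illustrated with a small Python sketch. Since Figure 6 is not reproduced here, the exact thresholds are an assumption: both quantizers are taken to be uniform and to saturate at $\pm s_{\max}$, with mapping 1 modeled as a quantizer without a zero level and mapping 2 as a quantizer with a zero level. The function names are hypothetical.

```python
import math


def quantize_mapping1(x, n_bits, s_max=0.5):
    """Uniform quantizer with 2**n levels and no level at zero (assumed mapping 1)."""
    step = 2 * s_max / 2 ** n_bits
    k = math.floor(x / step)
    k = max(-2 ** (n_bits - 1), min(2 ** (n_bits - 1) - 1, k))   # saturation
    return (k + 0.5) * step


def quantize_mapping2(x, n_bits, s_max=0.5):
    """Uniform quantizer with 2**n - 1 levels including zero (assumed mapping 2)."""
    if n_bits < 2:
        raise ValueError("mapping 2 needs at least 2 bits")
    k_max = 2 ** (n_bits - 1) - 1
    step = 2 * s_max / (2 ** n_bits - 1)
    k = math.floor(x / step + 0.5)                               # round to nearest
    k = max(-k_max, min(k_max, k))                               # saturation
    return k * step


# Small samples around zero, as produced by noise between two pulses
idle = [0.01, -0.02, 0.015, -0.005]
out1 = [quantize_mapping1(x, 2) for x in idle]
out2 = [quantize_mapping2(x, 2) for x in idle]
print(out1)   # toggles between the two innermost levels
print(out2)   # stays at zero
```

In this sketch the output of mapping 1 keeps switching sign on a noise-only input, so the digital back-end keeps computing, whereas mapping 2 returns zero for every sample in `idle`.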
5. Hardware Integration: Challenges
and Solutions
We will not discuss the integration issues of LNA and
ﬁlters in a mixed-signal CMOS chip. We will focus,
instead, on the baseband part of the receiver with a
few hints regarding the A/D converter. The integration
of this part of the receiver is very critical; the precision
requirements, as stated in Section 4, are not an issue,
the main problem being the very high bandwidth.
Some proposals [2,3,12,30] suggest the use of a
bank of $N_{ad}$ parallel A/D converters, each with sampling
frequency $f_s/N_{ad}$ and a delay between two consecutive
converters of $\Delta T = 1/f_s$. In Reference [2],
each pulse repeats every $T_{rep} = 1/f_{rep} = 1/(62.5 \times 10^6) = 16$ ns
and the sampling period is $T_s = 1/f_s = 1/(2 \times 10^9) = 0.5$ ns (2 GHz),
therefore $N_{ad} = 16/0.5 = 32$ parallel A/D converters are used.
This solution is mandatory; nonetheless, it still
presents difficulties:
• Even if the sampling frequency is reduced by a factor $N_{ad}$,
each A/D must have a bandwidth equal to or larger
than the UWB pulse bandwidth.
• The clock generation is critical, since each clock
phase must maintain a stable and small relative
delay. The jitter must be kept well under control.
• If the duty-cycle of the pulse is very small
($T_{pulse} \ll T_{rep}$ and so $N_w \ll N_f$ according to the
notation used above), as in low data-rate applications,
the number of parallel A/D converters, and hence the number
of clock phases, is too high for area/power constrained
silicon implementations.
While the first two points require appropriate
circuit and technology solutions, the third one can
be addressed by a suitable arrangement of the A/D
registers. The example in Figure 9 is the case of
$N_f = 1000$, $N_{ad} = 10$. The outputs of the A/D converters are
organized in shift-registers in order to avoid the
memory input control along with its complex and
slow decoder. Only one window of $N_f$ samples is
registered in the case of Figure 9. If more than
one window is necessary, as for the previously
described synchronization algorithms, the shift-register
size can be easily extended.
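The bookkeeping behind the parallel A/D arrangement can be sketched in Python. The round-robin assignment of samples to converters and the use of one shift register of depth $N_f/N_{ad}$ per converter are assumptions about the organization in Figure 9; the function names are hypothetical and the register contents are plain integers standing in for quantized samples.

```python
from collections import deque


def num_parallel_adcs(t_rep_ns, t_s_ns):
    """Number of converters when each one samples once per repetition period."""
    return round(t_rep_ns / t_s_ns)


def fill_registers(window, n_ad):
    """Distribute one window of samples over n_ad shift registers, round robin."""
    depth = len(window) // n_ad
    regs = [deque(maxlen=depth) for _ in range(n_ad)]
    for n, sample in enumerate(window):
        regs[n % n_ad].append(sample)   # converter n mod n_ad takes sample n
    return regs


def read_back(regs):
    """Re-interleave the register contents into the original sample order."""
    depth = len(regs[0])
    return [regs[a][m] for m in range(depth) for a in range(len(regs))]


n_ad_ref2 = num_parallel_adcs(16.0, 0.5)   # numbers quoted from Reference [2]
window = list(range(1000))                 # N_f = 1000 placeholder samples
regs = fill_registers(window, n_ad=10)
print(n_ad_ref2, len(regs), len(regs[0]), read_back(regs) == window)
```

Each register only shifts at $f_s/N_{ad}$, so no addressable memory or decoder runs at the full sampling rate, which is the point made in the paragraph above.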
Fig. 8. $P(e)$ parametric with $n$ quantization bits.
Fig. 9. A possible arrangement of parallel A/D for the case $N_f = 1000$.
In the following, we will discuss the architecture of
the receiver for the implementation of the synchroni-
zation and demodulation algorithms.
5.1. Symbol-Level Synchronization
Many implementations ranging from parallel to se-
quential are possible.
(1) Parallel Implementation: We assume to have
sufficient resources for executing the algorithm of Equation (3) in one clock cycle (or a few clock cycles, in case of pipelining). The complexity requirements obtained by simply unrolling the loops are reported in Table I, where MEM denotes memory elements, MUL multipliers, and ADD adders. The size in bits of each memory element corresponds to the number of quantization bits of the A/D. Many multiplications are repeated unnecessarily. For instance, for $k = 0$ and $k = 1$ (and with $j = 1$) the sequences of $N_w$ products are the following:

$$k = 0:\quad r[0]\,r[N_f],\ r[1]\,r[1+N_f],\ r[2]\,r[2+N_f],\ \ldots,\ r[N_w-2]\,r[N_w-2+N_f],\ r[N_w-1]\,r[N_w-1+N_f]$$

$$k = 1:\quad r[1]\,r[1+N_f],\ r[2]\,r[2+N_f],\ r[3]\,r[3+N_f],\ \ldots,\ r[N_w-1]\,r[N_w-1+N_f],\ r[N_w]\,r[N_w+N_f]$$

and it is clear that $N_w - 1$ products are common to both sequences. Therefore, the actual number of different products is much less than $N_w N_p N_f$, and it is not difficult to compute. For $k = 0$, we have $N_w$ different products. For $k = 1$, we have to add one more product ($r[N_w]\,r[N_w+N_f]$), for $k = 2$ another one ($r[N_w+1]\,r[N_w+1+N_f]$), and so on until $k = N_f - 1$. This must be repeated for each $j \in [1, N_p]$. As a result, the actual number of MUL is $N_p(N_w + N_f - 1)$. In a similar way, the number of ADD can be reduced by considering that many additions are also repeated. However, it is more difficult to evaluate the exact (and minimum) number of additions. The example in Figure 10, a simplified case with $N_f = 6$, $N_w = 5$, $N_p = 1$ ($\mathrm{MUL} = 6 + 5 - 1 = 10$), shows one possibility. Products are denoted by $P_i$, $i = 0, \ldots, \mathrm{MUL}-1$, while $S_i$, $i = 0, \ldots, N_f - 1$, are the outputs of the correlation step. In this case, the total number of adders is

$$\lfloor (N_{mul} - 1)/2 \rfloor + \lceil N_f/2 \rceil + N_f = 4 + 3 + 6 = 13$$

Shadowed operators refer to the case when $N_f = 7$ ($N_{mul} = 11$). We expect a number of adders equal to

$$\lfloor (N_{mul} - 1)/2 \rfloor + \lceil N_f/2 \rceil + N_f = 5 + 4 + 7 = 16$$

which is consistent with the figure. If $N_p > 1$ the number of adders has to be multiplied by this factor. The overall complexity of this implementation is summarized in Table II.
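The product count derived above can be checked by brute force. The following Python sketch is only an illustration (the function names are hypothetical and not taken from the paper): it enumerates every index pair $(i, i + jN_f)$ required by the unrolled loops, counts the distinct ones, and evaluates the adder expression quoted in the text for Figure 10.

```python
from math import ceil

def count_distinct_products(n_w, n_f, n_p):
    """Count the distinct products r[i]*r[i + j*N_f] over all k and j."""
    pairs = set()
    for j in range(1, n_p + 1):
        for k in range(n_f):                # window start k = 0 .. N_f-1
            for i in range(k, k + n_w):     # N_w products per window
                pairs.add((i, i + j * n_f))
    return len(pairs)

def adders_as_in_figure_10(n_w, n_f, n_p=1):
    """Adder count from the expression given in the text for Figure 10."""
    n_mul = n_w + n_f - 1
    return n_p * ((n_mul - 1) // 2 + ceil(n_f / 2) + n_f)

if __name__ == "__main__":
    n_w, n_f, n_p = 5, 6, 1
    print(n_w * n_p * n_f)                          # unrolled loops: 30
    print(count_distinct_products(n_w, n_f, n_p))   # distinct products: 10
    print(adders_as_in_figure_10(n_w, n_f))         # adders: 13
```

For the simplified case of Figure 10 the enumeration returns $N_p(N_w + N_f - 1) = 10$ distinct products against 30 for the unrolled loops.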
After the evaluation of the correlations, the algorithm proceeds with the search for the maximum. The parallel architecture is simply a tree of comparators, as Figure 11 shows.
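A behavioral view of such a comparator tree is sketched below in Python. It is an assumption-laden illustration rather than the circuit of Figure 11: each level compares neighbouring candidates in pairs, an unpaired candidate is passed through unchanged, and ties are resolved in favour of the lower index. The function name and the returned counters are hypothetical. For $n$ inputs a tree of this kind uses $n - 1$ two-input comparators and $\lceil \log_2 n \rceil$ levels, which is a general property of the structure and not a figure quoted from the paper.

```python
def argmax_tree(values):
    """Pairwise (tournament) maximum search.

    Returns ((max_value, index), number_of_comparators, number_of_levels).
    """
    level = [(v, i) for i, v in enumerate(values)]
    comparators = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        for a in range(0, len(level) - 1, 2):
            left, right = level[a], level[a + 1]
            nxt.append(left if left[0] >= right[0] else right)
            comparators += 1
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes to the next level
        level = nxt
        depth += 1
    return level[0], comparators, depth

if __name__ == "__main__":
    print(argmax_tree([3, 9, 2, 7, 9, 1]))
```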
(2) Sequential Implementation: A possible sequential implementation takes advantage of the previous observation concerning the reuse of products and sums. For convenience, we can rewrite the estimates as follows

$$S_{bit}(k) = \sum_{j=1}^{N_p} \left| \sum_{i=k}^{k+N_w-1} r[i]\, r[i + jN_f] \right| = \sum_{j=1}^{N_p} \left| \sum_{i=k}^{k+N_w-1} p(i,j) \right| = \sum_{j=1}^{N_p} |s(k,j)| \qquad (5)$$
Table I. Complexity for the parallel implementation according to Equation (3).

Resource type    Number of resources
MEM              $(N_p + 1)N_f + N_w - 1$
MUL              $N_w N_p N_f$
ADD              $N_w N_p N_f$

Fig. 10. Example of parallel implementation.

Table II. Actual complexity after loop simplification.

Resource type    Number of resources
MEM              $(N_p + 1)N_f + N_w - 1$
MUL              $N_p(N_f + N_w - 1)$
ADD              $N_p(\lfloor (N_f + N_w - 2)/2 \rfloor + \lceil N_f/2 \rceil + N_f)$
and then write a recursive formula

$$s(k,j) = s(k-1,j) - p(k-1,j) + p(k+N_w-1,j) \qquad (6)$$

with $s(0,j)$ computed as in Equation (5). It is clear that if Equation (6) is evaluated at every clock cycle, only 1 MUL, 1 ADD, and 1 SUB (subtractor) are needed if $N_p = 1$. When $N_p > 1$, another adder is required for the outer sum $\sum_{j=1}^{N_p}$. Table III summarizes the complexity of this simple implementation.
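The recursion of Equation (6) can be cross-checked against the direct double sum of Equation (5) with a short behavioral model. The Python sketch below is only illustrative: the function names are hypothetical, the samples are random integers standing in for quantized A/D outputs (so the comparison is exact), and the outgoing product is simply recomputed in software rather than modelling how a circuit would retain it.

```python
import random

def p(r, i, j, n_f):
    """Product p(i, j) = r[i] * r[i + j*N_f]."""
    return r[i] * r[i + j * n_f]

def s_bit_direct(r, k, n_w, n_f, n_p):
    """S_bit(k) evaluated directly from Equation (5)."""
    return sum(abs(sum(p(r, i, j, n_f) for i in range(k, k + n_w)))
               for j in range(1, n_p + 1))

def s_bit_recursive(r, n_w, n_f, n_p):
    """[S_bit(0), ..., S_bit(N_f-1)] using the sliding update of Equation (6)."""
    # s(0, j) is computed once with the full inner sum.
    s = [sum(p(r, i, j, n_f) for i in range(n_w)) for j in range(1, n_p + 1)]
    out = [sum(abs(v) for v in s)]
    for k in range(1, n_f):
        for idx in range(n_p):
            j = idx + 1
            # drop the product leaving the window, add the one entering it
            s[idx] = s[idx] - p(r, k - 1, j, n_f) + p(r, k + n_w - 1, j, n_f)
        out.append(sum(abs(v) for v in s))
    return out

def make_samples(n_w, n_f, n_p, seed=0):
    """Random 4-bit signed samples; the length matches the MEM entry of Table III."""
    rng = random.Random(seed)
    return [rng.randint(-8, 7) for _ in range((n_p + 1) * n_f + n_w - 1)]

if __name__ == "__main__":
    n_w, n_f, n_p = 5, 12, 3
    r = make_samples(n_w, n_f, n_p)
    rec = s_bit_recursive(r, n_w, n_f, n_p)
    ref = [s_bit_direct(r, k, n_w, n_f, n_p) for k in range(n_f)]
    print(rec == ref)
```

The buffer length $(N_p + 1)N_f + N_w - 1$ is exactly what the largest index $i + jN_f$ requires, in agreement with the MEM figure.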
The computation of the maximum is performed
this time through a sequence of Nf comparisons.
They are implemented using one comparator and
one additional register for saving the temporary
maximum and its index.
(3) Execution Time Considerations: The first important point is whether the synchronization algorithm needs realtime execution. If not, we can certainly relax the architectural constraints, especially for the sequential case. However, since the memory available for storing incoming data is limited, we cannot save all inputs and will lose data while executing the synchronization. At the same time, we need to keep count of the lost samples, so that when we pass from synchronization to tracking we still point to the correct starting sample. In the realtime case, on the other hand, these potential issues are avoided, but the execution speed requirement can be very stringent. The limit for the execution time in the non-realtime case depends on the design choices, while for the realtime case we can derive concrete limits. Suppose we start filling the input memory at time 0. Every $T_s$ seconds a new sample is registered. The algorithm can start as soon as the products $P_i$, $i = 0, \ldots, \mathrm{MUL}-1$ (see Figure 10), can be computed. For the parallel case, we have to wait until time $[(N_p + 1)N_f + N_w - 1]\,T_s$, when the last sample needed for the computation of the products is stored. In order not to lose data, we can add a few registers to save the last $N_f - N_w$ samples of the last symbol used in the synchronization algorithm. These samples are not used for the computation of the $\{P_i\}$ products, but may be needed for the following demodulation step. The parallel execution has to be completed in no more than $(N_f - N_w)T_s$ seconds. If the digital circuit runs at clock frequency $f_{ck}$, the execution must therefore be completed in $N_{ck} = (N_f - N_w)T_s f_{ck}$ clock cycles. A numerical example helps to convey the orders of magnitude. Suppose

$$N_w = 100;\quad N_f = 1000;\quad f_s = 1\ \mathrm{GHz};\quad f_{ck} = 100\ \mathrm{MHz}.$$

Then, we are left with $(1000 - 100) \times 10^{-9} \times 100 \times 10^{6} = 90$ clock cycles, which is abundant for the parallel implementation. Therefore, we can reduce the clock frequency or the size of the additional memory. Many options can be exploited depending on the specifications.
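The cycle budget of the parallel case is easy to tabulate for other operating points. The helper below is a minimal Python sketch (the function name is hypothetical); it uses integer arithmetic on frequencies expressed in hertz, with $T_s = 1/f_s$, so that no floating-point rounding enters the result, and it rounds down to whole clock cycles.

```python
def parallel_cycle_budget(n_f, n_w, f_s_hz, f_ck_hz):
    """Whole clock cycles available: N_ck = (N_f - N_w) * T_s * f_ck."""
    return (n_f - n_w) * f_ck_hz // f_s_hz

if __name__ == "__main__":
    # Example of the text: N_w = 100, N_f = 1000, f_s = 1 GHz, f_ck = 100 MHz
    print(parallel_cycle_budget(1000, 100, 10**9, 10**8))   # 90
    # Halving the clock frequency halves the budget
    print(parallel_cycle_budget(1000, 100, 10**9, 5 * 10**7))  # 45
```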
The sequential algorithm can start as soon as the first product is available, at time $(N_f + 1)T_s$. If the same assumption of completing the storage of the $(N_p + 2)$-th bit holds, the time left for computation is $(N_p + 1)N_f T_s$. The clock cycles needed for this purpose are about $N_p(N_f + N_w)$. Therefore, the following lower bound holds:

$$f_{ck} \ge \frac{N_p (N_f + N_w)}{(N_p + 1) N_f T_s} > \frac{N_p}{N_p + 1}\, f_s$$

For low sampling rates (i.e., less than 2 GHz) this is still feasible in CMOS, even if not straightforward. For higher rates, other solutions must be conceived, like increasing the input buffer size or relaxing the 'no data lost' criterion.
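To give a feeling for this bound, the Python sketch below evaluates it exactly with rational arithmetic. The function name is hypothetical, and the values of $N_p$ are chosen here purely for illustration (the earlier numerical example does not fix $N_p$); $N_f = 1000$, $N_w = 100$ and $f_s = 1$ GHz are reused from that example.

```python
from fractions import Fraction

def sequential_min_clock(n_p, n_f, n_w, f_s_hz):
    """Lower bound f_ck >= N_p (N_f + N_w) / ((N_p + 1) N_f T_s), in Hz."""
    return Fraction(n_p * (n_f + n_w), (n_p + 1) * n_f) * f_s_hz

if __name__ == "__main__":
    for n_p in (1, 4):
        bound = sequential_min_clock(n_p, 1000, 100, 10**9)
        loose = Fraction(n_p, n_p + 1) * 10**9
        print(n_p, float(bound) / 1e6, float(loose) / 1e6)  # both in MHz
```

Under these assumptions the required clock is 550 MHz for $N_p = 1$ and 880 MHz for $N_p = 4$, that is, a large fraction of the sampling frequency itself.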
(4) Mixed Parallel/Sequential Architecture: Intermediate solutions between the two extreme
Fig. 11. Bank of parallel comparators for the maximum evaluation.

Table III. Complexity of the sequential implementation.

Resource type    Number of resources
MEM              $(N_p + 1)N_f + N_w - 1$
MUL              1
ADD              1 if $N_p = 1$; 2 if $N_p > 1$
SUB              1
parallel and sequential implementations are possible. One consists of applying sequentially, $N_p$ times, a parallel search limited to a correlation between two windows. In general, whatever strategy is used for computing the correlation, the execution time limit under the requirement of not losing any data is $(N_p + 2)N_f T_s \ge N_p N_{corr} T_{ck}$, where $N_{corr}$ is the number of clock cycles needed for correlating two windows. For example, at 2 GHz sampling rate and 100 MHz clock frequency, with $N_f = 1000$ and $N_p = 4$, the number of clock cycles available for each correlation is $N_{corr} = 75$. This cannot be achieved with a totally sequential architecture, which requires $N_f + N_w > 1000$ cycles for each correlation. On the other hand, a parallel solution could be too costly and unnecessarily fast. A mixed parallel-sequential solution is better suited to this case.
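The figure of 75 cycles follows directly from the constraint above. As a minimal Python sketch (hypothetical function name, integer arithmetic on frequencies in hertz), the largest admissible $N_{corr}$ can be computed as:

```python
def corr_cycle_budget(n_p, n_f, f_s_hz, f_ck_hz):
    """Largest N_corr such that N_p * N_corr * T_ck <= (N_p + 2) * N_f * T_s."""
    return ((n_p + 2) * n_f * f_ck_hz) // (n_p * f_s_hz)

if __name__ == "__main__":
    # 2 GHz sampling, 100 MHz clock, N_f = 1000, N_p = 4
    budget = corr_cycle_budget(4, 1000, 2 * 10**9, 10**8)
    print(budget)            # 75
    print(1000 > budget)     # a sequential correlation needs more than N_f cycles
```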
5.2. Demodulation and Tracking
The demodulation consists of two phases: template calculation and computation of the correlation between the received signal and the template. Again, parallel and sequential solutions can be implemented.
(1) Parallel Solution: As previously discussed, the template is built as an average, sample by sample, of the first $N_r$ reference symbols in a frame. Since $N_w$ samples are used out of the $N_f$ that constitute each symbol, the number of adders for a parallel solution is $N_w(N_r - 1)$. The number of different products, and hence of MUL for a parallel implementation, is $N_w N_d$. For each data symbol, these products have to be summed over the $N_w$ samples; thus, the number of adders for the correlation is $(N_w - 1)N_d$. The memory requirement is simply $(N_r + N_d)N_w$. Table IV summarizes the resource demand, while a schematic with $N_r = 2$ is reported in Figure 12. If an early-late architecture is implemented, the multiply-accumulate structure highlighted in the figure is replicated three times.
(2) Sequential Solution: The sequential implementation consists of an adder for the template computation and a multiply-accumulate (MAC) unit for the correlation. For the implementation of the early-late tracking, three MACs are used. Figure 13 shows an example of this sequential structure.
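The sequential data path can be mimicked by a short behavioral model. The Python sketch below is not the architecture of Figure 13, only an illustration under explicit assumptions: samples are integers, the template is accumulated as a plain sum (the division by $N_r$ is omitted because a positive scale factor does not change the sign of the correlation), the bit decision is taken on the sign of the correlation, and early-late tracking is not modelled. All names are hypothetical.

```python
def demodulate_frame(refs, data):
    """Sequential demodulation sketch.

    refs: N_r reference windows of N_w samples each
    data: N_d data windows of N_w samples each
    Returns (bits, template_adds, mac_ops).
    """
    n_w = len(refs[0])
    template = [0] * n_w
    template_adds = 0
    for sym in refs:                      # one adder, reused sample by sample
        for n in range(n_w):
            template[n] += sym[n]
            template_adds += 1
    bits = []
    mac_ops = 0
    for sym in data:                      # one MAC unit, reused sample by sample
        acc = 0
        for n in range(n_w):
            acc += template[n] * sym[n]
            mac_ops += 1
        bits.append(1 if acc >= 0 else 0)  # assumed sign-based decision
    return bits, template_adds, mac_ops

if __name__ == "__main__":
    pulse = [1, -2, 3, -1]
    inverted = [-x for x in pulse]
    print(demodulate_frame([pulse, pulse], [pulse, inverted, pulse]))
```

With $N_r = 2$, $N_d = 3$ and $N_w = 4$ the model performs $N_r N_w = 8$ accumulations for the template and $N_d N_w = 12$ multiply-accumulate operations for the data symbols.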
(3) Execution Time: While the acquisition step may be executed off-line, if allowed by the system specifications (at the cost of losing some data, see the previous section), the demodulation phase is necessarily realtime. As a consequence, it is easy to derive the execution time constraints. As far as the parallel execution is concerned, the demodulation starts after the last sample of the $N_d$-th bit is received. If we are not allowed to use additional buffers for the incoming new data, we are still left with $(N_f - N_w)T_s$ seconds before the first sample of the new reference arrives. If $N_{ck}$ cycles are necessary to complete the execution, the constraint for the clock frequency is the following:

$$N_{ck} \le (N_f - N_w)\, T_s f_{ck}$$

For example, with $N_f = 1000$, $N_w = 100$, $f_{ck} = 100$ MHz, $f_s = 1$ GHz, we get $N_{ck} \le 90$, which is more than enough for the parallel implementation, meaning that there is room for further optimization.

As for the sequential solution, the number of clock cycles can be easily derived: $N_r N_w$ for the template and $N_d N_w$ for the demodulation. Therefore, the constraint is the following:

$$(N_r + N_d) N_w T_{ck} \le (N_r + N_d) N_f T_s$$

or, in other words,

$$f_{ck} \ge \frac{N_w}{N_f}\, f_s$$
Table IV. Complexity of a parallel solution.

Resource type    Number of resources
MEM              $(N_r + N_d)N_w$
MUL              $N_w N_d$
ADD              $(N_w - 1)N_d$

Fig. 12. Schematic of a parallel demodulator.
This value can be relatively high in absolute terms: it amounts to 100 MHz in the previous example, which is nevertheless quite relaxed compared with $f_s$.
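As a worked check of the sequential bound, using $T_{ck} = 1/f_{ck}$ and $T_s = 1/f_s$ and the values of the previous example ($N_w = 100$, $N_f = 1000$, $f_s = 1$ GHz), the common factor $(N_r + N_d)$ cancels and the steps are:

$$(N_r + N_d) N_w T_{ck} \le (N_r + N_d) N_f T_s \;\Longrightarrow\; \frac{N_w}{f_{ck}} \le \frac{N_f}{f_s} \;\Longrightarrow\; f_{ck} \ge \frac{N_w}{N_f}\, f_s = \frac{100}{1000} \times 1\ \mathrm{GHz} = 100\ \mathrm{MHz}$$

The bound therefore does not depend on how many reference and data symbols make up a frame, only on the fraction $N_w/N_f$ of each symbol that is actually processed.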
(4) Intermediate Solution: Mixed parallel-sequential solutions can be a good compromise between area and clock frequency. The easiest is to duplicate the structure in Figure 13 and let the two copies work on two different symbols, symbol $2i$ and symbol $2i + 1$, or on the even and odd samples of the same symbol. By replicating the sequential structure $n$ times, we tend to the parallel implementation. Therefore, the actual complexity lies between those of the two previously presented architectures.

In general, the time necessary to complete the demodulation must respect the following constraint:

$$(N_t + N_d N_{dem})\, T_{ck} \le (N_d + N_r)\, N_f T_s$$

where $N_t$ is the number of clock cycles necessary for calculating the template and $N_{dem}$ is the number of clock cycles necessary for demodulating one symbol. There is room left for performance optimizations and for meeting the frequency constraint. It is likely that a typical system will then be implemented with a mixed parallel-sequential architecture.
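The general constraint can be turned into a small design aid. The Python sketch below is illustrative only: the function name is hypothetical, $N_r = 2$ and $N_d = 8$ are arbitrary values chosen for the example, and the duplicated structure is assumed to halve both $N_t$ and $N_{dem}$ exactly, ignoring any overhead for combining partial results.

```python
from fractions import Fraction

def min_demod_clock(n_t, n_dem, n_r, n_d, n_f, f_s_hz):
    """Smallest f_ck (Hz) with (N_t + N_d*N_dem)*T_ck <= (N_d + N_r)*N_f*T_s."""
    return Fraction(n_t + n_d * n_dem, (n_d + n_r) * n_f) * f_s_hz

if __name__ == "__main__":
    n_r, n_d, n_w, n_f, f_s = 2, 8, 100, 1000, 10**9
    # Fully sequential: N_t = N_r*N_w cycles, N_dem = N_w cycles
    print(min_demod_clock(n_r * n_w, n_w, n_r, n_d, n_f, f_s))            # 100 MHz
    # Two replicated units (assumed to halve both cycle counts)
    print(min_demod_clock(n_r * n_w // 2, n_w // 2, n_r, n_d, n_f, f_s))  # 50 MHz
```

In the fully sequential case the result coincides with the bound $f_{ck} \ge (N_w/N_f) f_s$; each doubling of the hardware, under the stated assumption, halves the required clock frequency.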
5.3. RTL Models and Simulations
The digital back-end of two TR receivers, that is, the part from the A/D output to the demodulated data, has been designed at the register-transfer level (RTL). The first receiver is based on a high-performance parallel implementation, while the second one is characterized by a minimum-area sequential implementation. Both were described in the VHDL hardware description language and tested on the IEEE 802.15.3a channel. We verified by simulation a perfect agreement with the performance predicted by the behavioral model, written as a C program. The sequential implementation has also been synthesized and mapped onto a field-programmable gate array (FPGA) device (Xilinx XCV600) and successfully tested in the field with a board equipped with that FPGA device. At present, we are working on the synthesis and place-and-route on a 0.13 µm silicon technology.
6. Conclusions
In this paper, we discussed the implementation challenges for a UWB transmitted-reference receiver. A detailed analysis of the performance as a function of the many system and environmental parameters has been carried out. The issues and challenges for an integrated circuit implementation have been discussed. An analysis of the complexity and of the architectural constraints for meeting the system-level constraints is also proposed in the paper. The architecture described has been implemented and verified on an FPGA-equipped board. An application-specific integrated circuit (ASIC) version on an advanced 0.13 µm CMOS technology is currently under development.
Acknowledgment
This work has been sponsored by MIUR (Italian
Ministry of Education and Research) under the project
PRIMO.
Fig. 13. Example of sequential demodulation architecture.
Authors’ Biographies
Mario R. Casu received his Electronics Engineer degree (summa cum laude) in 1998 and his Ph.D. in 2002 from the Politecnico di Torino, Italy. He is now an assistant professor at the Politecnico di Torino, Department of Electronics, and a member of the VLSI laboratory and of CERCOM (center of excellence for multimedia radiocommunications). In 2001, he was with STMicroelectronics Central R&D, working on SRAM development and CMOS library characterization on a partially-depleted 0.13 micron SOI technology. His research interests are in the field of digital CMOS circuit modeling and the design of systems-on-chip for ultra wideband applications. He is the co-author of several papers published in international conference proceedings and journals.

Giuseppe Durisi was born in Torino (Italy) in 1977. He received his Laurea degree (summa cum laude) from the Politecnico di Torino in 2001. In January 2002, he joined Istituto Superiore Mario Boella, where he works on the MIUR-financed national project PRIMO. Since January 2003, he has been pursuing his Ph.D. at the Department of Electronics of the Politecnico di Torino, under the supervision of Professor Sergio Benedetto. In 2002, he spent 6 months as a visiting researcher at IMST, Kamp-Lintfort, Germany, working on the IST FP5 project Whyless.com. In 2004, he was a visiting student at the University of Pisa under the supervision of Professor Umberto Mengali, and he is currently a visiting student at the Swiss Federal Institute of Technology (ETH) under the supervision of Professor Helmut Bölcskei. His fields of interest are ultra wideband and channel coding for fading channels.
