A 3.52 Gb/s mmWave Baseband with Delayed Decision Feedback Sequence Estimation in 40 nm by Preyss, Nicholas Alexander et al.
Nicholas Preyss, Christian Senning, and Andreas Burg
School of Engineering
Telecommunications Circuits Laboratory
EPF Lausanne, 1015 Lausanne, Switzerland
Wei-Chang Liu, Chun-Yi Liu, and Shyh-Jye Jou
Department of Electronics Engineering
Institute of Electronics
National Chiao Tung University, Hsinchu, Taiwan R.O.C.
Abstract—We present a digital baseband ASIC for 60 GHz single-
carrier (SC) transmission that is optimized for communication scenarios
in which most of the energy is concentrated in the first few channel taps.
Such scenarios occur for example in office environments with strong
reflections. Our circuit targets close-to-optimum maximum-likelihood
performance under such conditions. To this end, we show for the first time
how a reduced-state-sequence-estimation algorithm can be realized for
the 1760 MHz bandwidth of the IEEE 802.11ad standard. The equalizer
is complemented in the frontend by a synchronization unit for frequency
offset compensation as well as a Golay-sequence-based channel estimator
and in the backend by a low-density parity-check (LDPC) decoder. In
40 nm CMOS we achieve a measured data rate of up to 3.52 Gb/s using
QPSK modulation.
I. INTRODUCTION
With the continuous growth of the Wi-Fi market over recent years,
it has become apparent that band congestion and limited throughput are
the major obstacles to further proliferation. Operation at mmWave
frequencies is a promising candidate for high-throughput,
short-distance wireless applications, since the large amount of available
spectrum at 60 GHz together with high spectral diversity offers a
significant increase in capacity.
The exploration of mmWave for wireless applications goes along
with a renaissance of single-carrier (SC) modulation, due to its low
requirements on the analog front-end. Many different architectures
have been proposed for SC mmWave systems [1]–[4], but to the best
of our knowledge, the feasibility of sequence estimation has not been
explored for such an application. Full sequence estimation (SE)
achieves maximum-likelihood (ML) performance and hence has
an inherent performance advantage over any linear equalizer in the
presence of inter-symbol interference. Unfortunately, the complexity
of full SE easily becomes infeasible because the number of trellis states
grows exponentially with the delay spread, with the modulation order as
the base. The proposed reduced-state SE can approach ML performance
while keeping the complexity at a tolerable level.
The paper is structured as follows. Section II defines the
system assumptions for our mmWave receiver. Section III describes
the proposed baseband architecture and the algorithms used.
Section IV presents details of the VLSI implementation and
measurement results of the fabricated chips.
II. SYSTEM CONSIDERATIONS
We assume symbol-spaced samples with a frame-format as specified
in the IEEE 802.11ad standard [5] as input. A frame starts with a
preamble for synchronization and channel estimation. A 2176 symbol
long short training field (STF) offers 17 periodic Golay sequences
of length 128 for time and frequency synchronization. The STF is
followed by a 1152 symbol long channel estimation field (CEF)
consisting of two 512 symbol long complementary Golay sequences
and a cyclic postfix.
This work was supported in part by Ministry of Science and Technology
of Taiwan under grant number MOST104-2220-E-009-013, TSMC university
shuttle program and National Chip Implementation Center
Fig. 1: Block diagram of the proposed 60 GHz baseband based on
reduced state sequence estimation.
TABLE I: mmWave Single-Carrier Baseband Overview
DDFSE
  Modulation                  BPSK/QPSK
  Trellis Length              3
  Individual State Feedback   4
  Common History Feedback     22
  11ad pilot support          yes
LDPC
  Schedule                    Layered
  Codeword Size               672
  Code Rate                   1/2, 5/8, 3/4, 13/16
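The reason complementary Golay sequences suit channel estimation can be illustrated with a short sketch. It builds a generic length-512 complementary pair by recursive concatenation (an assumption for illustration; these are not the exact sequences defined in the standard), checks that the sum of the two autocorrelations is a single peak, and recovers a hypothetical channel `h` under the simplifying assumptions of isolated sequences and no noise.

```python
import numpy as np

def golay_pair(length):
    """Build a binary complementary Golay pair by recursive concatenation."""
    a = np.array([1.0])
    b = np.array([1.0])
    while len(a) < length:
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

def estimate_channel(y_a, y_b, ga, gb, n_taps):
    """Correlate each received half with its sequence and add the results."""
    n = len(ga)
    corr = np.correlate(y_a, ga, mode="full") + np.correlate(y_b, gb, mode="full")
    # Lag 0 sits at index n - 1 of the "full" correlation output.
    return corr[n - 1:n - 1 + n_taps] / (2 * n)

ga, gb = golay_pair(512)

# Complementary property: the autocorrelation sidelobes cancel exactly.
r = np.correlate(ga, ga, mode="full") + np.correlate(gb, gb, mode="full")
peak = r[len(ga) - 1]
sidelobes = np.delete(r, len(ga) - 1)

# Hypothetical 4-tap channel, noise-free, sequences sent in isolation.
h = np.array([1.0 + 0.2j, 0.5 - 0.3j, 0.25j, -0.1])
h_est = estimate_channel(np.convolve(ga, h), np.convolve(gb, h), ga, gb, len(h))
```

Because the sidelobes cancel, the correlator output is a scaled copy of the channel impulse response; in the actual frame the cyclic postfix and the preceding field take the place of the zero padding assumed here.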
The subsequent payload is structured such that always 448 data
symbols are preceded by a fixed pilot word. These pilot words give the
payload field a regular block structure which decouples the detection
of symbols which are part of different blocks.
III. MMWAVE DDFSE SINGLE-CARRIER BASEBAND
The top-level block diagram of our digital baseband is shown in
Fig. 1 with the features summarized in Table I. The design features
a synchronization unit for time synchronization and frequency offset
(FO) compensation based on the STF of the preamble. A subsequent
channel estimation (CE) unit performs time-domain estimation of
the channel impulse response (CIR). Equalization and detection of
the BPSK or QPSK data symbols of the payload is performed using
delayed decision feedback sequence estimation (DDFSE). To this end,
consecutive blocks of the payload field are buffered and demultiplexed
to two soft-output half-rate Radix-2 DDFSE cores. The resulting
bitstreams are multiplexed after stripping off the pilot words. The
computed log-likelihood ratios for the individual coded payload bits
are forwarded in chunks of code words to the subsequent LDPC channel
decoder.
As the targeted symbol rate of the proposed mmWave system
exceeds the clock frequencies at which digital signal processing can
be implemented in a power-efficient way, the design is based on a
2-way parallel architecture clocked at half the symbol rate.
Fig. 2: Block diagram of the auto-correlation based parameter
estimation stage in the synchronization unit.
A. Synchronization & Channel Estimation
A joint preamble/boundary detection (PD/BD) and carrier/sampling
frequency offset (CFO/SFO) estimation scheme was proposed in our
previous design [6]. The detection and estimation scheme is based
on the auto-correlation structure illustrated in Fig. 2. The structure
first compares the normalized auto-correlation with a predefined
threshold to detect the start of the preamble. It then calculates the
phase of the auto-correlation result to estimate the FO of the carrier
and the sampling clock. Instead of using a longer correlation, this
structure reuses the length-128 delay line of the preamble detection
(the same length as the Golay sequence) in order to avoid the cost of
an additional delay line. To improve the accuracy of the frequency
offset estimation, an accumulator collects the correlation results to
reduce the effect of noise. After detection and estimation,
the estimated result is processed to generate the 2-way compensation
signals at the same time. The interpolator and phase de-rotator then
compensate the effects of SFO and CFO. The Golay-sequence-based
CEF of the preamble is forwarded to the channel estimator (CE) in
order to perform a time-domain channel estimation. Using Golay
sequences for channel estimation as proposed in [7] allows the use of
a high-speed, multiplier-free correlation architecture called the
efficient Golay correlator (EGC) [8].
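The auto-correlation-based detection and frequency offset estimation described above can be sketched as follows. This is a minimal floating-point illustration of the general delay-and-correlate principle, not the circuit of [6]; the function name, the use of the 1.76 GHz symbol rate as sampling rate, and the noise-free test signal are assumptions.

```python
import numpy as np

def detect_and_estimate(rx, period=128, symbol_rate=1.76e9):
    """Delay-and-correlate over a periodic preamble.

    Returns the normalized correlation metric (close to 1 inside the
    periodic preamble) and the carrier frequency offset in Hz.
    """
    # Correlate the signal with a copy of itself delayed by one period.
    corr = np.sum(rx[period:] * np.conj(rx[:-period]))
    energy = np.sum(np.abs(rx[period:]) ** 2)
    metric = np.abs(corr) / energy
    # A frequency offset f rotates the phase by 2*pi*f*period/symbol_rate
    # between samples that are one period apart.
    cfo_hz = np.angle(corr) * symbol_rate / (2 * np.pi * period)
    return metric, cfo_hz

# Hypothetical test signal: 17 repetitions of a +/-1 sequence with 1 MHz CFO.
rng = np.random.default_rng(0)
seq = rng.choice([1.0, -1.0], size=128)
tx = np.tile(seq, 17)
n = np.arange(len(tx))
rx = tx * np.exp(2j * np.pi * 1e6 * n / 1.76e9)
metric, cfo_hz = detect_and_estimate(rx)
```

Summing the products over many samples plays the role of the accumulator mentioned in the text. With a lag of 128 samples the unambiguous estimation range of this sketch is limited to offsets below symbol_rate / (2 * 128) in magnitude.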
After estimation of the channel impulse response (CIR) the fine
synchronization unit aligns the symbol stream to the trellis of
the sequence estimation in order to maximize the performance
of the sequence estimator. As a power saving measure the fine
synchronization unit can additionally mask weak feedback taps of the
CIR which are dominated by noise, before passing it to the detector
units.
B. Delayed Decision Feedback Sequence Estimation
The implemented DDFSE [9] algorithm approximates the maximum-
likelihood solution by limiting the sequence estimation to a reduced
state space that covers only the first D = 3 taps of the channel. The
interference from up to V = 26 following, late-arrival (i.e., often
weaker) taps is eliminated by an additional decision feedback (DF)
stage. The expected interference term from each of the remaining taps
is calculated by multiplying the estimated transmitted symbol derived
from the decision history with the corresponding channel coefficient
and used as feedback term for the input signal.
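The principle can be sketched in a few lines. The following is a simplified hard-decision DDFSE for BPSK, written for readability rather than speed; it is not the soft-output QPSK implementation of the chip. The function name, the example channel, and the assumption that known +1 symbols precede the data are hypothetical choices for this illustration.

```python
import numpy as np
from itertools import product

def ddfse_bpsk(y, h, D=3):
    """Hard-decision DDFSE for BPSK; the trellis spans the first D >= 2 taps."""
    h = np.asarray(h, dtype=complex)
    L = len(h)
    # State = the D-1 most recent symbols, newest first.
    states = list(product([1.0, -1.0], repeat=D - 1))
    index = {s: i for i, s in enumerate(states)}
    # Assumption: L-1 known +1 symbols precede the data, so only the
    # all-(+1) state is reachable at the start.
    surv = [np.ones(L - 1) for _ in states]
    pm = np.full(len(states), np.inf)
    pm[0] = 0.0
    for yn in y:
        new_pm = np.full(len(states), np.inf)
        new_surv = [None] * len(states)
        for s, state in enumerate(states):
            if np.isinf(pm[s]):
                continue
            # x[n-1], ..., x[n-L+1] along this survivor. The first D-1
            # entries equal the trellis state; the remaining ones are the
            # survivor's own decisions and form the decision feedback.
            past = surv[s][::-1][:L - 1]
            isi = np.dot(h[1:], past)
            for x in (1.0, -1.0):
                bm = abs(yn - h[0] * x - isi) ** 2        # branch metric
                nxt = index[(x,) + state[:-1]]
                if pm[s] + bm < new_pm[nxt]:              # add-compare-select
                    new_pm[nxt] = pm[s] + bm
                    new_surv[nxt] = np.append(surv[s], x)
        pm, surv = new_pm, new_surv
    return surv[int(np.argmin(pm))][L - 1:]

# Hypothetical 5-tap channel: D = 3 trellis taps and 2 feedback taps.
rng = np.random.default_rng(1)
h = np.array([1.0, 0.6, 0.4, 0.2, 0.1])
data = rng.choice([1.0, -1.0], size=64)
tx = np.concatenate([np.ones(len(h) - 1), data])
y = np.convolve(tx, h)[len(h) - 1:len(h) - 1 + len(data)]
detected = ddfse_bpsk(y, h, D=3)
```

With D equal to the channel length the sketch degenerates to a full Viterbi sequence estimator, while a small D keeps the number of states fixed and moves the remaining taps into the per-survivor feedback term.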
A bit-error rate (BER) performance comparison of different
proposed receivers for mmWave applications in Fig. 3 shows the
advantage of the DDFSE over linear equalizers. The simulations
are performed with the near-location cubicle (CB) scenario of the
IEEE 802.11ad channel model [10], as this would be a typical use
case for a mmWave-equipped mobile device at the workplace. In this
scenario the LOS path is immediately followed by strong reflections
arising from the compact geometries. The simulations show that the
DDFSE can exploit not only the energy of the LOS path but also
the energy of subsequent reflections and hence provides significantly
better BER performance than a 1-tap phase rotator with DF
as well as an 8-tap MMSE linear equalizer with DF.
Fig. 3: Bit-error rate (BER) comparison between a 1-tap phase
rotator (PR) with DF, an 8-tap linear equalizer (LE) with DF, and
the proposed DDFSE in an office environment.
Fig. 4: Block diagram of a basic DDFSE unit without applied
architectural transformations (elements identical to a mSOVA are
blue).
1) Basic Architecture: The straightforward architecture of a DDFSE
unit before the necessary architectural transformations to meet timing
is depicted in Fig. 4. The hardware consists of a modified soft-
output Viterbi Algorithm (mSOVA) unit as proposed in [11] and
the components of the additional DF unit.
mSOVA: The mSOVA part comprises a branch metric (BM) unit,
a combined add-compare-select and path-metric (ACS/PM) unit with
modulo normalization, and a soft-output register exchange. The latter
stores the ACS decisions and the differences of the path metrics
to compute approximate max-log log-likelihood-ratios [11] for the
LDPC decoder during the traceback operation. BPSK modulation is
supported by disabling subcircuits that would be required to support
QPSK modulation.
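Modulo normalization in an ACS unit is commonly realized by letting the path metrics wrap around in a fixed word length and comparing them through the sign of their wrapped difference. The sketch below illustrates this general idea; the word length W = 8 and the function names are assumptions and are not taken from the implemented design.

```python
W = 8                       # hypothetical path-metric word length in bits
MASK = (1 << W) - 1

def wrap(value):
    """Keep only the W least significant bits (unsigned wrap-around)."""
    return value & MASK

def less_than_mod(a, b):
    """True if wrapped metric a is smaller than wrapped metric b.

    Valid as long as the true metrics differ by less than 2**(W - 1).
    """
    diff = wrap(a - b)
    return diff >= (1 << (W - 1))   # MSB set: the difference is negative

# True metrics 250 and 260: the second one wraps around to 4.
a = wrap(250)
b = wrap(260)
result = less_than_mod(a, b)
```

Because only metric differences matter for the compare-select step, no explicit subtraction of a common offset from all path metrics is needed.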
Decision-Feedback: The DF stage supports up to V = 26 DF taps
and is split into an individual state history feedback (ISHF) and a
common history feedback (CHF) part to reduce its complexity (with
no visible performance penalty). The former comprises only the first
I = 4 DF taps which are computed individually for each state, while
the latter comprises the remaining V − I = 22 DF taps where the
individual paths have already merged with high probability. The ISHF
comprises a dedicated decision register exchange operating in parallel
to the soft-output register exchange and an associated individual state-
history feedback unit. The latter calculates and accumulates a feedback
term based on the decision-history for each state to obtain the per-state
part of the decision feedback. The common history keeps only a single
set of decisions in a shift register that is fed from the first-state output
of the decision register exchange. The common history feedback unit
performs the multiply-accumulate operation of these decisions with
the associated channel taps to obtain the common history decision
feedback value. The segmentation of the decision history in ISHF
and CHF lowers the memory requirements by 79%.
Fig. 5: Block diagram of the optimized DDFSE architecture after
rearranging BM operations according to their data dependencies.
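The 79% figure can be reproduced with a short calculation. The sketch assumes a trellis with 4^(D-1) = 16 states for QPSK and counts one stored decision per feedback tap; both the state count and this way of counting are assumptions made for the illustration.

```python
D = 3        # trellis length (taps covered by the sequence estimation)
V = 26       # total number of decision feedback taps
I = 4        # taps handled by the individual state history feedback
M = 4        # QPSK constellation size

n_states = M ** (D - 1)                  # assumed number of trellis states

# Without segmentation every state keeps its own history of V decisions.
full_history = n_states * V

# With segmentation only I decisions are kept per state, while the
# remaining V - I decisions are stored once in the common history.
split_history = n_states * I + (V - I)

saving = 1 - split_history / full_history
print(f"{full_history} -> {split_history} decisions, saving {saving:.1%}")
```

Under these assumptions the stored history shrinks from 416 to 86 decisions, which matches the reduction stated in the text.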
2) Optimization for Timing Closure: The architecture in Fig. 4
requires the least number of operations, but timing closure at the
half-rate target frequency of 880 MHz is difficult. The critical path is
the feedback loop that starts with the computation of the DF terms,
continues to the application of the DF to the received samples and
through the BM unit and ends in the update of the path metric registers
in the ACS/PM unit.
Re-ordering of operations to accommodate late-arriving signals,
careful re-timing, and speculative pre-computation of multiple
candidate intermediate results followed by rapid late selection of the
correct result are key techniques to remove this timing bottleneck.
As a first measure, we move the computations of the BM of the
mSOVA to the beginning of the chain (pre-BM) and add a pipeline
register afterwards since this part involves no feedback loops. To also
include the subtraction of the late-arriving common history decision
feedback value in this first pipeline stage, we split the computation
of this term into a primary component, which contains only the first
product, and a secondary component, which contains all remaining
products of the sum-of-products. The primary component cannot be
retimed and is therefore subtracted after the first pipeline stage,
while the late-arriving secondary component can easily be retimed and
included in the pre-BM unit. At the cost of 400 instead of 145
arithmetic operations, the proposed latency-aware re-ordering of the
architecture as shown in Fig. 5 cuts the critical path significantly.
After this retiming, the ISHF path becomes critical, but cannot be
retimed immediately since each sum-of-products depends directly on
the decision outputs of the ACS/PM unit (1st order feedback loop)
through the MUXes in the register-exchange (cf. Fig. 6a). Speculative
pre-calculation of the four possible feedback values (corresponding
to the four possible ACS decisions in case of QPSK) removes this
restriction: an entire cycle can be dedicated to the DF calculation,
and only a single multiplexer, which chooses from the precalculated
values (cf. Fig. 6b), remains in the feedback path. Interestingly, this
transformation comes with very little additional complexity. In fact,
only 16 additional four-way multiplexers and the memory for the
feedback signals are added, as the number of required sum-product
calculations does not change.
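The difference between the direct and the speculative ordering can be illustrated with a small behavioral sketch. It only shows that selecting among precomputed feedback values gives the same result as selecting the history first; the function names, the random taps, and the random candidate histories are hypothetical.

```python
import numpy as np

QPSK = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)

def feedback(taps, history):
    """Sum-of-products of the feedback taps with a decision history."""
    return np.dot(taps, history)

def direct_update(taps, pred_histories, acs_decision):
    """Wait for the ACS decision, select the history, then compute."""
    return feedback(taps, pred_histories[acs_decision])

def speculative_candidates(taps, pred_histories):
    """Compute the feedback for every possible ACS decision in advance."""
    return np.array([feedback(taps, hist) for hist in pred_histories])

rng = np.random.default_rng(2)
taps = rng.normal(size=4) + 1j * rng.normal(size=4)     # I = 4 ISHF taps
# One candidate history per possible ACS decision of a QPSK trellis state.
pred_histories = QPSK[rng.integers(0, 4, size=(4, 4))]

candidates = speculative_candidates(taps, pred_histories)
acs_decision = 2                     # becomes available late in the cycle
selected = candidates[acs_decision]  # only a 4-way selection remains
```

In the direct form the sum-of-products sits behind the ACS decision, whereas in the speculative form only the final selection depends on it.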
(a) Direct implementation
(b) Speculative implementation
Fig. 6: Transformation of the critical path through the ISHF unit by
using speculative pre-calculation of possible feedback values.
Finally, generation of the different feedback signals still requires
sums over a considerable number of individual feedback terms, each
calculated as product of a channel coefficient with the corresponding
decision in the path history. Instead of re-computing these sums-of-
products with each update of the corresponding decision histories, the
different feedback signals are generated with the help of reconfigurable
look-up tables (LUTs). After channel estimation these register-based
LUTs are populated with the possible values and subsequently clock-
gated during the actual detection phase. Due to the small number of
taps, the ISHF is generated by a single LUT, while for the significantly
longer CHF, a hybrid approach with the LUTs storing partial sums is
used. The LUTs are always initialized for QPSK, for BPSK only a
subset is used.
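A behavioral sketch of the LUT idea follows. A single table holds all 4^4 = 256 possible ISHF sums for QPSK, and the longer CHF is served by several small tables that store partial sums. The group size of two taps per partial-sum table, the constellation mapping, and all names are assumptions for this illustration and are not taken from the implemented design.

```python
import numpy as np
from itertools import product

QPSK = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)

def build_lut(taps):
    """Precompute sum_k taps[k] * QPSK[idx[k]] for every index tuple."""
    lut = np.empty((4,) * len(taps), dtype=complex)
    for idx in product(range(4), repeat=len(taps)):
        lut[idx] = sum(t * QPSK[i] for t, i in zip(taps, idx))
    return lut

def build_partial_luts(taps, group):
    """One small LUT per group of taps, each storing partial sums."""
    return [build_lut(taps[i:i + group]) for i in range(0, len(taps), group)]

def lookup_partial(luts, decisions, group):
    """Add the partial sums selected by the decision history."""
    total = 0j
    for g, lut in enumerate(luts):
        total += lut[tuple(decisions[g * group:(g + 1) * group])]
    return total

rng = np.random.default_rng(3)

# ISHF: I = 4 taps, one LUT with 256 entries filled after channel estimation.
ishf_taps = rng.normal(size=4) + 1j * rng.normal(size=4)
ishf_lut = build_lut(ishf_taps)
ishf_decisions = (0, 3, 1, 2)
ishf_value = ishf_lut[ishf_decisions]

# CHF: 22 taps, hybrid approach with 11 partial-sum LUTs of 16 entries.
chf_taps = rng.normal(size=22) + 1j * rng.normal(size=22)
chf_luts = build_partial_luts(chf_taps, group=2)
chf_decisions = rng.integers(0, 4, size=22)
chf_value = lookup_partial(chf_luts, chf_decisions, group=2)
```

A single table over all 22 CHF taps would need 4^22 entries, which is why splitting the sum into partial sums is the only practical option for the longer history.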
C. Low-density Parity Check (LDPC) Channel Decoder
Channel decoding is performed by an IEEE 802.11ad-compliant
soft-input LDPC decoder. The decoder supports a code-word size of
672 bits and all 4 code rates (1/2, 5/8, 3/4, 13/16) specified in the
standard.
Fast decoding convergence is achieved by using a layered schedule
for processing the parity checks. The layered architecture makes use
of the quasi-cyclic structure of the LDPC code to process all 42
parity checks of a layer in parallel. In order to achieve the required
throughput an additional level of parallelism is required. The columns
of the parity check matrices are analyzed and partitioned into two
sets. As a result each layer can be processed in two units in parallel
and the throughput is nearly doubled.
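The benefit of the quasi-cyclic structure for parallel processing can be illustrated with a small sketch. It assumes that each layer consists of cyclically shifted 42 x 42 identity blocks or all-zero blocks, so that all 42 checks of a layer are obtained by rotating 42-bit sub-blocks of the codeword. The shift values below are hypothetical and are not taken from the parity check matrices of the standard; the sketch only computes hard-decision syndromes and is not a decoder.

```python
import numpy as np

Z = 42                   # parity checks per layer (from the text)
N_BLOCKS = 672 // Z      # 16 sub-blocks of 42 bits per codeword

def layer_syndrome(bits, shifts):
    """Syndrome of one layer; a negative shift marks an all-zero block."""
    blocks = bits.reshape(N_BLOCKS, Z)
    syn = np.zeros(Z, dtype=np.uint8)
    for j, s in enumerate(shifts):
        if s >= 0:
            syn ^= np.roll(blocks[j], -s)   # updates all 42 checks at once
    return syn

def expand_layer(shifts):
    """Explicit 42 x 672 binary matrix of the same layer, for checking."""
    H = np.zeros((Z, N_BLOCKS * Z), dtype=np.uint8)
    for j, s in enumerate(shifts):
        if s >= 0:
            H[:, j * Z:(j + 1) * Z] = np.roll(np.eye(Z, dtype=np.uint8), s, axis=1)
    return H

# Hypothetical shift values for one layer (-1 = all-zero block).
shifts = [40, -1, 38, -1, 13, -1, 5, -1, 18, -1, -1, -1, 0, 7, -1, -1]
rng = np.random.default_rng(4)
bits = rng.integers(0, 2, size=N_BLOCKS * Z, dtype=np.uint8)

syn_fast = layer_syndrome(bits, shifts)
syn_ref = (expand_layer(shifts) @ bits) % 2
```

Partitioning the block columns into two sets, as described in the text, corresponds to splitting the loop over `shifts` between two processing units.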
Double-buffering of the memory structures assures continuous
operation. An integrated early termination mechanism saves energy
in the high SNR regimes. A more detailed description of the LDPC
architecture is given in [12].
IV. VLSI IMPLEMENTATION & MEASUREMENT
The proposed baseband design is fabricated in 40 nm CMOS
technology. A micrograph of the fabricated chip is shown in Fig. 7.
Fig. 7: Micrograph of the manufactured die. The chip accommodates
two independent designs in a single pad frame.
TABLE II: Key facts of the VLSI implementation
CMOS Technology        40 nm
Core Area              2.7 mm² (Utilization: 67%)
Gate Count
  Synchronization      139 kGE
  Channel Estimator    64 kGE
  DDFSE core I+II      1443 kGE
  LDPC Decoder         363 kGE
  Clock Tree           75 kGE
  Reset Tree           25 kGE
  Control&Buffer       97 kGE
  Test Structures      457 kGE
  Total                2563 kGE
The available core area is 2.7 mm² and the design was placed with
a core utilization of 67 %. The core area additionally contains an
internal core clock generation and dedicated test structures. Details
on the complexity of the different blocks are given in Table II.
The test structure comprises a command buffer and memory for the
test vectors to overcome limitations of the I/O interface. The memory
is not only used for the input samples of the baseband, but also stores
the output of the channel decoder.
Measurements of the fabricated chip were performed in order to
verify its correct operation. The required throughput was achieved
with a core supply voltage of 1.15 V. This voltage constitutes an upper
bound because, due to limitations of the measurement setup, not all of
the IR drop on the test setup could be accounted for. The given
throughput values specify the raw throughput. The net data rate depends
on the pilot word scheme and code rate used. The comparison of the
results in Table III shows that the proposed DDFSE does not support
modulation orders as high as those of previously proposed
linear-equalizer-based basebands. Nevertheless, for low modulation
orders the DDFSE can achieve better performance with
little additional complexity overhead.
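The relation between raw and net data rate can be made concrete with a short calculation. The raw rate follows from the 1760 MHz symbol rate and the bits per symbol. For the net rate, the sketch assumes a pilot word of 64 symbols per 448 data symbols (the value used by the 512-symbol blocks of the standard, not stated in this paper) and ignores preamble and header overhead.

```python
from fractions import Fraction

SYMBOL_RATE = 1760e6                  # symbols per second
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2}
DATA_SYMBOLS = 448                    # data symbols per block (from the text)
PILOT_SYMBOLS = 64                    # assumed pilot word length per block
CODE_RATES = [Fraction(1, 2), Fraction(5, 8), Fraction(3, 4), Fraction(13, 16)]

def raw_rate(modulation):
    """Raw throughput in bit/s, counting every transmitted symbol."""
    return SYMBOL_RATE * BITS_PER_SYMBOL[modulation]

def net_rate(modulation, code_rate, pilot_symbols=PILOT_SYMBOLS):
    """Throughput after removing pilot words and LDPC redundancy."""
    block_efficiency = DATA_SYMBOLS / (DATA_SYMBOLS + pilot_symbols)
    return raw_rate(modulation) * block_efficiency * float(code_rate)

for rate in CODE_RATES:
    print(f"QPSK, rate {rate}: {net_rate('QPSK', rate) / 1e9:.4f} Gb/s net")
```

For QPSK the raw rate evaluates to the 3.52 Gb/s quoted in this paper; the net values depend on the assumed pilot word length.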
V. CONCLUSION
A full single-carrier baseband system with synchronization,
channel estimation, detection based on reduced-state sequence
estimation, and channel decoding has been presented. The system
demonstrates the feasibility of sequence estimation for Gb/s wireless
systems. Furthermore, it was shown that the DDFSE algorithm can provide
considerable BER performance gains in small structured environments
with strong close-in reflections immediately after the LOS path.
TABLE III: Comparison of mmWave SC receivers
[1] [2] [3] [4] This Work
BPSK BPSK BPSK BPSK BPSK
QPSK QPSK QPSK QPSK
16QAM 16QAM
Sync no yes yes yes yes
CE yes no yes yes yes
EQ LE 1-tap OS-FDE RLS DDFSE
DF yes no no no yes
Code no LDPC LDPC LDPC LDPC
CMOS 65 nm 40 nm 40 nm 65 nm 40 nm
core 2.34 mm² 1.15 mm² 46.62 mm² (a) 16 mm² (b) 2.7 mm²
rate 2 Gb/s 6.3 Gb/s 1.8 Gb/s 4.6 Gb/s 3.52 Gb/s (c)
(a) including ADC/DAC, TX, MAC
(b) including ADC/DAC, TX, uP
(c) at 1.15 V core supply
In order to meet the stringent timing requirements of the system,
an architecture based on two half-rate Radix-2 DDFSE cores was
implemented. Furthermore, speculative pre-calculation combined with
re-ordering of arithmetic operations and the introduction of pipeline
stages improves the timing of the critical path significantly. Look-up
tables with pre-calculated intermediate results are used in the decision
feedback section to increase the power efficiency and further reduce
the critical path.
A multi-parallel QC-LDPC decoder provides the required channel
decoding throughput. Early termination, clock gating, and a layered
processing schedule improve the energy efficiency of the decoder.
REFERENCES
[1] J.-H. Park, B. Richards, and B. Nikolic, “A 2 Gb/s 5.6 mW digital
LOS/NLOS equalizer for the 60 GHz band,” IEEE J. Solid-State Circuits,
vol. 46, no. 11, pp. 2524–2534, 2011.
[2] K. Okada et al., “Full four-channel 6.3-Gb/s 60-GHz CMOS transceiver
with low-power analog and digital baseband circuitry,” IEEE J. Solid-State
Circuits, vol. 48, no. 1, pp. 46–65, Jan. 2013.
[3] N. Saito et al., “A fully integrated 60-GHz CMOS transceiver chipset
based on WiGig/IEEE 802.11 ad with built-in self calibration for mobile
usage,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3146–3159,
2013.
[4] K. Ma et al., “An integrated 60GHz low power two-chip wireless system
based on IEEE802.11ad standard,” in Microwave Symposium (IMS), 2014
IEEE MTT-S International, June 2014, pp. 1–4.
[5] “IEEE standard - part 11 amendment 3: Enhancements for very high
throughput in the 60 GHz band,” IEEE Std 802.11ad-2012, pp. 1–628,
Dec 2012.
[6] W.-C. Liu, T.-C. Wei, Y.-S. Huang, C.-D. Chan, and S.-J. Jou, “All-
digital synchronization for SC/OFDM mode of IEEE 802.15.3c and
IEEE 802.11ad,” IEEE Trans. Circuits Syst. I, vol. 62, no. 2, pp. 545–
553, Feb 2015.
[7] R. Kimura et al., “Golay sequence aided channel estimation for
millimeter-wave WPAN systems,” in Personal, Indoor and Mobile
Radio Communications, 2008. PIMRC 2008. IEEE 19th International
Symposium on. IEEE, 2008, pp. 1–5.
[8] B. M. Popovic, “Efficient Golay correlator,” Electronics Letters, vol. 35,
no. 17, pp. 1427–1428, 1999.
[9] A. Duel-Hallen and C. Heegard, “Delayed decision-feedback sequence
estimation,” IEEE Trans. Commun., vol. 37, no. 5, pp. 428–436, 1989.
[10] A. Maltsev et al., Channel Models for 60 GHz WLAN
Systems, 8th ed., IEEE, May 2010. [Online]. Available:
https://mentor.ieee.org/802.11/dcn/09/11-09-0334-08-00ad-channel-models-for-60-ghz-wlan-systems.doc
[11] M. P. Fossorier, F. Burkert, S. Lin, and J. Hagenauer, “On the equivalence
between SOVA and max-log-MAP decodings,” Communications Letters,
IEEE, vol. 2, no. 5, pp. 137–139, 1998.
[12] A. Balatsoukas-Stimming et al., “A parallelized layered QC-LDPC decoder
for IEEE 802.11 ad,” in New Circuits and Systems Conference (NEWCAS),
2013 IEEE 11th International. IEEE, 2013, pp. 1–4.
