A Signal Processor for Gaussian Message Passing by Kröll, Harald et al.
A Signal Processor for Gaussian Message Passing
Harald Kro¨ll∗, Stefan Zwicky∗, Reto Odermatt∗, Lukas Bruderer†, Andreas Burg‡, Qiuting Huang∗
∗Integrated Systems Laboratory, ETH Zurich
†Signal and Information Processing Laboratory, ETH Zurich
‡Telecommunication Circuits Laboratory, EPF Lausanne
Email: {kroell,zwicky,huang}@iis.ee.ethz.ch, bruderer@isi.ee.ethz.ch, andreas.burg@epfl.ch
Abstract—In this paper, we present a novel signal processing
unit built upon the theory of factor graphs, which is able to
address a wide range of signal processing algorithms. More
specifically, the demonstrated factor graph processor (FGP) is
tailored to Gaussian message passing algorithms. We show how
to use a highly configurable systolic array to solve the message
update equations of nodes in a factor graph efficiently. A proper
instruction set and compilation procedure is presented. In a
recursive least squares channel estimation example we show that
the FGP can compute a message update faster than a state-of-
the-art DSP. The results demonstrate the usabilty of the FGP
architecture as a flexible HW accelerator for signal-processing
and communication systems.
I. INTRODUCTION
Computationally demanding tasks such as channel estima-
tion or equalization are omnipresent in modern communi-
cation and signal-processing systems. Algorithms for such
tasks might be implemented on hardware accelerators, on
a programmable processor such as a DSP or on an ASIP
with a custom instruction set. Various attempts have been
reported to bridge the gap between hardwired accelerators
and programmable processors, where the trade-offs between
performance and programmability have been subject of many
studies [1]. The goal of this work is to design a signal
processor or accelerator with an appropriate instruction set,
which offers sufficient flexibility to perform a wide range of
common signal-processing tasks. To achieve this, we leverage
the theoretical framework of factor graphs (FGs) [2] and
message-passing algorithms. More specifically, we focus on
FGs that represent factorizations of (scaled) multivariate Gaus-
sian probability distributions.
In [3] it is shown how widely-used algorithms such as
recursive least squares (RLS), linear MMSE equalization, and
Kalman filtering can be expressed with Gaussian message-
passing (GMP) on a FG. GMP algorithms have applications
in various different fields such as wireless communications [4],
optical imaging [5], Time of Arrival (ToA) estimation [6],
multiuser detection [7] or compressed sensing [8].
A FG is an undirected graph, where each node represents a
function. In GMP messages, which are represented by a mean
vector m and a covariance matrix V, or a transformed mean
vector Wm and a weight matrix W are exchanged between
nodes of a graph. The computational load is distributed among
the nodes, which compute updates of the outgoing messages
based on the incoming messages. The various types of nodes
(i.e., functions) that are used in GMP are summarized in
Fig. 1. The figure also displays the message update rules
corresponding to each type of node.
Fig. 1. FG nodes and the corresponding message update rules for GMP. [3]
To the best knowledge of the authors, there exists no
programmable processing architecture with a datapath suited
for solving GMP algorithms.
In this paper, we present a novel application specific in-
struction processor termed FGP which is able to perform
the message updates for all basic types of nodes in GMP.
Therefore, complex algorithms described by GMP can be
readily executed on our processor. The computations in the
FGP are based on a systolic array implementation of the
Faddeev algorithm, which is able to efficiently compute all
necessary matrix operations. Eventually, we demonstrate an
appropriate compiler for the FGP and compare the throughput
of the proposed implementation with a state-of-the-art DSP
for a compound node message update.
II. FROM GMP NODES TO DATAPATH ARCHITECTURE
The FGP architecture is derived from the message update
rules of the nodes in Fig. 1. Compound nodes are composed
by two simple nodes. The expressions for the compound nodes
consist of the same basic matrix operations as the ones for the
simple nodes what facilitates the development of a datapath
which is suitable for all nodes. The data dependency graph for
the message update (covariance matrix only) computation of
a compound node is shown in Fig. 2.
The outgoing message VZ is computed with the incoming
messages VX , VY and a state matrix A defined the particular
ar
X
iv
:1
40
4.
31
62
v1
  [
cs
.A
R]
  1
1 A
pr
 20
14
node. The purple boxes highlight the computations to be car-
ried out, while the white boxes contain the intermediate results.
The computations can be decomposed into three types1:
• Matrix-matrix multiplication (e.g., VXAH )
• Matrix-matrix multiplication with addition/subtraction
(e.g., VY −A(VXAH))
• Schur complement (e.g., VX − (VXAH)G−1(AVX))
Fig. 2. Data dependency graph of operations in compound node.
A suitable architecture for matrix computations of such a
kind is the systolic array [9]. For the matrix-matrix multipli-
cation (VXAH ), a two-dimensional rectangular systolic array
as shown in the blue box in Fig. 5 is sufficient. The processing
elements (PEs) need to perform multiplications and additions.
This so-called PEmult is shown in Fig. 3.
PEmult
WestxDI
OpModexSI
StateReg
NorthxDI
accum
shift
swapno-swap
EOutReg
SOutReg
zeroMsgData
}
}
Faddeev algorithm
matrix multiplication,
matrix addition
EastxDO
OpModexSO
OpModeReg
PEmultArithmetics
clear
0
1
Fig. 3. PEmult with different operation modes.
It supports different operation modes which will be ex-
plained next and contains both a real-valued multiplier and a
real-valued adder/subtractor, which allows a complex-valued
multiplication to be executed in four cycles. The adder is
utilized in only two of the four cycles and in the remaining
two cycles an additional number can be added to the final
result. Hence, the second of the above mentioned computa-
tions (VY − A(VXAH)) can be completed with the same
rectangular array. Furthermore, the flow of data through the
array and the rate at which input data must be provided
is very regular. An additional advantage of this architecture
1Hermitian transp. and negation are not considered as a separate operations.
becomes apparent when a matrix multiplication is followed by
a matrix multiplication with subtraction as in the compound
node example of Fig. 2. In this case, the result of the first
matrix multiplication is stored in the StateReg of the PEmult.
The subsequent VY −A(VXAH) operation can immediately
be started since the matrix VXAH is already available and
the matrices A and VY can be shifted through the array from
the “north” and the “west”, respectively.
The Schur complement for the third operation mode is
computed with the Faddeev algorithm. The algorithm performs
triangularization of the input matrix followed by Gaussian
elimination. [9]. It is a favorable choice because it avoids direct
matrix inversions. To this end, the rectangular systolic array
requires a triangular extension as it is shown in the yellow box
in Fig 5. A different kind of PEs capable of computing the
absolute value and a complex division as shown in Fig. 4 is
required in addition the PEmult. During Gaussian elimination,
the PEmult performs the necessary swapping of the matrix
rows for the pivoting.
PEborder
PEborderArithmetics
StateReg
StatexDP
NxDI
StatexDI
DividendxDI
DivisorxDI
DecisionxSO
DivisorIsZeroxSO
NegQuotientxDO
SwapReg
OpModeFReg OpModeFDReg
OpModexSI
DelayedOpModexSO
SwapxSO
NorthxDI
zeroMsgData
zeroMsgData EastxDO0 1
0
1
StatexDNStatexD
DividendxD
DivisorxD
QuotientxD Neg
EOutReg
SwapxS
Comp
1
0
0
1
1
0
swap
clear
=0 ?
Norm1xD
Norm2xD
1
Fig. 4. PEborder with different operation modes.
The complex-valued division is carried out using the rela-
tionship a+bic+di =
ac+bd
c2+d2 +
bc−ad
c2+d2 i by employing one sequential
divider2, two multipliers and one adder.
In order to avoid the deployment of three different systolic
arrays we propose an architecture with a single systolic
array whose processing elements support multiple modes of
operation as explained in the following section.
III. FGP ARCHITECTURE
The FGP architecture is mainly determined by the opera-
tions of the GMP nodes. The systolic array with the presented
PEs is the core part of the FGP as shown in Fig. 5. Further
more, the processor contains a program memory (PM) and
memories for the messages and the state matrix A. Storing
intermediate results during the execution of an operation is not
required due to the systolic architecture where intermediate
2The divider performs a sequential radix-2 division in 4 cycles.
Program
memory
(128 isr)
Message memory
State memory
Instruction port
Data in port
Combined Systolic array
H
H
Buffer
Transpose Select Mask
M
M
Instruction 0
Instruction 1
Instruction 2
Instruction N
FSM
Fetch & decode
instruction
H
Command
Status
control
signals
Address port
Data out port
PEB PE
Fig. 5. FGP architecture with systolic array and memories. The Select and Mask units are used to feed data from different sources to the array. Processor
operation is controlled by a finite state machine whose state transitions are triggered from exernal commands. Internal Control signals are marked by blue
arrows.
results are stored in the state of the systolic array (i.e., the
result of the matrix multiplication in accum mode, which is
used as input to the matrix multiplication in shift mode and
as input to the Faddeev algorithm).
An instruction is fetched from the PM, decoded and for-
warded to a finite state machine which generates the necessary
control signals for the PEs as well as for the Transpose-,
Select- and Mask-unit. The processor operates on a proper
instruction set tailored to GMP called FGPAssembler, com-
prising six instructions which are summarized in Table I.
TABLE I
FGP ASSEMBLER INSTRUCTIONS
Datapath Control Instruction
mma Matrix multiplication & accumulate
mms Matrix multiplication & shift
fad Faddeev algorithm
Program Control Instruction
smm Store result of array in memory
loop loop over instructions (FG sections)
prg Program (define multiple programs)
They can be subdivided into instructions which control the
program flow and instructions which control the systolic array.
Each instruction of the latter type corresponds to one compu-
tation type of Section II. The arguments of the instructions are
the addresses of the input and output messages in the memory
as well as flags for the Hermitian transpose and negation. Since
many factor graphs show a repetitive pattern (e.g., RLS) an
instruction for looping over iterations is provided. In order
to host multiple programs in the PM, the prg instruction
was introduced to indicate the start addresses of the different
programs. For example a baseband receiver might store one
program for RLS channel estimation and another one for
symbol detection/equalization.
The FGP can be controlled from an external processor via
a set of commands. Each command gets replied by a status
message. Elementary commands are load_program and
start_program which load one or multiple programs from
the instruction port into the PM or start a program from the PM
respectively. In this way the FGP can be easily attached to an
existing system as an accelerator or a co-processor. The initial
input messages need to be loaded into the message memory
via the Data in port. After program execution, the results can
be obtained from the message memory through the Data out
port.
IV. HARDWARE-SOFTWARE INTERACTION EXAMPLE FOR
CHANNEL ESTIMATION
The desired GMP algorithm is first written in a high-level
language (Matlab) and then automatically compiled to FGP
Assembler code. This procedure is exemplified for an (Linear
Minimum Mean Squared Error) LMMSE channel estimation
example using the FG in Fig. 6 and the code listings 1 and 2.
For the sake of simplicity the RLS factor graph contains only
two sections. In the channel estimation example the messages
msg_Y correspond to the received symbols.
Fig. 6. Factor graph for channel estimation with two sections.
A message update schedule (Fig. 7 left) is first derived from
the high level description in Listing 1.
Listing 1. Matlab code for the channel estimation example
1 for i = 1:length(ytilde)
2 % observation message
3 msg_Ytilde.m = ytilde(i);
4
5 msg_Y = FGP.add_b(msg_Ytilde, msg_n);
6 msg_a_out = FGP.mult_eq_f(msg_a_in, msg_Y, A);
7
8 % Update input message for next slice
9 msg_a_in = msg_a_out;
10 end
Each message has an identifier assigned, which is mapped
to a memory address later on. In a second step, the schedule
is optimized to reduce the number of identifiers and hence
the size of the message memory (Fig. 7 right). Sequentially,
for each output message, the set of identifiers assigned to
messages that are no longer needed is considered. A score is
computed for each identifier in the set and the output message
will be remapped to the identifier having the highest score.
The optimized schedule is then compiled into an assembly
language program (Listing 2). This program is compressed
using the loop instruction and converted into a binary memory
image suitable for loading into the processor.
Listing 2. FGP Assembler code for the channel estimation example
1 prg 1
2 loop 1 1
3 mma 1 1 c 0 1 e 0 0 0
4 mms 0 1 d 0 1 e 1 0 0
5 smm 1 1 d 0
6 mma 0 4 d 0 4 c 0 1 0
7 mms 1 1 d 0 4 c 1 0 0
8 fad 0 4 d 1
9 smm 0 4 d 1
add_b_0
msg_2_s1
mult_eq_f_1
msg_1_s4
add_b_2
msg_4_s1
mult_eq_f_3
msg_2_s4
msg_1_s1msg_0_s1
msg_0_s4
msg_3_s1
out_msg
in_noise
in_msg
ytilde_0ytilde_1
a_0
a_1
add_b_0
msg_1_s1
mult_eq_f_1
msg_0_s4
add_b_2
mult_eq_f_3
msg_0_s1
out_msg
msg_2_s1
in_noise
in_msg
ytilde_0
ytilde_1
a_0
a_1
Fig. 7. Compiler-generated RLS message computation graph with memory
locations. Left: Unoptimized. Right: Optimized.
V. IMPLEMENTATION RESULTS
As a proof of concept the design was synthesized for the
UMC180 CMOS technology for a state matrix size of 4x4 and
compared to a state-of-the-art DSP in terms of throughput.
As a performance metric the number of message updates on
a compound node (CN) per second was chosen because the
compound node is the GMP node with the highest computa-
tional load. The number of cycles the C66x DSP [10] would
take for execution is estimated using the DSPs fixed-point
instruction set. According to [11], 768 cycles for the inversion
of a complex 4x4 matrix are assumed. The comparison is
shown in Table II.
TABLE II
THROUGHPUT COMPARISON, FGP VS DSP
Processor FGP (this work) TI C66x [10]
CMOS technology [nm] 180 40
Max. freq. [MHz] 130 1250
cycles for compound-node msg. update 260 1076
Normalized max. throughput3 [CN/s] 2.25·106 1.16·106
The FGP occupies an area of 3.11 mm2 of which 30% are
memories, 60% systolic array and 10% datapath and control
logic. The maximal clock speed is 130 MHz. The throughput
of the C66x is computed at a clock frequency of 1.25GHz [10].
Both the C66x DSP and the FGP operate in fix point number
representation and have 64kbit of memory.
Due to the usage of the Faddeev algorithm on the systolic
architecture and the usage of a dedicated radix-2 divider, the
FGP is able to compute a compound node message update in
only 260 cycles. Computing the Schur complement using the
Faddeev algorithm is more efficient than computing the inverse
and the summands in the Schur complement separately on a
conventional DSP instruction set. This leads to a 2x higher
throughput for the FGP as can be seen in Table II.
VI. CONCLUSION
The FGP is a promising processing architecture built on the
theoretical GMP framework. It is able to run a wide range of
signal processing algorithms. The use of a highly configurable
systolic array architecture with a dedicated radix-2 divider
allows an efficient computation of message updates on a
Gaussian factor graph. The presented approach does not only
allow to use a low-complexity fixed-point systolic architecture,
it also just needs a very simple compilation procedure instead
of a complex tool-chain, since the user specifies the message
update rules in the high-level representation of the algorithm.
In a simple channel estimation example we have shown how
the FGP achieves a 2x higher throughput over a state-of-the art
DSP. However, neither the presented processor architecture nor
it’s applications are fully explored. Future work will include
optimizations on architectural level which further exploit the
characteristics of GMP, as well as the implementation and
performance comparison for other algorithms.
REFERENCES
[1] K. Fan, M. Kudlur, G. Dasika, and S. Mahlke, “Bridging the computation
gap between programmable processors and hardwired accelerators,” in
High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th
Int. Symp. on, 2009, pp. 313–322.
[2] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and
the sum-product algorithm,” Information Theory, IEEE Transactions on,
vol. 47, no. 2, pp. 498–519, 2001.
3Technology scaling to 180nm CMOS assuming tpd ∼ 1/s.
[3] H.-A. Loeliger, J. Dauwels, J. Hu, S. Korl, L. Ping, and F. R. Kschis-
chang, “The factor graph approach to model-based signal processing,”
Proceedings of the IEEE, vol. 95, no. 6, pp. 1295–1322, 2007.
[4] G. E. Kirkelund, C. N. Mancho´n, L. P. Christensen, E. Riegler, and B. H.
Fleury, “Variational message-passing for joint channel estimation and
decoding in MIMO-OFDM,” in Global Telecommunications Conference
(GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–6.
[5] Z. Jingshan, J. Dauwels, M. A. Va´zquez, and L. Waller, “Efficient
gaussian inference algorithms for phase imaging,” Proc. IEEE ICASSP
2012, pp. 25–30.
[6] H.-L. Jhi, J.-C. Chen, C.-H. Lin, and C.-T. Huang, “A factor-graph-based
TOA location estimator,” Wireless Communications, IEEE Transactions
on, vol. 11, no. 5, pp. 1764–1773, 2012.
[7] D. Guo and C.-C. Wang, “Multiuser detection of sparsely spread
CDMA,” Selected Areas in Communications, IEEE Journal on, vol. 26,
no. 3, pp. 421–431, 2008.
[8] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algo-
rithms for compressed sensing,” Proceedings of the National Academy
of Sciences, vol. 106, no. 45, pp. 18 914–18 919, 2009.
[9] F. Gaston and G. Irwin, “Systolic kalman filtering: an overview,” in
Control Theory and Applications, IEE Proceedings D, vol. 137, no. 4.
IET, 1990, pp. 235–244.
[10] R. Damodaran, T. Anderson, S. Agarwala, R. Venkatasubramanian,
M. Gill, D. Gopalakrishnan, A. Hill, A. Chachad, D. Balasubramanian,
N. Bhoria et al., “A 1.25 GHz 0.8W C66x DSP core in 40nm CMOS,”
in VLSI Design (VLSID), 2012 25th Int. Conf. on. IEEE, pp. 286–291.
[11] M. Yan, B. Feng, and T. Song, “On matrix inversion for LTE MIMO
applications using texas instruments floating point DSP,” in Signal
Processing (ICSP), 2010 IEEE 10th Int. Conf. on, 2010, pp. 633–636.
