Sampling architectures for probabilistic inference by Ko, Glenn Gihyun
c© 2017 Glenn Gihyun Ko




Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Doctoral Committee:
Professor Rob A. Rutenbar, Chair
Professor Deming Chen
Professor Naresh R. Shanbhag
Associate Professor Paris Smaragdis
Dr. Eriko Nurvitadhi, Intel Labs
Abstract
In recent years, machine learning (ML) algorithms for applications such as
computer vision, machine listening, topic modeling (i.e., extraction) from
large text data sets, etc., have proven to be effective in terms of perceived
quality. However, these ML applications tend to be compute-intensive and
create performance challenges. We focus on hardware accelerator architec-
tures for inference on probabilistic graphical models, in particular for Markov
random field (MRF) and for latent Dirichlet allocation (LDA). Our work fo-
cuses on inference via sampling methods, in particular, Markov chain Monte
Carlo (MCMC) methods. Roughly speaking, we generate samples from the
distribution of labels implied by the structure of the graphical model, and
use results computed from the samples to approximate the results we seek.
MCMC methods are extraordinarily popular in inference tasks and are widely
used, especially with very large models, but they are not commonly seen as
either “fast” or “low power” - which are the challenges we seek to address.
However, performance is not our only concern.
We focus on two applications to drive this research. First, we explore sound
source separation, which can be used to separate a human voice from back-
ground noise on mobile phones, e.g. talking on your cell phone in an airport.
The challenges involved are real-time execution and power constraints. As a
solution, we present a novel hardware-based sound source separation imple-
mentation capable of real-time streaming performance. The implementation
uses a Markov random field inference formulation of foreground/background
separation, and targets voice separation on mobile phones with two micro-
phones. We demonstrate a real-time streaming FPGA implementation run-
ning at 150 MHz with total of 207 KB RAM. Our implementation achieves
a speedup of 22X over a conventional software implementation, achieves an
SDR of 7.021 dB with 1.6 ms latency, and exhibits excellent perceived audio
quality. A virtual ASIC design shows that this architecture is quite small
ii
(less than 10 million gates), and consumes only 70 mW and appears amenable
to additional optimization for power.
The second application is an enterprise-scale application, topic modeling,
which is used to extract hidden thematic structure of large sets of documents.
Enterprise-scale clusters are usually required to run such massive tasks. We
would like to explore the potential benefits of accelerating topic models and
provide speed/power trade-offs of building hardware accelerators for them.
We took a latent Dirichlet allocation, a probabilistic topic model, and Gibbs
sampling inference implementation and profiled it to show that 96% of the
run-time is spent sampling, which would be the main focus of the acceler-
ation. We describe a parallel architecture on a FPGA that is theoretically
only bounded by memory bandwidth running at 220 MHz and where even a
single core is faster than workstation-grade CPU cores.
Lastly, we share our findings on accelerating parallel versions of the Gibbs
sampling algorithm and also look at precision requirements and potential for
huge reduction in number of bits used to perform Gibbs sampling inference
on applications such as source separation. We implement a multi-threaded
C++ and CUDA GPU implementation of chromatic Gibbs sampling which
is a parallel version of Gibbs sampling that uses a graph-coloring scheme to
construct Markov chains that can be executed in parallel. We show 1.9X
and 22.9X speedups respectively, compared to a conventional single core
version running on Intel Xeon. Furthermore, our analysis of the precision
and dynamic range of the source separation application showed that we only
required 8-bit reduced floating-point to maintain a very low decision error
rate on the Gibbs sampler. These early results suggest reduced precision
asynchronous Gibbs sampling architectures.
iii
To my parents, for their love and support.
iv
Acknowledgments
This journey has been a long one. I initially started graduate school in 2004
right out of college in a MS-PhD program of the Department of Electrical
and Computer Engineering. I left in 2006 with just a master’s degree to
experience the industry. I never imagined that after more than a decade, I
would finally be ending my journey at the University of Illinois at Urbana-
Champaign with this dissertation.
I’m deeply grateful for my parents who always have supported me and
motivated me to consider returning to school to finish what I had started.
I also am thankful for my brother, Bernie, for his crucial help when I faced
non-academic obstacles. It was through the love and support of my family
that I was able to continue this journey through the difficult times.
My advisor, Professor Rob A. Rutenbar, has been the best advisor one
could ask for. None of this would be possible without his trusting me to do
research with him. His patience and continued support have been the pillar
of my persistence. I’m sad to be leaving his guidance as I feel like I still have
so much more to learn from him. I will always be grateful to have received
his advice and guidance.
I would also like to thank the rest of the committee. Professor Naresh
Shanbhag provided me with much insightful feedback on my work through-
out the PhD program. He also served as my administrative academic advisor
in Electrical and Computer Engineering, as Professor Rutenbar did in Com-
puter Science. Professor Smaragdis was someone I could always count on
for any machine learning or audio related questions. Professor Deming Chen
and Dr. Eriko Nurvitadhi of Intel Labs have also provided me with a lot of
helpful feedback with their expertise in FPGA.
I was blessed with wonderful labmates throughout my PhD: Jungwook
Choi, Abner Guzmán-Rivera, Shang-nien Tsai, Tianqi Gao, Adel Ejjeh and
Sunwoong Kim. It was definitely a treat to be around incredibly brilliant
v
researchers like them all the time. They were my labmates but also friends
with whom I spent countless days discussing research, taking classes and
grabbing food together. I should not forget Minje Kim, an expert in source
separation, for our collaboration, theoretical insights and our discussions on
the future direction of our research.
I also want to thank my friends from ECESAC and KSA. We were able
to make lasting changes and serve the community which was the best non-
academic accomplishment of the journey. It also helped me stay sane and
was a vehicle for channeling research frustration into something useful.
Lastly, I thank all my friends in Seoul, New York and Silicon Valley, who
have encouraged me and inspired me. I’m grateful to have such friends who
I can have fun with but also learn from as you excel in all your respective
fields.
Thank you, University of Illinois at Urbana-Champaign, for all three of
my degrees and providing the most memorable time of my life.
vi
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Contribution of the Dissertation . . . . . . . . . . . . . . . . . 4
1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . 4
Chapter 2 Probabilistic Modeling, Inference and Computing Arith-
metics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Probabilistic Modeling . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Probabilistic Inference . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . 7
2.4 Algorithms for Inference . . . . . . . . . . . . . . . . . . . . . 8
2.5 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Computing Arithmetics . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 3 Architecture for Mobile Perceptual Application . . . . . . 13
3.1 Sound Source Separation . . . . . . . . . . . . . . . . . . . . . 14
3.2 BSS as Markov Random Field . . . . . . . . . . . . . . . . . . 15
3.3 Gibbs Sampling Inference and Parameter Estimation . . . . . 18
3.4 Profiling Source Separation . . . . . . . . . . . . . . . . . . . . 20
3.5 Real-Time Streaming Implementation . . . . . . . . . . . . . . 21
3.6 Gibbs Sampling Hardware Accelerator Architecture . . . . . . 28
3.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 30
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4 Architecture for a Large-Scale Cloud Application . . . . . 41
4.1 Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . 43
4.3 Collapsed Gibbs Sampling . . . . . . . . . . . . . . . . . . . . 46
4.4 Why LDA Hardware Acceleration . . . . . . . . . . . . . . . . 48
vii
4.5 Profiling a Simplistic LDA Inference Engine . . . . . . . . . . 50
4.6 FlexLDA Architecture . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.8 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . 57
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 5 Towards Parallel Gibbs Sampling . . . . . . . . . . . . . . 60
5.1 Parallel Gibbs Sampling MRF Inference . . . . . . . . . . . . 60
5.2 Multi-core CPU Implementation . . . . . . . . . . . . . . . . . 62
5.3 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 6 Reduced Precision Gibbs Sampling . . . . . . . . . . . . . 70
6.1 Numerical Profiling for a Gibbs Sampling Engine . . . . . . . 71
6.2 Bit-Reduced Floating-Point for Sampling . . . . . . . . . . . . 73
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 7 Conclusions and Future Work . . . . . . . . . . . . . . . . 75
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
viii
List of Tables
3.1 FPGA resource utilization. . . . . . . . . . . . . . . . . . . . . 33
3.2 Memory instances. . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 A comparison and a breakdown of power for different de-
signs and different types of cells. 12tr-rvt-150m is IBM
45nm process with 12 track RVT cells running at 150 MHz.
12tr-rvt-30m is IBM 45nm process with 12 track RVT cells
running at 30 MHz. 9tr-uvt-150m is IBM 45 nm process
with 9 track UVT cells running at 150 MHz. 9tr-uvt-30m
is IBM 45 nm process with 9 track UVT cells running at
150 MHz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 32nm memory instances . . . . . . . . . . . . . . . . . . . . . 36
3.5 A comparison with the related work. Tokkola’s BSS [1]
has the longest latency. FastICA [2] has improved latency
over [1] and supports higher sampling rates not shown in
this table. Note that FastICA is listed twice for different
sampling rates. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 XCVY190-2FLGC2104E FPGA . . . . . . . . . . . . . . . . . 58
4.2 Resources Utilization . . . . . . . . . . . . . . . . . . . . . . . 58
ix
List of Figures
1.1 History of AI and computing. . . . . . . . . . . . . . . . . . . 2
2.1 (a) IEEE single floating-point representation. (b) Q9.23 or
Q23 32-bit signed fixed-point representation. . . . . . . . . . . 11
3.1 A binary masking on spectrograms generated from two
(right and left) microphones and sound sources. . . . . . . . . 15
3.2 A partial representation of pairwise MRF that describes
the relationship between nine observed variables yi (shaded
nodes) from the input data and nine hidden variables xi
(white nodes) is shown. The likelihood distribution is rep-
resented by edges between each xi and yi pair while the
prior distribution is described by edges between each xi
and its neighboring nodes. . . . . . . . . . . . . . . . . . . . . 16
3.3 Sound source separation execution profile. . . . . . . . . . . . 21
3.4 Streaming source separation. . . . . . . . . . . . . . . . . . . . 22
3.5 ILD generation block. It takes real and imaginary parts of
the STFTs generated from sound mixtures captured by the
stereo microphones. Each of the arithmetic function units
is registered to allow pipelining. A constant of 10 log10(2)
is multiplied with result of log2 unit to produce log10 results. . 23
3.6 Signal-to-distortion-ratio for varying number of frames (columns)
of the MRF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Data cost generation block. All arithmetic function units
are registered for pipelining. . . . . . . . . . . . . . . . . . . . 26
3.8 The constructed MRF contains information from eight in-
put sound frames: one current and seven previous frames.
When a new sound frame enters the system, the column of
MRF corresponding to the oldest frame in the MRF is re-
tired and new frame fills the void. This enables the stream-
ing implementation with minimal hardware overhead while
taking advantage of the temporal information from previ-
ous frames for faster convergence and better accuracy. . . . . . 27
x
3.9 The architecture of the Gibbs sampling inference block and
the Gibbs sampler. (a) Caches local copy of all the vari-
ables of MRF within the block. The Gibbs pipeline creates
a list of all nodes to be sampled along with indices, data
costs, labels and edges per node. (b) Receives labels of the
neighboring pixels and also the existence of edges to its 4
neighbors. The exponential function is implemented sim-
ilarly to [3]. The pseudo random number generator is an
implementation of [4]. . . . . . . . . . . . . . . . . . . . . . . 38
3.10 Gaussian parameters update block. It takes cached ILD
values along with labels corresponding to same node of
the graph to update Gaussian parameters based on Gibbs
sampling inference results. . . . . . . . . . . . . . . . . . . . . 39
3.11 Streaming Gibbs sampling source separation architecture
showing all the intermediate buffers. . . . . . . . . . . . . . . 39
3.12 A 4.45x3.55x2.5 room with a mobile phone and five sound
sources. Source 1 is the desired speech signal, Source 2 is
the interfering speech signal, Source 3 is a blender, Source
4 is a rolling can and Source 5 is a washer. . . . . . . . . . . . 40
3.13 The ASIC layout using 45 nm 9-track UVT library. It is
evident that our design uses a lot of SRAM, shown by the
light blue blocks. The logic cells are distributed in the
gridded area in the middle . . . . . . . . . . . . . . . . . . . . 40
4.1 Probabilistic graphical model for latent Dirichlet allocation. . 46
4.2 Profiling result for GibbsLDA++ library show that more
than 96% of time is spent on Gibbs sampling. . . . . . . . . . 51
4.3 Three stages of LDA accelerator for processing one word token 52
4.4 Top level architecture for sampling word token. All the
blocks are connected through a common bus. The top three
blocks perform the three instructions for sampling a word
token by reading in cached values from the distribution
tables, the bottom four blocks. . . . . . . . . . . . . . . . . . . 53
4.5 Update table block. The left side of the block diagram com-
putes the addresses used to find the values corresponding
to the word from the distribution tables. The right side
takes the read values and either decrements or increments
them to produce new values to be written back to the dis-
tribution tables. . . . . . . . . . . . . . . . . . . . . . . . . . . 54
xi
4.6 Gibbs sampling block uses the document number, the total
number of topics, and the word, to calculate addresses for
reading in values from the distribution tables, as shown on
the left side. The addresses are updated every cycle and
values are streamed into FIFOs and fed into the sample
new topic block. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.7 Conditional distribution computation block streams the
word-topic and document-topic distribution table values in
with beta and alpha values. A tree of arithmetic operations
will be executed for a conditional probability correspond-
ing to a topic. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8 Sample new topic block takes each of the conditional prob-
abilities found for each topic and samples the new topic for
the input values entered. . . . . . . . . . . . . . . . . . . . . . 57
5.1 Chromatic Gibbs inference using Pthreads. The graph is
divided into N horizontal regions and each region is pro-
cessed with chromatic Gibbs sampler. . . . . . . . . . . . . . . 62
5.2 The execution time of the chromatic Gibbs inference on a
multi-core CPU system. For lower number of threads such
as two or three, the overhead of using multiple threads
outweighs the benefits. . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Grid, block and thread organization. Each thread pro-
cesses 4x4 subgraph with 8x8 threads making up a block. . . . 65
5.4 The labels from the neighboring nodes of the 4x4 subgraph
are copied from shared memory. . . . . . . . . . . . . . . . . . 65
5.5 Runtime for varying width of the source separation MRF.
From column sizes 8 to 64, there is little increase because
initial overhead has not been masked by parallelism as
there are not enough thread occupancies within the stream-
ing multiprocessor. . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 Runtime for blocks with horizontal orientation . . . . . . . . . 67
5.7 Runtime for blocks with vertical orientation. . . . . . . . . . . 68
6.1 (a) Histogram of input data cost. (b) Normalized his-
togram of input data cost. . . . . . . . . . . . . . . . . . . . . 72
6.2 (a) Histogram of probabilities. (b) Normalized histogram
of probabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Decision error rate for bit-reduced floating-point represen-
tations. The number of mantissa (or significand) bits from
1 to 23 and the number of exponential bits from 1 to 8 are




ASIC Application Specific Integrated Circuit
BRAM Block Random-Access Memory
BSS Blind Source Separation
CASR Cellular Automata Shift Register
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
DNN Deep Neural Networks
DRAM Dynamic Random-Access Memory
EM Expectation-Maximization
FIFO First-In First-Out
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FPU Floating-Point Unit
GPGPU General-Purpose Graphics Processing Unit
GPU Graphics Processing Unit
IFFT Inverse Fast Fourier Transform
ILD Interaural Level Difference
ISTFT Inverse Short-Time Fourier Transform
LDA Latent Dirichlet Allocation
xiii
LSA Latent Semantic Analysis
LSI Latent Semantic Indexing
LFSR Linear Feedback Shift Register
LUT Look-Up Table
MAP Maximum A Posteriori
MCMC Markov Chain Monte Carlo
ML Machine Learning
MRF Markov Random Fields
PGM Probabilistic Graphical Model
PLSA Probabilistic Latent Semantic Analysis




SRAM Static Random-Access Memory
STFT Short-Time Fourier Transform




We are living in an era where artificial intelligence (AI), or machine learning
(ML) to be more precise, is eating the world. Most often you will hear re-
searchers of various fields talk about how they applied machine learning to
solve a problem or improve their solution by orders of magnitude. This explo-
sion of machine learning has been enabled partially by the growth of econom-
ical computing power available to researchers as shown in Figure 1.1. Mul-
ticores, GPUs, and FPGAs have all provided significantly cheaper compute
resources and along with cheaper storage devices allowed many researchers
to try more complex and sophisticated algorithms that was not possible in
the past. One of the most famous examples is deep neural networks (DNNs)
and its application to image classification. While the concept of deep neural
networks has been around for decades, it was not popularized until recently
when GPU-based work by Krizhevsky and Hinton [5] won the large-scale
ImageNet competition [6] by a large margin over shallow machine learning
methods. Furthermore, more recent work has shown accuracy levels close to
or even better than that of humans [7]. Since then and recently, there has
been a tremendous amount of research done by both academia and industry,
to accelerate deep neural networks by hardware accelerators using GPUs,
FPGAs and ASICs.
As mentioned above, advancement of deep neural networks can be at-
tributed to increased accessibility of GPGPU, as the basic compute kernel is
matrix multiplication which GPUs excel in. However, there are many other
widely used and powerful machine learning algorithms, such as probabilistic
graphical models (PGMs), that do not scale well on GPUs. A PGM is a
combination of probabilistic models and graph theory, where the assump-
tions and relationships between variables can be expressed in a graph, and
is widely used in areas such as computer vision, speech recognition, commu-
nication, and bioinformatics. However, most often these graphs do not have
1
Figure 1.1: History of AI and computing.
a matrix-like structure and the inference on the them is intractable as they
describe a distribution with large number of variables as each node in a graph
can represent a variable with multiple possible values. This is why we are
interested in hardware accelerators for inference on PGMs and different archi-
tectures for different platforms. For ML tasks running on a mobile platform,
we might be concerned about real-time execution and power constraints. For
tasks running on larger, enterprise-scale platforms (e.g., on a server in a data
center), we might be mainly concerned about speed/power tradeoffs. While
there are always opportunities to exploit the parallelism of these algorithms,
other researchers have examined opportunities for conventional thread-level
software parallelism using standard multi-core or distributed hardware plat-
forms. Our interests are in hardware acceleration, where we seek to “throw
transistors” at the problem in an optimal way, to achieve a best-possible
speed or power result.
In this thesis, we focus on one key area of ML: inference on probabilistic
graphical models. Graphical models are, in fact, large graphs, composed of
nodes and edges. Nodes represent what we know about some problem, for
example, in the form of a set of discrete possible labels for every node. The
graphs also represent essential probabilistic relationships known for individ-
ual nodes (labels), and among sets of nodes, via the information carried on
the edges in the graph. These statistical relationships can be manipulated
to compute how much we “believe” each label is the proper one for each
node - we can compute distributions for the labels on each node, and we can
compute a maximum a posteriori (MAP) result for each node, which allows
us to answer questions like this: If we observe some labels for some of the
nodes, and we know the appropriate probability relationships on/among the
nodes, what is the most likely label for each of the unobserved nodes? This
computing task is a supremely important operation in ML: this is inference.
At the most abstract level, our thesis focuses on inference architectures for
2
probabilistic graphical models.
Inference can be solved by a variety of methods. For example, belief propa-
gation [8] iteratively sends messages along the edges to compute the relevant
beliefs; this has already been the subject of significant efforts in hardware
acceleration [9, 10], including work in our own group [11, 12, 13]. Graph cuts
methods reduce the inference problem to a network flow problem, which can
be exactly solved for some binary label problems, and approximately solved
for more general problems [14]. Our group also has a hardware accelerator
for graph cuts [15]. Variational methods take complex graphical models and
replace them with simpler approximate ones, that have easier probability
relationships to manage for inference. Our own work focuses on the impor-
tant class of sampling methods, in particular, Markov chain Monte Carlo
(MCMC) methods [16, 17]. Roughly speaking, we generate samples from
the distribution of labels implied by the structure of the graphical model,
and use results computed from the samples to approximate the results we
seek. MCMC methods are extraordinarily popular in inference tasks, and
are widely used - especially with very large models. But, they are not com-
monly seen as either “fast” or “low power” - which is an opportunity area
for us. Also, MCMC algorithms are mathematically sequential by nature
if you are to maintain the ergodicity. We study parallel implementations
of MCMC methods that minimize the effect on application accuracy while
boosting performance. And we also hypothesize that, because these meth-
ods rely on statistics calculated over many random samples, there is perhaps
some intrinsic robustness that can be taken advantage of when optimizing
the hardware for them.
Our thesis focuses on two different applications to drive our overall research
agenda: a mobile perceptual application in machine listening, intended to re-
side on a cell phone, and an enterprise-scale application kernel, that extracts
topic models from large text data sets. For each of these applications, we seek
to demonstrate a working hardware architecture, understand its speed and
power performance. We also share insight on parallelization options for Gibbs
sampling inference including asynchronous Gibbs sampling by discussing the
results of accelerating a graph-coloring based parallel Gibbs sampling.
3
1.1 Contribution of the Dissertation
In this dissertation, we target design challenges in several different layers of
abstraction and share our contributions in each of them.
• Algorithm Create a novel algorithm if possible, but if not, have some
algorithmic changes or improvements that could lead to relaxation of
hardware requirements or vast improvements with good accuracy trade-
offs.
• Architecture Build efficient architecture for different requirements
and specifications based on practical mobile and cloud use case. Mobile
will have stronger preference on low-power versus throughput for cloud.
How does that affect the architectures for Gibbs sampling inference?
• Number Representations Determine adequate types (floating-point
or fixed-point) of arithmetic representation and configurations for them.
These configurations include adjusting total number of bits, number of
bits for different parts such as integer, fractional, mantissa, exponents,
etc.
• Parallelism Find ways to enable parallelism in Gibbs sampling algo-
rithm in both algorithmic and architectural level. How well do parallel
architectures scale and potential implications for asynchronous Gibbs
sampling?
1.2 Organization of the Dissertation
This dissertation is organized as follows:
• Chapter 2 reviews the mathematical background for graphical models.
The basics of probability, probabilistic modeling, and probabilistic in-
ference are provided. The sampling-based inference methods are intro-
duced and examples of these sampling inferences used in probabilistic
graphical models are given.
• Chapter 3 shows a novel inference hardware accelerator created for a
mobile perceptual application. Sound source separation is performed
4
via Markov random field MAP inference using a Gibbs sampling engine.
The design considerations and architecture decisions to enable a low-
power real-time streaming separation are given as well as performance
and power results of FPGA and ASIC implementations.
• Chapter 4 provides FPGA implementation to identify a potential
enterprise-scale cloud ML sampling application. Topic models, which
are a probabilistic model used for discovering abstract “topics” in a
collection of text documents, are now being used in all disciplines and
companies. Latent Dirichlet allocation (LDA) using Gibbs sampling
is one of the most commonly used methods for topic extraction. We
explore some ideas for using LDA as an “enterprise ML kernel” to allow
us to study speed and power.
• Chapter 5 discusses ways to parallelize Gibbs sampling algorithms.
In particular, we consider chromatic Gibbs sampling, which is a graph-
coloring algorithm for creating Markov chains that can be sampled in
parallel. We test this algorithm on sound source separation to study
performance.
• Chapter 6 presents an idea of very small bit-reduced floating-point
computes to be used for Gibbs sampling inference. Error analysis on
source separation is given along with some discussion of implications.
• Chapter 7 provides the summary and considerations for future work.
5
Chapter 2
Probabilistic Modeling, Inference and
Computing Arithmetics
Before we describe the sampling architectures created, we first establish the-
oretical foundation underlying the entirety of this dissertation. Probabilistic
modeling is a very important technique in machine learning that provides
a method of imposing statistical structure for a given problem or an appli-
cation. Using probabilistic modeling, the hidden structure of the problem
can be described as a conditional distribution given the observed data. Once
the model is in place, the process of reasoning on the model, or probabilistic
inference, is required to find the desired answers. We explain how we have
decided to focus on our choice of inference method and further provide some
basic number representation formats to prepare for the discussions in the
following chapters.
2.1 Probabilistic Modeling
Probabilistic modeling applies the principles of probability theory and statis-
tics to capture statistical dependencies between uncertainties. In a proba-
bilistic model, the data are represented as observed random variables and a
joint distribution that encompass hidden random variables. The conditional
distribution of the latent variables given the observed variables, or the pos-
terior, is desired to find the hidden structure of the problem. A probabilistic
model can also be described by its graphical model or generative process.
Assuming two random variables X and Y , a joint distribution of them can
be written as
P (X, Y ). (2.1)
If the variables X and Y are independent, then the joint distribution is
6
described as
P (X, Y ) = P (X)P (Y ), (2.2)
a product of two distributions of X and Y . If the random variable Y is
the observed variable and X is the latent variable that describes the hidden
structure from Y , then the variables are no longer independent and can be
described as a product of a conditional and a marginal distribution:
P (X, Y ) = P (Y |X)P (X). (2.3)
The conditional or the likelihood, P (Y |X), describes how the observed data
Y depends on latent variable X. The marginal or the prior, P (X), represents
the latent variable hidden in the data.
2.2 Probabilistic Inference
In order to find the joint distribution, the distribution of the latent variable
is required. And since the distribution of the latent variable is unknown,
we “infer” this via the posterior, or the conditional distribution of the la-
tent variables given the observed variables. We call the process of finding
this distribution, the probabilistic inference. Bayes’ celebrated theorem [18]
describes this conditional distribution, using the likelihood and the prior,
normalized by the distribution of the observed variables and is given as
P (X|Y ) = P (Y |X)P (X)
P (Y )
, (2.4)
where the marginal, or the normalizing constant, is given by
P (Y ) =
∫
x
P (Y |X = x)P (X = x). (2.5)
2.3 Probabilistic Graphical Models
Although the rules of probability can be used to define and reason any proba-
bilistic models that involve any number of variables, it is inconvenient to write
7
down joint distributions with large number of variables. Furthermore, as the
number of variables increases, the resources required to represent and reason
on the model grow exponentially. Probabilistic graphical models (PGMs) pro-
vide a formal language for describing probability distributions that simplifies
representation and reasoning of the probabilistic models [19]. The statistical
dependencies among variables are represented graphically using nodes and
edges between them.
There are two major classes of graphical models: directed graphical mod-
els and undirected graphical models. Directed graphical models, often called
Bayesian networks, consists of nodes representing random variables and di-
rected edges describing dependencies between the variables. The undirected
graphical models, or Markov networks, use undirected edges instead of di-
rected. In this work, both undirected and directed graphical models will
be used to model useful applications and be reasoned on using probabilistic
inference.
2.4 Algorithms for Inference
Depending on the complexity of the probabilistic model, integrating out the
latent variable X from the normalizing constant can be intractable and is one
of the prime computational challenges of probabilistic inference. The nor-
malizing constant often does not have an analytical or a closed-form solution;
however, numerous approximate inference methods have been developed that
allow us to approximately solve it.
There are two major families of algorithms for approximating intractable
integrals: Markov chain Monte Carlo (MCMC) and variational methods [20].
The MCMC methods, or sampling methods, provide a numerical approxima-
tion to the exact posterior using a set of samples derived from an approximate
distribution. They have the advantage of being nonparametic and asymp-
totically exact. On the other hand, the variational methods provide a locally
optimal and exact analytical solution to an approximation of the posterior.
They have the advantage of maximizing an explicit objective and being fast
in most cases. While the variational methods are most often faster than
MCMC methods with comparable accuracy, deriving the set of equations for
updating the parameters involves significant effort. As the primary disad-
8
Algorithm 1 Gibbs sampling algorithm
1: procedure GibbsSampling(x)
2: for t = 1 to T do
3: for i = 1 to n do
4: x
(t+1)
i ∼ P (xi|x
(t+1)











vantage of using MCMC compared to variational methods is the speed, and
using variational methods requires additional work to derive the objective,
MCMC methods look more attractive for hardware acceleration.
2.5 Gibbs Sampling
Markov chain Monte Carlo (MCMC) methods are a class of algorithms for
sampling from a probability distribution from a constructed Markov chain
with the desired distribution. The key idea is to create a long sequence of
random samples, or a Markov chain, where each sample depends on the pre-
vious sample and its results [21]. From the population of samples, we can
calculate useful information about the desired distribution of the latent vari-
ables. In this work, we take the most popular MCMC method called Gibbs
sampling [22] and study the architectures for different classes of machine
learning applications for varying physical platforms.
Gibbs sampling, a variant of Markov chain Monte Carlo (MCMC) algo-
rithm, was first introduced in the context of computer vision by Geman and
Geman [22]. It is used to obtain a sequence of samples approximately derived
from a specified distribution when direct sampling is hard. Mathematically,
the key idea is that the Markov chain samples, one variable at a time, and
ultimately converges to the distribution of the desired distribution, even with
the relatively simple computational scheme.
Assume that we have a joint distribution of four random variables, W , X,
Y and Z, and that it is hard to sample directly from it. Furthermore, assume
that it is easy to sample from the conditional distributions, e.g. sampling
W from P (W |X = x, Y = y, Z = z), sampling X from P (X|W = w, Y =
9
y, Z = z), etc. The idea is to repeatedly draw random samples from these
conditional distributions one variable at a time, based on values of previous
samples. If you sample long enough, then the Gibbs sampler will eventually
converge to a stationary distribution that is P (W,X, Y, Z).
Now suppose we have a joint distribution P (x1, ..., xn) where x are random
variables and n is a large number. It is not only intractable to sample from
this distribution but also hard to find the joint distribution. In order to
sample from the distribution, we can use the Gibbs sampling to sample from
a joint distribution if the full conditional distribution for each variable is
known. For each variable, the full conditional distribution is based on all the
other variables. This is illustrated in Algorithm 1, where t is the iteration
and xi is the variable that is being sampled out of n variables. For a given
number of iterations T , we go through all variables, one at a time, and sample
a variable x
(t+1)
i , from a conditional given all the other variables.
2.6 Computing Arithmetics
When an algorithm is implemented in hardware, one must first consider what
kind of numerical representation to use. Two commonly used representations
are fixed-point and floating-point numbers. Whether to use fixed-point or
floating-point can be determined by the range and precision required by the
arithmetic operations performed by the algorithm and also the hardware cost.
The floating-point format can represent numbers on an exponential scale and
offers wide dynamic range, but the floating-point unit implementation is more
expensive in terms of power consumption and chip area. The fixed-point uses
uniform scale and offers smaller dynamic range but is relatively cheaper to
implement than floating-point.
The fixed-point representation has a fixed radix point with integer part
and fractional part. Figure 2.1b shows an example 32-bit signed fixed-point
representation, which is described as Q23 or Q9.23, using Q number format
based on the number of integer and fractional bits. A radix point is set to
separate the number into a 9-bit integer part and 23-bit fractional part. This
representation will provide a dynamic range of 2−8 to 28 and a resolution of
2−23.




Figure 2.1: (a) IEEE single floating-point representation. (b) Q9.23 or Q23
32-bit signed fixed-point representation.
tissa or fraction. The IEEE single floating-point format [23] has 1 sign bit,
8 exponent bits, and 23 fraction bits as shown in Figure 2.1a. The exponent
bits relate to the dynamic range of the number while the fraction bits decide
the precision. The 8-bit exponent provides dynamic range of 2−126 to 2127
and 23-bit fraction can show resolution of 2exp−127 ∗ 2−23 where exp stands
for value represented by the exponent field.
We note that the floating-point in the IEEE format is a particular engineer-
ing choice, aimed at solving two problems: (1) ensuring adequate numerical
precision for complicated computations, and (2) ensuring portability of soft-
ware codes from platform to platform, which can only be guaranteed if every
platform (i.e. computer) implements the same arithmetic. But for custom
hardware, and in particular for hardware accelerators, we need not worry
about either of these, and we can take the design of the floating-point for-
mat, the number of bits in the fractional mantissa, the number of bits of
exponents, the nuances of rounding and such, and implement as we like, as
long as the algorithm works. As we shall see later in chapters, designing the
arithmetic is a critical part of achieving acceptable quality while minimizing
complexity and power.
The essential challenges we face for rendering a sampling architecture in
hardware - the choice of fixed-point versus floating-point, and the choice of
how much precision/range - are complex. Of particular note for us is that
sampling architectures, at their most basic level, are adding up a great many
11
small probabilities. This means we care deeply about the range as well as
precision of our resulting arithmetic choices. Also, it is important to note
that there may be varying levels of precision and dynamic range required for
different parts of an application. Therefore, the discussion involves not only
the choice of fixed-point versus floating-point and how to configure them,
but also solutions to questions such as the minimum number of floating-
point bits vs. fixed-point bits for the design and, moreover, whether different
representations would be mixed in a single architecture.
2.7 Summary
In this chapter, we covered the mathematical background for probabilistic
modeling, inference, and graphical models, and Gibbs sampling inference as
the approximate inference method, and reviewed the basic computing arith-
metics. Probabilistic modeling can be used to describe a joint relationship
between observed and latent variables, and the process of reasoning the struc-
ture of hidden variables given the observed variables is called inference. We
described graphical models as a descriptive and a concise way of representing
probabilistic models and introduced a popular MCMC-based method, Gibbs
sampling, as the approximate inference method most interesting for hardware
acceleration. In the following chapters, we will describe how an application
is described using probabilistic graphical models and apply Gibbs sampling
to perform inference on them to solve machine learning applications. The
basics of computing arithmetics will be useful in understanding hardware
design choices and trade-offs involved with them.
12
Chapter 3
Architecture for Mobile Perceptual Application
To address the challenges of implementing Gibbs sampling inference acceler-
ators, we first focus our efforts on exploring a practical mobile architecture.
We do so by utilizing a case study of a mobile perceptual application called
sound source separation and building the Gibbs sampling inference accelera-
tor for it.
The goal of the sound source separation is to mimic the human hearing sys-
tem, which is capable of separating one source out of a mixture of sounds. In
the real audio world, humans listen to a mixture of multiple sound sources
such as multiple people talking nearby along with various noises, and the
brain separates out the sources naturally, allowing one to concentrate on a
specific person speaking, while suppressing others, which we refer to as the
cocktail party problem [24]. However, such separation remains challenging for
machines. While there have been advances in source separation via machine
learning algorithms, there exist fundamental issues such as the trade-off be-
tween usable quality and computational complexity. The implementation of
source separation in a practical mobile form raises additional challenges such
as latency and power constraints, as the machine learning algorithms take a
very long time to run and also consume significant power when executed on
a conventional mobile CPU.
To be able to use source separation in a real world application, we pictured
a use case where source separation is used on a mobile phone to separate out
the desired voice of the speaker from the interfering noise sources around it,
e.g. talking over the phone in a loud airport. This use case requires streaming
with a low real-time latency while consuming low power. Not meeting any
of the these requirements will make it impractical for real world usage. For
example, the International Telecommunication Union (ITU) describes recom-
mended mouth-to-ear delay, the acceptable delay from the speaker’s mouth
to listener’s ear, to be less than 200 ms [25]. Furthermore, a typical latency
13
of mouth-to-ear delay of an LTE network is assumed to be approximately 160
ms [26], which means a practical source separation implementation should
take less than 40 ms. It is impractical to use source separation while talking
on the phone if doing so drains the battery fast due to complex computations.
This chapter describes a novel formulation of a Markov random field (MRF)
for source separation and a real-time and low-power hardware implementa-
tion of MRF maximum a posteriori (MAP) inference via Gibbs sampling
[27, 28, 29]. Through this case study, we show how the Gibbs sampling in-
ference can be performed fast enough to enable real-time execution of source
separation while consuming power low enough to be realized on a mobile
device. We describe our custom hardware implementation of sound source
separation on a FPGA and a virtual ASIC design to satisfy these real world
constraints.
3.1 Sound Source Separation
The sound source separation problem is often called blind source separation
(BSS) when the source signals are not observed individually and no informa-
tion is available about the mixture [30]. A common BSS approach to solve an
underdetermined source separation problem is the time-frequency masking
method [31, 32]. Time-frequency masking searches masking values that de-
note corresponding sources for each point in the spectrogram assuming that
spectrograms are sparse and do not have significant overlap in energy for
each point. Assuming that we have two sound sources, the desired and the
interfering sound sources, and two microphones, we obtain two spectrograms
by taking the short-time Fourier transform (STFT) of the sound mixtures
from the microphones.
A spectrogram is simply a 2-D matrix, in which each column represents
sound heard at one point in time, and each element of the column is the
energy of the signal at a specific frequency obtained by short-time Fourier
transform (STFT). The sound captured from each of the microphones will
be a mixture of the sound sources that are attenuated proportional to their
relative distances from the sensors.
Binary masking refers to the idea of separating the sources by selecting







































Figure 3.1: A binary masking on spectrograms generated from two (right
and left) microphones and sound sources.
each spectrogram, and “assigning” them to either the desired source, “Source
0,” or the interfering source, “Source 1,” as shown in Figure 3.1. We can
reassemble each source in the time domain by taking the assigned frequency
components at each time step and performing the obvious inverse transform.
Assuming a use setting where BSS is performed on a mobile phone with
two microphones, a situation where the number of mixture signals is less
than the sources, we focus on our novel formulation of BSS that models the
problem as an MRF.
3.2 BSS as Markov Random Field
Our BSS implementation formulates an MRF describing the probabilistic re-
lationship between the set of unknown binary labels x for all of the nodes
corresponding to each time-frequency point and the sound mixtures distribu-
tion y captured from two microphones. This relationship can be described as
the problem of computing a conditional probability distribution, or a poste-
rior distribution, P (x|y), over the hidden variables x given observed variables
y. By Bayes’ rule, the posterior distribution is given by




Figure 3.2: A partial representation of pairwise MRF that describes the
relationship between nine observed variables yi (shaded nodes) from the
input data and nine hidden variables xi (white nodes) is shown. The
likelihood distribution is represented by edges between each xi and yi pair
while the prior distribution is described by edges between each xi and its
neighboring nodes.
where P (y|x) and P (x) are likelihood and prior distributions, whose product
is the equivalent to the joint distribution P (x, y). P (x) can be thought of
as a normalizing constant which can be found by marginalizing the joint
distribution by y. The probabilistic relationship between the binary labels
x and observed variables y can be described as the likelihood distribution
while the relationship between the binary labels x is represented as the prior
distribution.
Source separation MRF is formed by assigning a binary mask label x ∈
{0, 1} to each time-frequency point of the spectrograms obtained by taking
the STFT of the sound mixtures captured from the microphones. The ob-
served variables, y, comprise interaural level difference (ILD), which is the
ratio of the energy values seen at this (time, frequency) point between the
left and right microphones:





where Y Li and Y
R
i correspond to energy of frequency at time given by the left
and right channel spectrograms from each of the microphone. At an intuitive
level, the MRF is a 2-D grid, where each observed data value is the ratio of
the “loudness” of the sounds heard from each microphone, at each time, at
each frequency. We seek to segregate these “pixels” into those we believe
comprise one of the sources, and those we believe comprise the other. For
simplicity, we assume these ratios are distributed as a pair of Gaussians.
Figure 3.2 shows a partial representation of the MRF constructed, where a
set of all nodes and a set of all edges between nodes in the entire MRF G are
given by ν and ε, respectively. Given a node i ∈ ν, the relationship between
the label on the node xi and the observed ILD yi is represented as a factor
φi. The edge between node i and j, (i, j) ∈ ε, represents the distribution
between the neighboring nodes using a factor φi,j. The posterior distribution
of MRF is the product of all node and edge factors given as








where Z is a normalizing constant that is the sum of all possible values of φ.
Given the MRF, we would like to find the maximizing assignment of labels x
of (3.3) by performing maximum a posteriori (MAP) estimation. To replace
the multiplications to summations we can represent this distribution in log








where θi(xi, yi) and θi,j(xi, xj) are called data cost and smoothness cost, re-
spectively [33, 34]. Given (3.4), MAP problem can be formulated as an energy
minimization problem on G = (ν, ε), where now the objective is to find the
assignment of labels that will minimize the costs or the energy function:


















We derive a Gaussian-based data cost formulation in which the data cost





if xi = 0
(Ai−µ1)2
2σ21
if xi = 1,
(3.6)
where Ai is the ILD, the log ratio of the energy values seen at node i, a time-
frequency point, between the spectrogram of the two microphones [27]. The
means µ0 and µ1, and the variances σ0 and σ1, correspond to the mean and
variance that of each sound component of the energy ratio Gaussians. The
smoothness cost θi,j(xi, xj) models the prior preference of two neighboring
nodes, defined on edge (i, j) ∈ ε, to encode spatial locality in frequency and
time domain in the spectrogram as follows [27]:
θi,j(xi, xj) = ‖xi − xj‖2. (3.7)
The goal of the optimization problem (3.5) is to find the set of labels x among
all possible per-node label choices that minimize the objective function, which
is binary for our application.
3.3 Gibbs Sampling Inference and Parameter
Estimation
In recent years, several algorithms have been proposed to solve the above
MRF-MAP inference problem [35, 33, 36, 34]. While many of these algo-
rithms are successful in solving the problem, it is still intractable to run
these algorithms in real-time for common perceptual applications like sound
source separation or computer vision applications. Gibbs sampling and other
MCMC or sampling-based methods can be slow to converge and difficult to
diagnose the convergence. However, simplicity of implementation and the-
oretical guarantees of convergence make it attractive as an algorithm to be
accelerated.
Gibbs sampling was actually first introduced in the context of solving MRF
for compute vision application [22]. Gibbs sampling operates iteratively by
choosing a variable in each time step and updating it by sampling from
18
its conditional distribution given the other variables in the model as shown
in Algorithm 1. It is especially effective on MRF because it only needs
to observe neighboring nodes that are directly connected to the sampled
node, the Markov blanket, due to local Markov property [37], that imposes
conditional independence. This allows local computations with high spatial
locality since processing a single node requires data accesses only for the
neighboring nodes and is more efficient on hardware than accessing all nodes.
Furthermore, there are many variants of Gibbs sampling that allow more
parallelism with trade-offs and they will be discussed in Chapter 5.
In the above formulation of data costs for source separation MRF, the
Gaussian parameters for the data cost function are hidden parameters that
describe the distribution of sound sources. Without having proper parame-
ters that closely describe the actual distribution, the inference results would
not be of high quality. To find the hidden parameters, we can train the MRF
offline to learn the parameters by using a labeled data set. In fact, that is
how most mobile deep neural network implementations are done; training is
not done on mobile but offline in advance. However, if the parameters are
fixed, source separation only performs well given the specific configuration of
sound sources relative to the microphones that correspond to the energy-ratio
distribution given by the fixed parameters. This is not very practical for real
world application of source separation where sources could be moving relative
to the microphones on the mobile phone. Furthermore, user intervention to
modify the parameters is hard to impose on a real-time system.
Just as it is common for image segmentation applications to receive initial
seed or hints about the segmentation from the users [35], we can also provide
initial observations or guesses about the underlying distribution. We can then
perform the expectation-maximization (EM) [38, 39, 40, 41, 42], an iterative
optimization method, to learn the Gaussian parameters of the distribution
on the fly while performing source separation. Each iteration of EM consists
of an expectation (E) step, which find the distribution for the unobserved
variables, given the observed variables and the current estimate of the hidden
parameters, and a maximization (M) step, which re-estimates the parameters
using the distribution found in the E-step.
For source separation, we start with rough guesses of the Gaussian param-
eters for the energy-ratio distribution of the sound sources for the data cost
function of MRF, based on the geometric orientation of the microphones and
19
Algorithm 2 Creating masks for source separation via EM
1: procedure SourceSeparationMask(Ai, x)
2: for each EM iteration do
3: (E-Step) GibbsSamplingInference:
4: for t = 1 to max iteration T do
5: for each node i = 1 to I do
6: x
(t+1)
i ∼ P (xi|x
(t+1)














the sources. We assume that the desired sound source is closer to the micro-
phone. When we perform the EM algorithm, E-step finds the distribution of
labels given the estimate of the Gaussian parameters by performing Gibbs
sampling inference on the MRF and M-step updates the parameters to a
more accurate estimation using the distribution of labels found as described
in Algorithm 2. By performing M-step the data cost is updated and E-step
will be performed on the updated MRF resulting from it. The EM iterations
are repeated until convergence and the inference results on the final MRF
represent the MAP solution to the source separation problem, in the form of
a binary mask that can be used to separate the sources. This is especially
useful in scenarios where the sources are not stationary with respect to the
microphones.
3.4 Profiling Source Separation
To run this application in real-time, we must find ways to reduce the run-time
of this algorithm. First of all, we profiled the sound source separation appli-
cation using gprof [43] to see where most of the time is being spent to execute
it. Figure 3.3 shows the profiling results of running our prototype on a single
core of 2GHz Intel Xeon X6550. From the pie chart, the ieee754 exp sse2
(50.4%) is the SIMD instructions for transcendental functions; these instruc-














Figure 3.3: Sound source separation execution profile.
at the node each time a sample is to be obtained. The optimizeAlg (26.6%)
is the function that runs the Gibbs sampling inference on the MRF to find
the binary mask. Together, they account for (77%) of the total runtime.
3.5 Real-Time Streaming Implementation
In this section, we present a real-time (low-latency) streaming implemen-
tation of the proposed MRF source separation. The high-level view of the
implementation, from input to the output sound mixtures, is shown in Fig-
ure 3.4, with all steps of a streaming system. We will discuss each of steps
in detail, providing explanation for design choices made with the goal of
achieving a real-time and low-power streaming source separation on a mobile
phone. The proposed implementation was designed to satisfy the following
real world design constraints:
• The design should be able to handle a constant stream of audio input.
In order to have any practical use such as suppressing noise while talk-
ing on the phone in a noisy environment, source separation should be
implemented in a streaming fashion.
• The design should maintain the mouth-to-ear latency below 40 ms. The




























Figure 3.4: Streaming source separation.
ing mouth-to-ear delay latency indicates that most users are “very sat-
isfied” as long as latency does not exceed 200 ms. However, a typical
latency or mouth-to-ear delay of LTE network is assumed to be ap-
proximately 160 ms [26], which means latency should be smaller than
40 ms to achieve “very satisfied” quality of service.
• The design should not consume more than a watt and desirably less
than 100 mW. Using source separation in regular voice talk should
permit several hours of usage on a typical mobile phone.
3.5.1 Spectrogram generation
In the first step, the pair of audio input streams is converted to a pair of
spectrograms by performing STFT. We assume that the input signals are
sampled at 16,000 Hz, which is highest of the sampling rate supported by
a typical mobile phone. The higher sampling rate results in more temporal
resolution which can be useful for achieving better separation of the mixtures.
Furthermore, size of the FFT determines the frequency resolution of the
resulting spectrogram. We have decided to use 1024-point FFT with 50%
overlap and a cosine window. One FFT on 1024 samples will correspond to
one frame or a single column on the constructed MRF with height 513. The
sampling rate, FFT size and the amount of overlap determine the temporal
resolution of the resulting MRF.
22
Figure 3.5: ILD generation block. It takes real and imaginary parts of the
STFTs generated from sound mixtures captured by the stereo microphones.
Each of the arithmetic function units is registered to allow pipelining. A
constant of 10 log10(2) is multiplied with result of log2 unit to produce log10
results.
3.5.2 ILD generation and MRF construction
The ILD matrix used for computing the data cost function of MRF is gen-
erated by taking the log ratio of energy corresponding to each of the time-
frequency points in the spectrogram. Figure 3.5 shows the ILD generation
architecture for a single time-frequency point in the spectrogram. STFT
results stored in intermediate buffers are streamed into the ILD generation
block that is pipelined with each stage that consists of Q16.16 fixed-point
elementary arithmetic units. The log function used to calculate ILD is imple-
mented in Q16.16 fixed-point based on methods that use lookup table-based
Taylor approximation [44, 45], to minimize resources and latency.
When a single frame is processed, it will result in a single column of a
spectrogram and the ILD matrix. In order to take advantage of the spa-
tial locality of the sources in both the frequency and time domain, a larger
MRF is preferred to allow more prior information to be included. Note that
a single column of the MRF does not contain much information about the
approximated Gaussian mixtures. Although it will have a smoothness prior
in the vertical (frequency) direction, it will not have horizontal (temporal)
23
prior information. In order to circumvent this problem, we construct an MRF
not only using the current frame of audio being processed but also several
previous frames to take advantage of smoothness in the temporal direction.
The larger the MRF, the better the separation of sources. However, a larger
MRF means more memory and more computation. We must determine the
optimal number of frames to keep while considering memory, latency, power
trade-offs. Once a single column of ILD is generated, the resulting data costs
are added to a fixed-sized MRF, which then retires the data cost correspond-
ing to the oldest frame. If we denote one Gibbs sampling operation as the
atomic operation, the total number of Gibbs sampling operation (GSops) for
estimation and inference is given by
GSops = 513 ∗ (#ofFrames) ∗ (#ofGibbsSamplingIterations)
∗ (#ofEMIterations),
(3.8)
which increases linearly with the number of frames used to construct MRF.
Furthermore, the size of the buffers required to hold MRF variables will
be proportional to the number of frames per MRF. In order to determine a
practical size for a hardware implementation, a set of experiments were made
with varying number of frames, Gibbs sampling iterations and EM iterations.
In order to find the appropriate number of frames, or number of columns
in MRF, we must verify that source separation’s quality is acceptable for a
given size. The qualitative performance of source separation can be measured
by signal-to-distortion ratio (SDR), which is widely used as an objective
separation quality evaluation metric [46]. SDR is found by comparing the
original spectrogram with the masked spectrograms corresponding to the
separated signals. With the number of EM and Gibbs sampling iterations set
to a high number to prevent underfitting, we vary the number of frames and
measure SDR. The results plotted in Figure 3.6 show that source separation
does not work well with five frames or less and gives usable quality (5 dB) or
higher for seven frames and up. For our implementation, we decided to use
eight frames to minimize memory size and also to reduce the runtime while
achieving good SDR.
Two additional experiments were performed while fixing the MRF size to
eight frames. If the Gibbs sampling iterations were set to a high number
to avoid not converging, even 1 iteration of EM showed results that are 5
24
Figure 3.6: Signal-to-distortion-ratio for varying number of frames
(columns) of the MRF.
dB or higher. On the other hand, if the EM iterations were set to a high
number while sweeping through Gibbs sampling iterations, we observed that
for three or fewer iterations, SDR values were poor. Although fixing the
number Gibbs sampling iterations high showed good SDR results, this does
not mean it will perform well when the distribution is changing rapidly, or if
the sound sources are moving fast. Without having some EM iterations to
account for these shifts in Gaussian parameters, the quality of the separation
will be served. Based on these observations, we have set the number of EM
and Gibbs sampling iterations to four each.
Initially, Gaussian parameters for calculating the data cost of the MRF
are approximated based on the location and the orientation of the sources
relative to a pair of microphones on a mobile phone. These are updated
using EM where E-step is the Gibbs sampling inference and the M-step is
updating the Gaussian parameters from the resulting labels from the E-step
and modifying the data cost of the MRF.
The data cost generation block is shown in Figure 3.7. Every cycle, it
takes an ILD value from a single time-frequency point in spectrogram along
with current Gaussian parameters to calculate a new data cost for it. While
the figure looks very simple, it is a relatively large block with a divider. Two
identical data cost generation blocks are used for each of the binary label
values. The registers between each stages were omitted in the figure.
25
Figure 3.7: Data cost generation block. All arithmetic function units are
registered for pipelining.
3.5.3 Gibbs sampling MAP inference
For the E-step of the EM parameter estimation, we perform Gibbs sampling
on MRF constructed with previous Gaussian parameters. Once one column
of the data cost is generated from corresponding input frame, it is passed
into the Gibbs sampling inference block, which locally caches the data costs
and the label assignment for all the nodes in the entire MRF, which includes
previous frames. The data cost pertaining to the oldest frame on the memory
is replaced with the data cost for the current frame. Figure 3.8 shows how
MRF is constructed with one current frame and seven previous frames. As
new frames enter the system, the oldest time frame is retired and the new
frame takes place in the MRF. This method allows streaming implementation
while taking advantage of the label values inferred from previous frames for
faster convergence and better accuracy.
Using the MRF variables cached locally, a Gibbs sampling pipeline takes
one column at a time, following the order in time of the frame, and sequen-
tially sample all the nodes in one column. Gibbs sampling pipeline is designed
to take up to 1024 nodes at a time, and process them sequentially. The in-
put to the pipeline is a series of nodes with corresponding node index, label,
labels for the neighboring nodes and data costs. The output is the updated
labels for each of the nodes with a corresponding node index. Depending
on the variant of Gibbs sampling used, the number of Gibbs sampler within
the pipeline could be increased to allow multiple chains or support differ-
ent schemes. For our implementation, a single processing element provided
enough performance while minimizing the latency and power consumption.
Once the Gibbs sampling pipeline goes through all columns, it is considered
to have completed an iteration of Gibbs sampling. Our implementation uses
26
Figure 3.8: The constructed MRF contains information from eight input
sound frames: one current and seven previous frames. When a new sound
frame enters the system, the column of MRF corresponding to the oldest
frame in the MRF is retired and new frame fills the void. This enables the
streaming implementation with minimal hardware overhead while taking
advantage of the temporal information from previous frames for faster
convergence and better accuracy.
a fixed number of iterations, predetermined through the study mentioned
above. The fixed number of iterations simplifies control logic and is more
practical for the streaming implementation. As mentioned above, the total
number of Gibbs sampling operations is linearly proportional to the number
of frames used to construct the MRF. Figure 3.9 shows the architecture of
the Gibbs sampling inference block and the Gibbs sampler which samples
one node given the data cost and labels of the neighboring nodes. The expo-
nential function in the Gibbs sampler is implemented in Q16.16 fixed-point,
which is a close modification of a method that uses a table-based Taylor
approximation [3], with the 256 element look-up table (LUT). The pseudo
random number generator in the Gibbs sampler is implemented in 32-bit
27
fixed-point using 43-bit linear feedback shift register (LFSR) and 37-bit cel-
lular automata shift register (CASR) [4]. Both of the functions were imple-
mented after precision analysis using a C++ software reference for source
separation.
3.5.4 Masking, updating Gaussians and output reconstruction
Once the Gibbs sampling inference is completed, resulting labels for the MRF
are passed onto the Gaussian update block to find new Gaussian parameters.
This step corresponds to the M-step in EM algorithm. The updated Gaussian
parameters are used to modify data costs for the MRF and the next iteration
is performed and so forth. Once the number of EM iterations is reached,
the labels are passed to the masking block, where the FFT results for the
current frame are masked and passed onto the ISTFT block. The ISTFT
block maintains the previous frame data which can be used with current
frame data to perform IFFT and reconstruct the output separated sources.
Figure 3.10 shows the architecture of the Gaussian update block. As
shown, ILD value along with binary labels and corresponding indices are
used to generate new mean and variance. It is an iterative pipelined archi-
tecture with four dividers. Once all the ILD values and binary values for the
MRF are iterated, new values will be passed onto the data cost generation
block to update the MRF. This completes the EM iteration.
3.6 Gibbs Sampling Hardware Accelerator
Architecture
Figure 3.11 is the top-level architecture of our proposed streaming hardware.
One key feature to note is that the data flows from the left (input) side of
the diagram to the right (output) side, in the dataflow computing fashion
[47, 48, 49]. The arrows indicate the direction of the data movement. There
are two iterative loops in this architecture. One is the Gibbs sampling loop
which is within the Gibbs sampling inference core, and the other is shown
by arrows traveling left from label FIFO into Gaussian update block, which
realizes the EM loop for updating Gaussian parameters. The use of iterative
loops requires numerous buffers or FIFO as shown.
28
The description for each of the blocks are given as follows:
• Input FIFO : Input FIFOs buffers 512 input audio samples from two
sound mixtures and works with memory interface.
• STFT : There are two STFT blocks for each of the sound mixtures.
An STFT block utilizes 1024-point FFT blocks to perform STFT. The
block takes 512 new samples at a time and buffers 512 previous samples
to contain 1024 samples required for 1024-point FFT with 50% overlap.
• FFT Real and Imag FIFO : These blocks buffer real and imaginary
parts of the FFT results. These are used for ILD generation and also
are kept until the end of the inference process to be masked and used
for reconstruction or the ISTFT.
• ILD Generation : ILD is generated using STFT results.
• ILD FIFO : ILD is stored in a FIFO to be used to generate data costs
and also to update the Gaussian variables after each iteration of Gibbs
sampling inference, between EM iterations.
• Data Cost Generation : This block takes ILD and Gaussian vari-
ables to compute data cost for the MRF.
• Data Cost FIFO : Data cost is stored in a FIFO to be accessed mul-
tiple times during the Gibbs sampling iteration by the Gibbs sampling
inference core.
• Gaussian Update : This block stores and updates Gaussian param-
eters to be used to generate data costs of the MRF.
• Gibbs Sampling Inference Core : Gibbs sampling inference core
iteratively performs Gibbs sampling, pulling data cost from the Data
Cost FIFOs and also reads and updates labels in the Label FIFO.
• Label FIFO : It is a buffer for storing labels for the MRF that is
being inferred and is updated on every Gibbs sampling iteration, which
consists of sampling for the whole MRF. Labels are also used to mask
stored FFT results to produce masked results.
29
• Masked Real and Image FIFO : After each EM iteration, FFT
results from FFT real and imag FIFOs are masked using labels from
label FIFO to create masked inputs for ISTFT blocks.
• ISTFT : There are two ISTFT blocks corresponding to each of the
sound mixtures. They use masked FFT results to perform 1024-point
IFFT. Half (512) of the IFFT results from previous input audio samples
are buffered for overlap addition.
• Output FIFO : These blocks buffer source-separated output sound
mixtures and provide memory interface.
Our system uses a 32-bit fixed point arithmetic with 16 fractional bits
(Q16) for the most parts of the design. The number of fraction bits required
to convert the floating point arithmetic of the source separation algorithm
was determined by checking dynamic range and adequate amount of preci-
sion required to produce correct values within the system. The number of
fractional bits is increased in certain parts of the design such as the Gibbs
sampling inference core where we deal with probabilities, very small num-
bers. We extend the fractional bits to 32 bits to provide more precision.
Another exception to this fixed point number format is input and output
stream values that are fractions with magnitude is less than 1. To maximize
precision, input to STFT and ISTFT to output uses 31 fractional bits (Q31),
with a sign bit. The iterative nature of the algorithm requires its intermedi-
ate results to be stored in buffers resulting in numerous FIFOs as shown in
Figure 3.11.
3.7 Experimental Evaluation
To evaluate our design, we have several different implementations of the
algorithm, each designed to evaluate a different aspect of this design. These
are:
• Experimental audio setup: We provide the audio setup for our sound
source separation implementation.
• Software reference architecture: We implement the design straightfor-
wardly in C using standard IEEE floating point, and benchmark on
30
both desktop and simulated mobile phone processors; this allows us to
study both speed and power of a straightforward software-only imple-
mentation.
• FPGA hardware implementation: We implement the design on a stan-
dard Xilinx Vertex5 FPGA platform, to validate that functional cor-
rectness of our hardware architecture, and the ability to run “fast
enough” at a reasonable clock rate. This implementation is in Ver-
ilog.
• ASIC prototype design study: Using the Verilog from the FPGA, we
take the core streaming compute engine components of the design and
synthesize in standard 45nm process, which allows us to estimate both
the clock frequency at which the design needs to run, and the power.
We describe each of these design studies in the following sections.
3.7.1 Experimental audio setup
To test our implementation, we chose two speech signals, one female and one
male, from the TIMIT corpus [50] as the input signals. These speech signals
are sampled at 16,000 Hz and encoded with 16-bit PCM. For convenience,
we model the audio input as a pair of microphones separated by 15 cm, a
distance equal to the length of a typical mobile phone. We created two con-
volutive mixtures with simulated room reverberations corresponding to each
microphones by using the Roomsim Matlab toolbox [51]. A comprehensive
mixture of a desired source (a desired speech signal) and five interferences (a
interfering speech, a blender, a rolling can, a washer and microphone noise
signals) in the cellphone environment from [27] is used for the experiments.
The dimensions of the room and the location of the sound sources and the
microphones on the cellphone are shown in Figure 3.12 where the units are
in meters. The qualitative performance of the implementation is measured
by signal-to-distortion ratio (SDR) [46].
31
3.7.2 Software reference architecture
We have implemented a software prototype of source separation using MRF
via Gibbs sampling coded in C++. Prior to considering the mobile use-
case, we benchmarked the prototype running on an Intel Xeon X6550 2GHz.
This high performance CPU took 31.695 ms to run sound source separation
on 32 ms of input audio, which is the smallest granularity given 16 KHz
sampling rate and 1024-point STFT. This essentially means that at least
64 ms of latency will be required. The International Telecommunication
Union’s recommendation [25] regarding mouth-to-ear delay latency indicates
that most users are “very satisfied” as long as latency does not exceed 200 ms.
However, a typical latency or mouth-to-ear delay of LTE network is assumed
to be approximately 160 ms [26], and delay of source separated voice on LTE
network easily exceeds 200 ms. This unacceptable level of latency justifies our
interest in custom hardware. Additionally, we need to understand the power
consumption, using a more realistic mobile cpu. We benchmarked code on
a simulated ARM Cortex-A9 CPU using a cycle-accurate simulator GEM5
[52] and power, area modeling tool McPAT [53]. It took 23.32 simulation
seconds to run a 4 second batch of audio on a ARM Cortex-A9 model. Using
the the CPU statistics produced from GEM, we use McPAT to estimate
the power of our software reference. The peak power consumption on ARM
Cortex-A9 is estimated to be 3.661 W which is unacceptable for prolonged
usage on a mobile phone. Assuming that most mobile phones have roughly
2,000 mAh and nominal voltage of 3.8 V, the energy is 7.6 Wh. This barely
allows the phone to run the application for two hours even neglecting other
consumptions.
3.7.3 FPGA hardware implementation
We implemented a fully synthesized Verilog version of our design running
on a Convey HC-1 hybrid-FPGA platform [54, 55]. The platform consists of
Intel Xeon 5138 2.13 GHz dual-core host processor, and four Xilinx Virtex 5
(V5LX330) FPGA coprocessors running at 150 MHz each. The platform only
runs at 150 MHz for reasons of constraints on the memory subsystem; how-
ever, this is not necessarily a design constraint on the final ASIC design. The
system has a single cache-coherent shared virtual memory which allows easier
32
Table 3.1: FPGA resource utilization.
Resource Utilization
Slice Register 101302 / 207360 (48%)
Slice LUT 90280 / 207360 (43%)
Slice LUT FF 119300 / 207360 (57%)
BRAM 113/288 (39%)
DSP 36 / 192 (18%)
Table 3.2: Memory instances.







Total Size 207 kB
data transfers and communication between the host and the co-processors.
Our design was able to fit on one Virtex 5 with resource utilization shown in
Table 3.1.
The iterative nature of the algorithm requires its intermediate results to
be stored in buffers resulting in numerous FIFOs as shown in Figure 3.11.
The size and number of memory instances have been listed in Table 3.2, with
the total of 207 kB used. The hardware is fully parameterized can be set to
allow expansion to other larger inference applications.
There are several original fixed-point implementation of core mathematical
functions used in our hardware. The exponential unit used in the Gibbs
sampling inference core is implemented in fixed point, based on the work
of a parameterized floating-point exponential function for FPGAs [3]. The
Gibbs sampling inference core also contains a fixed-point pseudo random
number generator based on LFSR [4]. The log function used to calculate
ILD is implemented in fixed-pointed based on [44] and [45].
The pipeline structure containing the Gibbs sampling computation node
is a module within the Gibbs sampling inference block, which contains the
buffer for storing the data cost of the current EM iteration. The size of
the data cost buffer is 513 frequency points by temporal (timed samples)
33
width of the MRF, which is adjustable by a parameter. As discussed earlier,
a larger MRF will produce better qualitative results with the sacrifice of
the run-time. By experimenting with multiple widths with varying EM and
sampling iterations and observing the resulting SDR values on our test case,
we were able to decide on the EM and Gibbs sampling iterations of four
each and the MRF block width of eight time points on spectrogram, which
corresponds to 256 ms of audio stream. Each MRF constructed will contain
information from input audio up to 256 ms in the past. A correlation to
input data previous to that is preserved by the labels for the MRF block
that are buffered and updated each iteration as well. The number of pipeline
modules can easily be selected as well.
For our source separation implementation, the goal was to meet the mouth-
to-ear delay recommendation of 200 ms of the International Telecommunica-
tion Union (ITU) [25] as we envisioned its application on a cellphone. With
the FPGA running at 150 MHz, we were able to achieve 1.601 ms latency
from the input to the output stream. This is one 512 sample block of in-
put which corresponds to 32 ms of audio. Even with 160 ms LTE network
delay, we are able to limit the mouth-to-ear delay under 200 ms. The result-
ing implementation requires 64 bits of memory data width and 1.2 GB/s of
bandwidth.
3.7.4 Virtual ASIC design study
While the FPGA results show that a real-time source separation implementa-
tion is feasible on the FPGA, it is definitely not practical for a real appliance
due to power restrictions. We implemented several designs using the IBM 45
nm process to synthesize and place and route our design and find the gate
count, area and power estimates of different virtual ASIC implementations.
We started our design by using 150 MHz to match that of the FPGA
implementation. As Table 3.3 shows, the power consumption is relatively
high with both 12-track RVT cell and 9-track UVT cell designs consuming
more than 400 mW. Because our design had really small latency of 1.6 ms,
we can definitely use a much lower operating frequency to reduce power.
When the operating frequency is reduced to 30 MHz, which still meets the
previously defined latency requirements, we see that the power consumption
34
Table 3.3: A comparison and a breakdown of power for different designs
and different types of cells. 12tr-rvt-150m is IBM 45nm process with 12
track RVT cells running at 150 MHz. 12tr-rvt-30m is IBM 45nm process
with 12 track RVT cells running at 30 MHz. 9tr-uvt-150m is IBM 45 nm
process with 9 track UVT cells running at 150 MHz. 9tr-uvt-30m is IBM 45
nm process with 9 track UVT cells running at 150 MHz.
12tr-rvt-150m 12tr-rvt-30m 9tr-uvt-150m 9tr-uvt-30m
Memory 404.517 79.130 403.836 78.965
Register 105.565 63.091 56.852 15.811
Sequential 0.050 0.049 0.005 0.005
Combinational 46.487 46.349 8.640 5.097
Total 556.620 mW 188.619 mW 469.332 mW 99.879 mW
Area 6731304 6735203 6554452 6548631
drops below 200 mW. For 9-track UVT design, it consumes less than 100
mW, which was one of our original design constraints.
In general, designs using 12-track RVT cells resulted in much higher power
numbers. This is due to the size of the cell being larger and also the lower
threshold voltage. The UVT cells have a higher threshold voltage which
means they are slower but consume less power than RVT cells. While the
difference of using 9-track UVT cells over 12-track RVT cells is not that great
for designs running at 150 MHz, it is very noticeable for designs running at
30 MHz, with 9-track UVT design consuming almost half as much power as
the 12-track RVT design. Furthermore, we observed that SRAM memory
power accounts for more than 50% of power for all but 12-track RVT design,
which is still more than 40% of the power consumption. For 9-track UVT
design, memory consumes almost 80% of the entire power. One thing to note
here is that the SRAM cells were only available in UVT cells, which explains
why designs with same operating frequency results in similar memory power
consumption. This appears to be a side-effect of our current technology
library, which is not aimed at low-power mobile uses. When we resynthesized
our design using a lower frequency of 20 MHz, we achieved a significantly
lower power of 69.977 mW. Our virtual ASIC prototype using 9-track UVT
cells, whose layout is shown in Figure 3.13, resulted in gate count of 9,126,222,
area of 6.6 mm2. The overall design is attractively small, under 10 million
gates. When running at 150 MHz and 20 MHz, the power consumption is
8X and 52X better than the ARM reference estimate of 3.7 W.
35
Table 3.4: 32nm memory instances





Total Size 157 kB
3.7.5 Further optimizations and related work
While the 45nm implementation could achieve a low power of 70 mW, it was
evident that most of the power resulted from its use of many SRAMs and
using an outdated and non-low-power process technology. In order to further
optimize power, some architectural modifications were made to our design
to allow the buffers holding the MRF variables within the Gibbs sampling
inference block to be reduced to a smaller size, without any sacrifices on
the inference results. Furthermore, Synopsys 32/28 nm Generic library was
used to synthesize our updated design. Both of these modifications indicate
that our design is still amenable to power optimizations. This resulted in
25% less memory for the design as shown in Table 3.4. Note that the list
of memory instances is vastly different because most of the large memories
were replaced with smaller memories as the Synopsys 32 nm library only
supported small SRAMs. These two modifications reduced the power of our
ASIC implementation down to 40.034 mW.
While we are the first to implement a real-time streaming MRF source sep-
aration, there are previous work using ICA to perform BSS. Table 3.5 shows
different implementations compared to our implementation. It is important
to note that while some of them claim to be real-time, they are not real-time
according to our mobile use case constraints. One additional thing to note is
that our sound source separation using MRF has the ability to incorporate
prior information like the smoothness of the ILD of sound mixtures, which
allows us to solve underdetermined mixing problem while ICA only works
well for overdetermined cases, where you have at least as many microphones
as sources.
36
Table 3.5: A comparison with the related work. Tokkola’s BSS [1] has the
longest latency. FastICA [2] has improved latency over [1] and supports
higher sampling rates not shown in this table. Note that FastICA is listed
twice for different sampling rates.
Name Torkkola’s BSS FastICA MRF BSS
Algorithm ICA ICA MRF
Platform FPGA FPGA FPGA, ASIC
Sampling rate 8000 Hz 8000 Hz 16000 Hz 16000 Hz
Latency 620 ms 125 ms 62.5 ms 1.607 ms
Total 71.2 MHz 50 MHz 50 MHz 150 MHz
3.8 Summary
In this chapter, we presented a novel real-time and low-power streaming
architecture for sound source separation using MRF. Our implementation is
capable of learning hidden parameters on the fly using EM, while performing
MAP inference to find the best binary mask label assignment. An FPGA
implementation runs at 150 MHz with 207 kB RAM while achieving SDR
up to 7.021 dB with only 1.6 ms latency, a 22X speed-up; it confirms real-
time streaming feasibility. A 45 nm virtual ASIC design running at 20 MHz
requires fewer than 10 million gates while consuming 70 mW, which is 52X
better than ARM Cortex-A9 software reference design. By using Synopsys
32/28 nm Generic library, we were able to achieve 40 mW while running at
150 MHz. This result indicates that we should be able to get more power
reductions if we use the latest commercial technology node. Overall, we
believe that our design shows that our MRF Gibbs sampling architecture is
viable for real-time streaming source separation implementation for mobile
form factor.
37
(a) Gibbs sampling inference block
(b) Gibbs sampler
Figure 3.9: The architecture of the Gibbs sampling inference block and the
Gibbs sampler. (a) Caches local copy of all the variables of MRF within the
block. The Gibbs pipeline creates a list of all nodes to be sampled along
with indices, data costs, labels and edges per node. (b) Receives labels of
the neighboring pixels and also the existence of edges to its 4 neighbors.
The exponential function is implemented similarly to [3]. The pseudo
random number generator is an implementation of [4].
38
Figure 3.10: Gaussian parameters update block. It takes cached ILD values
along with labels corresponding to same node of the graph to update























































Figure 3.11: Streaming Gibbs sampling source separation architecture
showing all the intermediate buffers.
39
Figure 3.12: A 4.45x3.55x2.5 room with a mobile phone and five sound
sources. Source 1 is the desired speech signal, Source 2 is the interfering
speech signal, Source 3 is a blender, Source 4 is a rolling can and Source 5
is a washer.
Figure 3.13: The ASIC layout using 45 nm 9-track UVT library. It is
evident that our design uses a lot of SRAM, shown by the light blue blocks.
The logic cells are distributed in the gridded area in the middle
40
Chapter 4
Architecture for a Large-Scale Cloud
Application
In this chapter, we shift gears and discuss our work on a useful large-scale
cloud application in natural language processing called topic modeling. We
use topic modeling as an example to identify and tackle the challenges in
creating Gibbs sampling inference accelerators to be deployed on a cloud.
Topic modeling is a useful tool to help us organize, understand and summa-
rize large amount of collection of data. We take one of the most commonly
used topic models called latent Dirichlet allocation (LDA), and describe the
architecture for accelerating the Gibbs sampling inference for it.
Most often the challenges and requirements involved with designing an
accelerator for cloud are different from those for mobile. The single most
important metric for large-scale cloud applications is throughput. Larger
throughput may reduce the number of nodes in the cloud required to handle a
specific workload or allow more complex models to be deployed. Furthermore,
while the latency of a cloud application may still be important, there is less
emphasis on achieving low-power compared to mobile use cases.
4.1 Topic Models
Topic model is a statistical model that can be used to discover “topics”
that occur in a collection of documents [56]. It is frequently used as an
unsupervised text-mining tool for discovery of hidden semantic structures of
text. Topic modeling is used for automatically organizing, understanding,
categorizing data. For example, it can be used on an archive of newspaper
articles to find topics for each of the articles and recommend articles that are
of similar “topics.” The concept of topic models has been applied to other
areas such as genetic data, images and social networks, to find the hidden
structures of the data.
41
The origin of topic models can be traced back to Latent Semantic Analysis
(LSA) [57], which is sometimes called Latent Semantic Indexing (LSI) in the
context of information retrieval. While it has been used as the basis for the
development of topic models, it is not a probabilistic model. The general
idea is to create a mapping of documents and terms to a low-dimensional
representation such that it reflects semantic association. The similarity of
documents can be computed based on the inner product in the latent seman-
tic space. LSI uses singular value decomposition (SVD) to do so and achieves
significant compression in large corpus. [58] used a generative probabilistic
model of the corpora to study LSI to recover aspects of the generative model
from the data.
For generative probabilistic models, it is more natural to use Bayesian
methods to fit the model to data instead of using LSA. As a result, Prob-
abilistic Latent Semantic Analysis (PLSA) evolved from LSA [59, 60, 61].
The PLSA (or PLSI) models each word of a document as a sample from a
mixture model whose components are multinomial random variables. These
multinomial random variables can be viewed as representation of “topics”
from which each word is generated from. Each document contains words
that may be generated from different topics and represented as a probability
distribution on a set of topics.
However, PLSA does not provide a probabilistic model at the level of
documents. It does not have a generative probabilistic model for the numbers
that represent the mixing proportion for topics. This results in linear growth
of parameters in the model with the growth of the corpus and also poses a
problem for documents outside of the training set. Latent Dirichlet allocation
(LDA) [62] extends the PLSA method by defining a complete generative
process that adds assumptions on how the mixture distribution is generated.
While the topic models have emerged from the text analysis community
for topic discovery in a corpus of documents, researchers in other fields have
introduced this approach to their respective fields to analyze large amounts
of data. Some of these areas include collaborative filters, image classification
and bioinformatics [63, 64, 65].
Topic models often deal with vast amounts of data and the models tend to
be very large compared to perceptual applications that are often bounded by
the rate at which the input data is streamed into the system. For example,
when processing a corpus of documents you may be dealing with an archive
42
of millions of documents each with thousands of words. This is why topic
modeling often requires processing on the cloud.
4.2 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) [62] is the most popular topic model, con-
ceived as a generative statistical graphical model that views documents as
a mixture of topics and assumes that the words are generated from these
topics based on their distribution. Given a collection of documents, LDA is
used to learn the model and perform inference on documents to determine
the mixture of topics for each documents. We focus specifically on the basic
LDA model and the MCMC-based inference method to be used to interpret
it.
While LDA is used in fields other than text analysis, we will use the ter-
minology of text collections, such as “corpus,” “documents,” “topics,” and
“words” to describe the algorithm. Formally, a word is a discrete value that
is defined to be an item from the vocabulary indexed by 1,...,V. A document
is a sequence of N words denoted by w = (w1, ..., wN). A corpus is a collec-
tion of M documents denoted by D = w1, ...wM . A topic is defined to be a
distribution over a certain vocabulary.
Intuitively, a document can be about a mix of topics, such as “dogs” and
“cats.” If most of the document is about “dogs,” you can imagine that “dogs”
topic could be 90% of the document while 10% is about “cats.” Furthermore,
“dogs” document would have high chance of containing words such as “pets,
puppy, and hound.” LDA is a statistical model for a collection of documents
that tries to capture this intuition by using generative process to discover
hidden semantic structures of the text.
If a topic is a distribution over some fixed vocabulary, “dogs” topic would
have words about dogs with high probability and “cats” topic would have
words about cats with high probability. In the generative process, it is as-
sumed that the topics are specified before the documents are generated. For
each document in a collection, a distribution of topics is randomly gener-
ated and for each of the word in a document, a topic is randomly chosen
from the distribution and a word is randomly selected from the distribution
over vocabulary. For example, if there is a collection of documents about
43
animals, and a distribution with topics “dogs” and “cats” was selected for a
document, this document will have each of its words generated from “dogs”
and “cats” distribution over the vocabulary for this collection. These words
could include words mentioned above such as “pets.”
4.2.1 Probabilistic modeling
LDA is one form of probabilistic modeling. In generative probabilistic model-
ing, the text data is assumed to be created through a generative process that
includes latent variables. Such process defines the joint distribution over the
observed and latent variables where the observed variables are the words of
the document and the latent variables are the topic structure. The process of
inferring the model is described as computing the conditional distribution of
the latent variables, or the posterior distribution given the observed variables
using the joint distribution defined.
If we say the distribution over topics z of a document is p(z) and the
probabilistic distribution over words w given topic z is p(w|z), we can write
p(zn = k) as the probability that the jth topic was sampled for ith word
token and p(wn|zn = k) as the probability of word wn under topic k. In such





p(wn|zn = k)p(zn = j), (4.1)
where K is the given number of topics. The conditional distribution p(w|z =
k) is the multinomial distribution over words for topic k and distribution p(z)
is the multinomial distribution over topics for document m and are denoted
as φk and θm respectively.
While PLSA does not provide any assumptions about how φ are generated,
LDA introduces a conjugate prior for the multinomial, a Dirichlet prior on
θ. K-dimensional Dirichlet distribution over the multinomial distribution









where the parameters α is a K vector where each components are positive,
and Γ(x) is the Gamma function. It is convenient to use a symmetric Dirichlet
distribution where each component of the parameter is equal to the same
value. The Dirichlet prior smoothes out the topic distribution where larger
α leads to more smoothing on topic distribution in a document.LDA has
another Dirichlet prior, β on φ, which smoothes the word distribution on
topic.
Given parameters α and β, the conditional distribution of a document, the
joint distribution of a topic mixture thetam, the topic mixture proportion for
all documents φ, words for document wm and topic assignment of words to
topics zm is as follows:




where Nm is the document length and n is the index for the word. From
(4.3), the marginal distribution of a document can be obtained by summing




p(wm,n|φk)p(zm,n = k|θm), (4.4)
which is equivalent to (4.1) with document-specific mixture. The probability
of a corpus can be found by taking the product of the marginal probabilities
of single documents as follows:









whereW is the corpus. Following the above discussion, LDA can be presented
using a probabilistic graphical model as shown in Figure 4.1. Each node is a
random variable where hidden nodes are unshaded and the observed nodes
are shaded. The rectangles are “plate” notation, which denotes replication.
The N plate is the collection of words within documents and the M plate is
the collection of documents within the collection.
45
Figure 4.1: Probabilistic graphical model for latent Dirichlet allocation.
4.2.2 Generative models
LDA assumes that a generative process is used to create documents and re-
versely postulates a latent structure consisting of a set of topics by observing
the data (the words) and performing probabilistic inference on the model.
The training process can be viewed as fitting a generative model such that the
best set of latent variables that can explain the observed data is searched.
For the generative process, the words are selected from the bags of words
representing each topic and used to create documents that are mixtures of
topics.
4.3 Collapsed Gibbs Sampling
In order to compute the conditional distribution of the topic structure given
the observed documents, we perform a variant of Gibbs sampling called col-
lapsed Gibbs sampling [66, 67]. The conditional distribution or the posterior
46
Algorithm 3 Generative model for LDA
1: procedure Generative process
2: for all topics k ∈ [1, K] do
3: sample mixture components φk ∼ Dirichlet(β)
4: end for
5: for all documents m ∈ [1,M ] do
6: sample document length Nm ∼ Poisson(ξ)
7: sample mixture proportion θm ∼ Dirichlet(α)
8: for all words index n ∈ [1, Nm] ∈ document m do
9: sample topic zm,n ∈Multinomial(θm)




with the hyperparameters omitted is given by







k=1 p(zi = k, wi)
. (4.6)
The denominator is the marginal probability of the observation and is
computationally intractable. Here we apply Gibbs sampling to approximate
the posterior. A Markov chain is defined on the latent topic variables and
the algorithm is run to collect samples from the limiting distribution and
approximate the distribution from the collected samples. In this setting, the
desired Gibbs sampler uses the full conditional p(zi|z¬w) to simulate p(z|w).
The conditional can be derived from the joint distribution p(z, w|αβ) as
shown in [68]











m,¬i + αk]− 1
, (4.7)
where the mixtures θ and φ were integrated out, hence the “collapsed” Gibbs
sampling. Nn and Nk are matrices of counts for number of times word n is
assigned to k and number of times words are assigned to topic k in document
m.
The multinomial parameter sets θ and φ for the state of the Markov chain
are found by using their definitions as multinomial distributions with Dirich-
let prior. By using Bayes’ rule and using the expectation of the Dirichlet
47













where m, n, and k are the document number, the word, and topic, respec-
tively.
Using (4.7), (4.8) and (4.9), the Gibbs sampling algorithm given by Algo-
rithm 4 can be executed. It uses five large data structures. The distribution
tables nzm and n
n
z , have M × K and V × K, respectively. Their sums are
stored with dimension M and K. The state variable zm,n has dimension of
W , which is the size of the entire corpus.
4.4 Why LDA Hardware Acceleration
We offer here some thoughts about why LDA is an interesting demonstra-
tion vehicle for acceleration. First, let us consider the platform issues. With
GPU cloud computing becoming more common, it is natural to see some
work done on GPUs [69, 70, 71, 72]. Although implementations on GPU
show orders of magnitude performance gains over CPU, there are some ar-
chitectural limitations that are hard to overcome. First, the limited memory
capacity forces GPU-based implementation to have more complex schemes
for streaming input and model data. Also, tuning parameters, such as the
number of chunks, the number of threads, and the number of workers, require
a lot of effort and can largely affect the performance. Most of the work shows
much effort put into overcoming architectural limitations of fitting an LDA
problem on GPU. We would like to avoid such inefficiencies.
Instead of trying to solve a problem using a hardware platform that is not
ideal for LDA, we identify FPGA as a good potential alternative accelerator
platform. The last few years have shown the remarkable pace at which FPGA
based accelerators have begun to appear in the data center. Obviously, the
Microsoft Catapult experiment [73, 74] shows one path for commercialization
48
Algorithm 4 LDA collapsed Gibbs sampling
1: procedure LdaCollapsedGibbsSampling
2: for all documents m ∈ [1,M ] do
3: for all words w ∈ [1, Nm] in document m do
4: sample topic index zm,n = k ∼Multinomial(1/K)
5: increment document-topic count: nkm+ = 1
6: increment word-topic count: nnk+ = 1
7: increment document-topic sum: nm+ = 1
8: increment word-topic sum: nk+ = 1
9: end for
10: end for
11: for iteration t ∈ [1, T ] do
12: for all documents m ∈ [1,M ] do
13: for all words w ∈ [1, Nm] in document m do
14: decrement document-topic count: nkm− = 1
15: decrement word-topic count: nnk− = 1
16: decrement document-topic sum: nm− = 1
17: decrement word-topic sum: nk− = 1
18: sample topic index k ∼ p(zi|z¬i, w)
19: increment document-topic count: nkm+ = 1
20: increment word-topic count: nnk+ = 1
21: increment document-topic sum: nm+ = 1




26: read out parameter set φ
27: read out parameter set θ
28: end procedure
of these ideas. In the Catapult design, every node in an Azure-scale cloud
datacenter is enhanced with a large plug-in FPGA, which is itself connected
directly to the top-of-rack router via high-speed links, thus permitting not
only custom computation but fast (microsecond-level latency) communica-
tion from any FPGA to any other FPGA in the data center. In the same
timeframe, Intel has recently acquired Altera [75] and has announced plans to
integrate FPGA fabric on server-class chips, and also as a separately located
chip inside each server box. This suggests that there may be a real platform
opportunity to explore larger ML tasks on custom computing fabrics.
Second, the LDA task itself has the attractive characteristic of being ex-
tremely large [76, 77, 78, 79]. For example these tasks can comprise millions
49
of topics and 1-million-word vocabulary on billions of documents, and the
dynamic range of the numerical quantities being manipulated is extremely
wide (i.e., lots of very small probabilities). Of course, there are some con-
founding issues here that make this analysis challenging. A real problem in
these practical designs is simply accessing the documents and moving rele-
vant numerical quantities attached to them, from their location in storage
to the compute engine that needs them. We do not wish to address these
data-movement issues (though we note that they seem to comprise yet an-
other attractive universe of performance problems). We do, however, wish to
understand whether the core computational problems are “accelerator wor-
thy”. But even here, there are challenges. No practical system computes
“everything” associated with a simplistic LDA model; it takes advantage of
some of the sparsity structure inherent in the problem. For example, most
documents do not include text that matches most topics, so many of the core
components of the sampling calculation can either be omitted, or approxi-
mated with appropriately sensitive lower-precision surrogates. Nevertheless,
we find much to be explored in the LDA model, and our work looks at some
simple LDA kernels to see what might be worth acceleration. Toward that
end, we have done one simple experiment to validate some of these ideas,
discussed in the next section.
4.5 Profiling a Simplistic LDA Inference Engine
Much research has been performed on scaling LDA to larger datasets and
models with algorithmic focus, systems focus or both. The recent large-scale
implementations of LDA train on up to billions of documents using as many
as thousands of cores on a cluster. One of the reasons why a large cluster is
required is the inference algorithm being a limiting factor [80]. There have
been many works on developing a better LDA inference algorithm for speed,
but here we will concentrate on accelerating a commonly used sampling al-
gorithm instead of fancier ones that require more control. For the purpose
of early explorations, we will disregard how the data is distributed across
different machines on the cluster and focus on a single machine performance.
We took a free C++ LDA implementation using Gibbs sampling called
GibbsLDA++ [81], and profiled it using gprof and Intel VTune.
50
Figure 4.2: Profiling result for GibbsLDA++ library show that more than
96% of time is spent on Gibbs sampling.
Figure 4.2 shows the results of our simple profiling experiment. This simple
Gibbs sampler for LDA spends over 96% of its total time doing the core
sampling computations. Other tasks such as data movement, housekeeping,
sorting, etc., are essentially just noise. This suggests that there is something
challenging going on inside the core LDA compute kernel itself.
4.6 FlexLDA Architecture
In this section, we describe the overall architecture of FlexLDA, our FPGA-
based LDA implementation using Gibbs sampling inference. The basic prin-
ciple behind the design process was simplicity and flexibility.
FlexLDA assumes that the data is streamed either from DRAM or from
the host CPU into the accelerator. A single word is sampled at a time and the
entire design is pipelined. Its inputs are the word that is currently sampling
along with its document number m and topic number k along with Dirichlet
variables
Before we go into more detail about the architecture, we divide the process
51
Figure 4.3: Three stages of LDA accelerator for processing one word token
The first and last stages update the local distribution tables for given topic.
The second stage performs Gibbs sampling by streaming in parts of the local
distribution tables for all possible topic values.
into three major stages based on local cache access. Figure 4.3 shows three
stages where each of the stages performs an instruction that accesses the
cache that contains the distribution tables. Each of the stages are described
below.
• Subtract current topic : Given the current topic of the input word
token, the word-topic and document-topic distribution tables are up-
dated. The count for the table element corresponding to the current
topic is decremented by 1.
• Gibbs sampling : This stage uses the document number, the word
value and the locally cached distribution table values, to sample a new
topic for the word token. The parts of the local distribution tables to
be streamed are determined by the document number and the word
value for all possible topic values.
• Add new topic : Using the new topic sampled from Gibbs sampling
stage, the word-topic and document-topic distribution tables are up-
dated. The count for the table element corresponding to the new topic
is incremented by 1.
4.6.1 Sampling a word token
For the FPGA implementation, three stages mentioned above are imple-
mented each as a functional block residing on a common bus as shown in
Figure 4.4, while some of the inputs are shared between the three blocks,
such as document number and the word token. Three blocks will each have
52
Figure 4.4: Top level architecture for sampling word token. All the blocks
are connected through a common bus. The top three blocks perform the
three instructions for sampling a word token by reading in cached values
from the distribution tables, the bottom four blocks.
separate signal corresponding to the topic information needed. The decre-
ment table block requires the current value of the topic. The Gibbs sampling
block uses the total number of topics to traverse through all possibilities to
sample the topic. The increment table block takes the new sampled topic to
update the values in the local distribution tables.
The three stages in Figure 4.3 correspond to the three pipeline stages of the
hardware accelerator built on the FPGA, although pipelining is not enabled.
A simple handshaking logic for the bus can be enabled to pipeline three
instructions on the accelerator for higher throughput.
4.6.2 Update tables
In order to subtract current topic and add new topic, we used two identical
copies of the update table block, which is shown in Figure 4.5. It is set
to perform either instruction by setting a flag indicating whether to use

































































Figure 4.5: Update table block. The left side of the block diagram
computes the addresses used to find the values corresponding to the word
from the distribution tables. The right side takes the read values and either
decrements or increments them to produce new values to be written back to
the distribution tables.
as the decrement table block and the increment table block.
The logic itself is very simple. We take the word token, the number of
documents, and current topic to calculate addresses to be use for distribution
tables. The values stored in the distribution tables are read into the block
and updated, and decremented or incremented based on the flag pass into it.
4.6.3 Gibbs sampling
Once the decrement table block is done updating, Gibbs sampling block
starts sampling a new topic. Figure 4.6 shows the Gibbs sampling block.
The logic for sampling a new topic sits within a nested block called the sam-
ple new topic block. The left side of Figure 4.6 shows the address generation
logic. Once the document number, the total number of topics, and the word
are entered, the addresses for accessing four distribution tables are generated
every cycle. The generated addresses are used to read in values from the
distribution tables, corresponding to the input variables mentioned above
54
Figure 4.6: Gibbs sampling block uses the document number, the total
number of topics, and the word, to calculate addresses for reading in values
from the distribution tables, as shown on the left side. The addresses are
updated every cycle and values are streamed into FIFOs and fed into the
sample new topic block.
for all possible topics. The values are cached in FIFOs to limit the Gibbs
sampling block’s control over the bus. Without these FIFOs, the Gibbs sam-
pling block would have to keep checking on the bus while sampling through
all the possible topics. The size of the FIFOs are determined by the number
of topics the system will support.
4.6.4 Finding conditional distribution
The very basic kernel in Gibbs sampling for LDA is finding the conditional
distribution, which is derived from the joint distribution. Our conditional
distribution computation block shown in Figure 4.7 finds a conditional for a
word token, using the hyperparameters α and β and counts of words from
topic k, counts of topic k observed with a word of document m, and the sum
of them for each topic and document. We have parallelized the computation
as much as we can knowing that area should not be a big concern for our
design. The block contains four 32-bit adders for finding two denominators
55
Figure 4.7: Conditional distribution computation block streams the
word-topic and document-topic distribution table values in with beta and
alpha values. A tree of arithmetic operations will be executed for a
conditional probability corresponding to a topic.
and numerators. The next stage in the pipeline multiplies the two numerators
and denominators with each other prior to division in the following stage.
This produces one conditional for one topic k. This computation is repeated
for all topics in K.
4.6.5 Updating topic using conditional probabilities
As the conditional probabilities are computed for each topic k, they are used
to create cumulative sum for each k and are stored in a depth-128 FIFO
as shown in Figure 4.8. Once the conditional probabilities and cumulative
sum for all topics are computed, the final sum is multiplied by a 32-bit
uniform random number, generated from a pseudo random number generator
built similar to [4]. The pseudo random number generator used for source
separation hardware in the previous chapter has been reused. The result is
then compared to each of the cumulative sum values computed for each topics























Figure 4.8: Sample new topic block takes each of the conditional
probabilities found for each topic and samples the new topic for the input
values entered.
4.7 Parallelization
Our architecture requires very low bandwidth to sample one word. It takes
48-bit input which includes document number (16b), topic (16b) and word
and operates on it for many cycles. The block shown in Figure 4.4 can
operate in parallel given that memory bandwidth is not limited using the
data partition scheme described in [69]. For word wi in document mi and
word wj in document mj, if wi 6= wj and mi 6= mj, simultaneous updates
of topic assignment by (4.7) will not have memory conflicts on word-topic
distribution table Nn and word-topic distribution table Nk. It does require
a synchronization step to aggregate local matrix numbers to global numbers
and local matrix numbers will be updated by this.
4.8 FPGA Implementation
We have implemented our design on Xilinx Virtex UltraScale FPGA VCU110
Development Kit [82] to study performance and estimate power. Table 4.1
shows features for the FPGA on the development board.
Our FPGA implementation runs at 220 MHz, while utilizing resources
shown in Table 4.2 for just one core. Furthermore, dynamic power and static
57
Table 4.1: XCVY190-2FLGC2104E FPGA
Feature Number













power consumed are estimated by Xilinx Vivado tool to be 1.122 W and
1.871 W respectively. These power numbers are for estimation only, and
more accurate power should be measured from the board.
Our single core processing 23469 words for 20 topics took 3.65 million
cycles at 220 MHz which resulted in 0.016 seconds. The C++ software ref-
erence on Intel Xeon took 0.02 seconds. Our single core achieved a speedup
of 1.25X. It takes approximately 0.707 µs to process one word and, assuming
average length of 200 words per document [80], 0.141 ms per document and
roughly 25 million documents per hour. Furthermore, because our input is
only 48 bits the number of parallel cores can be drastically increased, fur-
ther improving the throughput number linearly until the memory bandwidth
is saturated. Our DRAM interface implemented on-board currently has 14
Gbit/s bandwidth and our single core consumes around 70 Mbits/s, meaning
we can theoretically have 200 cores in parallel assuming the data is parti-
tioned in advance with the condition mentioned above. The 200 cores will
process up to 5 billion documents per hour.
58
4.9 Summary
We have implemented our LDA accelerator implementation called FlexLDA
using collapsed Gibbs sampling inference on Xilinx VCU110 FPGA. Our sin-
gle core achieves 1.25X speedup compared to Intel Xeon core while running
at much lower frequency of 220 MHz. It is capable of processing one word in
0.7 µs and one document in 0.14 ms. This roughly translates to 25 million
documents per house. Although there are many different layers in which par-
allelization could have been implemented, we have reported the number for
non-parallel single core implementation. Our implementation only requires
48 bits per cycle, and because it requires such low memory bandwidth and is
designed to allow parallel computation, we can theoretically have up to 200
cores on our machine, just by simply having more cores operating in parallel,
processing 5 billion documents per hour.
These results show great potential for our design. Further improvements
can be made by parallelizing on the core level to sample more than one word
at a time, pipelining instructions on each core, and parallelizing Gibbs sam-
pling procedure within the core. Parallelization of LDA using Gibbs sampling
requires an evaluation of asynchronous update mechanism for asynchronous
Gibbs sampling, and will be discussed in more detail in the next chapter.
59
Chapter 5
Towards Parallel Gibbs Sampling
So far, although both mobile sound source separation and cloud topic mod-
eling hardware have been designed to allow multiple concurrent pipelines to
run, we have not actually experimented with parallelizing Gibbs sampling
as we either did not need more throughput (for source separation) or have
not concluded the experiment with asynchronous Gibbs sampling. In this
chapter we share our study on parallelizing Gibbs sampling and offer some
directions for future studies.
5.1 Parallel Gibbs Sampling MRF Inference
Geman had implemented Gibbs sampling to solve the Bayesian image restora-
tion problem in 1984 [22]. Geman also suggested that each collection of vari-
ables can be assigned to an independently running asynchronous processor.
Since then, many have attempted to achieve faster convergence by coming up
with variations of Gibbs sampling, another MCMC method, and/or parallel
implementations. Also, researchers have noticed that asynchronous Gibbs
sampling works well in practice [76, 66, 67] and studied theoretical explana-
tions to better understand why and when it works well [83].
The parallelization efforts can roughly be classified into two categories:
parallel chains and single chain [84]. Parallel chains are created from the
same MCMC scheme, just as single chains are created except that they will
tend to be shorter than single-chain versions. Parallel chains have a higher
probability of better directing convergence to the equilibrium distribution.
However, every chain will have a burn-in period, which is the time from
the start until useful samples are obtained from the conditional distribution.
In principle, samples obtained after the burn-in period should be from the
equilibrium distribution. Only with an assumption that the burn-in periods
60
Algorithm 5 Chromatic Gibbs sampling algorithm
1: procedure ChromaticGibbsSampling(x)
2: for t = 1 to T do
3: for k = 1 to K do
4: parfor i in color k do
5: x
(t+1)
i ∼ P (xi|x
(t+1)












are relatively short, can running parallel chains on different processors be the
most effective way to maximize performance.
Furthermore, Gibbs samplers often have very poor mixing and convergence
properties, which forces a significant portion of samples to be discarded, and
the iterations left after the burn-in are often down-sampled, e.g. 1 out of 100
samples. For the worst-case scenario, it would be unwise to use the parallel
chain method as it will be wasting a lot of samples created from multiple
chains.
For single chains, parallelization depends on the conditional independence
structure of the model. As long as an update of a variable does not depend on
another variable that is being updated, they can be done in parallel. In other
words, due to local Markov property which states that a node is condition-
ally independent of all other nodes given its neighbor, all nodes that are not
neighbors of each other are conditionally independent and can be sampled si-
multaneously. A variant of Gibbs sampling, called chromatic Gibbs sampling
[85], adapts the classical graph coloring technique in a “checker board” pat-
tern, allowing direct parallelization of single chain sequential Gibbs sampler
while maintaining the ergodicity, generating chains that converge to targeted
stationary distribution. Figure 5.1 shows an example MRF colored in dark
and light colors. The dark node can be executed in parallel and white can be
executed in parallel as well, once the processing for dark colors is finished.
The C++ prototype has been modified to create a novel MRF inference
kernel, which we named chromatic Gibbs inference that uses chromatic sam-
pling. The chromatic Gibbs inference is parallelized via Pthreads and CUDA
for multi-core CPU and GPU.
61
Figure 5.1: Chromatic Gibbs inference using Pthreads. The graph is
divided into N horizontal regions and each region is processed with
chromatic Gibbs sampler.
5.2 Multi-core CPU Implementation
First, the baseline implementation was parallelized using Pthreads in or-
der to check the scalability of this application on the system with a multi-
core CPU. All the evaluations were done with a computer system with a
6-core Intel Xeon X5650 2.67GHz CPU, 24GB of memory, and Scientific
Linux (release 6.1). The hot spots in the code were analyzed using gprof
(2.20.51.0.2-5.20.el6 1.1), Intel VTune (2013.update2). An input MRF has
513x966 nodes, and every node in the MRF was processed 500 times. The
runtime of the same number of cores are measured five times, and their av-
erage is reported.
Figure 5.1 shows how the chromatic Gibbs inference was parallelized using
a multi-core CPU. If N cores are available, the input MRF is divided into
N horizontal regions; the N -th core processes the same type of nodes (e.g.
white nodes) in the N -th region at the same time. After all the cores finish
the jobs, the same things are done again for the remaining type of nodes (e.g.
dark nodes). In order for the chromatic Gibbs inference to process each node,
the chromatic Gibbs inference needs to fetch data for the node from the main
memory, and the data for the 2D MRF is stored in the main memory in the
row major order. Therefore, assigning nodes in the same row to a core can























Data"Cost" Smoothness"Cost" Calculate"Probabilites" RNG" Sampling"
Figure 5.2: The execution time of the chromatic Gibbs inference on a
multi-core CPU system. For lower number of threads such as two or three,
the overhead of using multiple threads outweighs the benefits.
Figure 5.2 shows how the execution time of the chromatic Gibbs infer-
ence varies when different number of cores are used. Each execution time is
decomposed into the five major tasks, which are calculating data cost, calcu-
lating smoothness cost, calculating sampling probabilities, random number
generation, and drawing new labels.
To calculate the data cost of a node, it is necessary to access data of
the node in the main memory. On the other hand, calculating smoothness
cost additionally requires the chromatic Gibbs inference to access data of
four adjacent nodes from the row major 1D array in the main memory. The
data access pattern of smoothness cost calculation makes cache space locality
worse and, consequently, it spends much more time than data cost calcula-
tion. However, it scales better than data cost calculation as the number
of cores increases. Calculating sampling probabilities includes exponential
operations that are expensive in terms of runtime, and its runtime is lin-
early improved as the number of cores increases. In each core, the compiler
automatically unrolls loops and uses sse2 vector instructions.
In the multi-core implementation, the rand function in the baseline imple-
mentation has to be changed to the rand r to guarantee that random number
63
sequences generated in each core are independent to each other, and it in-
creases runtime. The runtime of rand r decreases fast as the number of cores
increases.
As we can see, the performance of the chromatic Gibbs inference performs
better as more cores are involved. We suspect that GPUs which has a larger
number of parallel cores will result in far greater speedup for our application.
5.3 GPU Implementation
For our second implementation, we have taken the same baseline chromatic
Gibbs inference kernel and parallelized it for GPUs by using CUDA for par-
allel programming. This implementation was run on a machine with 2.3
GHz Intel Core i7 processor (I7-3615QM), 16GB of memory, and NVIDIA
GeForce GT 650M GPU, running Mac OS X 10.8 Mountain Lion. GeForce
GT 650M has 1GB of global memory and 2 multiprocessors each with 192
CUDA cores running at 900MHz. It has 64 kB constant memory and 48 kB
shared memory per block with a warp size of 32, where warp is the minimum
size of the data processed in SIMD fashion by a CUDA GPU. CUDA driver
version 5.0 was used GPU of capability version 3.0. We use this hardware to
process 513x966 MRF we created from sound mixtures and it is iterated 500
times.
The baseline chromatic Gibbs inference kernel assumes that each node can
be simultaneously updated independent of other nodes in the same coloring.
However, Pthreads implementation above showed that there is a limit to how
much multithreading can benefit over the overhead caused from it. The idea
is to copy the MRF data cost and labels in row major order to the shared
memory in the device and create a local copy of a smaller fragment of the
MRF and use it to perform MRF inference for the nodes associated with it, in
each thread. As opposed to multi-core implementations that require reading
from and writing to global memory, after an initial copy of the MRF data
from the host memory to device global memory, each iteration and update can
stay within the device and utilize a much higher internal memory bandwidth.
We illustrate how we structured our memory organization in Figure 5.3. Each
thread performs MRF inference for a 4x4 subgraph of the MRF. The threads
in a 8x8 neighborhood were organized into a block. For the 513x966 source
64
(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1) (1,2) (1,3)
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
Figure 5.3: Grid, block and thread organization. Each thread processes 4x4
subgraph with 8x8 threads making up a block.
Figure 5.4: The labels from the neighboring nodes of the 4x4 subgraph are
copied from shared memory.
separation MRF, we end up with a 17x31 grid of blocks with 64 threads each.
As shown by the two “checkerboard” patterns for each thread in a 8x8
block, the thread will process 4x4 nodes in two iterations, updating the
highlighted (dark) parts of the 4x4 matrix. When actually updating the
4x4 nodes, we need to see neighboring node’s label in order to calculate the
smoothness cost as shown in Figure 5.4. The neighboring nodes’ labels are
copied from the shared memory to 4 arrays for each of the 4 sides of the
matrix. These labels are used only to calculate smoothness cost and not
updated with the 4x4 matrix.
An efficient generation and high quality of of random numbers is crucial
for the performance of Gibbs samplers. As our design require a large number
of random numbers to be generated fast in parallel, we use an array of linear

















Figure 5.5: Runtime for varying width of the source separation MRF. From
column sizes 8 to 64, there is little increase because initial overhead has not
been masked by parallelism as there are not enough thread occupancies
within the streaming multiprocessor.
sequence of random numbers with n ∈ N can be generated by
Xn+1 = (a Xn+ c) mod m, (5.1)
where a, c and m are integer coefficients. We selected a = 1, 664, 525, c =
1, 013, 904, 223 and m = 231 as our parameters as suggested in [87]. The
m is selected as 231 to allow modulus operation to be computed by merely
truncating all but rightmost 32 bits. The pseudo random number sequence
X ∈ [−231, 231−1] is obtained and normalized with 231 to generate uniformly
distributed pseudo random numbers.
To check the scalability of GPU implementation, the fraction of GPU pro-
cessing time for allocation, memory transfer and actual inference has been
compared for different size of the MRF. Because the height of the MRF de-
pends on the STFT size, it should be fixed to 513, whereas the width will
increase if the input sound mixtures are longer. As the longest test bench
spans 966 nodes, we test widths from 8 to 1,024 columns and the total run-
times for each width are shown in Figure 5.5. We omitted plotting times
for memory allocation and transfer as they take up less than 0.3% of to-
tal runtime. The runtime shows linear increase in runtime as the width is
increased, with the exception of column sizes 8 to 64 where the runtime fluc-
tuates around 800ms. This is due to the fact that there is insufficient thread

























Height"="1" Height"="2" Height"="3" Height"="4" Height"="5" Height"="6" Height"="7" Height"="8"
Figure 5.6: Runtime for blocks with horizontal orientation
from memory allocation and transfer shows that the CUDA implementation
is optimal for our problem. Also, the size of the input data is correlated
with the runtime. This correlation indicates that, by having different sizes
and orientation of the thread block, we may be able to further optimize our
implementation.
In order to explore the performance for different block sizes and orienta-
tions, we tried a permutation of varying widths and heights of the thread
block. Figure 5.6 shows horizontal block orientations that stretch in the x
direction and Figure 5.7 shows vertical block orientations that are long in y
direction. Each result is an average of five runs made for each configuration.
It is expected that the block orientations in the y direction will perform worse
as the MRF data is stored in row major order. When the data is copied to
the shared memory from the global memory, horizontally long blocks will
have better locality of data than the vertical blocks. For thread blocks with
512 threads or more (i.e. 8x64), the speedup is not available due to 48 kB
shared memory restrictions.
For blocks with horizontal orientations, it is immediately noticeable that
thread blocks that are not too small and not too large are the best performing.
First, for thread block width of 8, the best performance is achieved with
height of 3 to 5. For thread block width of 16, the best performing block
height is between 2 and 4. For width 32, its best height is from 1 to 3 and




























Height"="8" Height"="16" Height"="32" Height"="64"
Figure 5.7: Runtime for blocks with vertical orientation.
will never achieve a comparable performance. This is due to zeros padded
when the last rows and columns of blocks cover far more than the original
MRF dimensions. This increases the number of nodes, including the zero
padded arbitrary nodes that were created to fill the last row and column
blocks. All the best performing sizes are in the range of 32 threads to 94
threads. Because the warp size is 32, when the number of threads fall below
32 threads, it will still be processed as if it has 32 threads, resulting in
redundant computations. Furthermore, if the number of threads in a block
is small, it will reach maximum current blocks of 16 blocks per streaming
multiprocessor faster resulting in low occupancy of GPU resources.
The vertical orientation results are generally worse than the horizontal
ones, due to row major order of MRF data stored in the memory. Further-
more, because the height of the MRF is smaller, more zero padding could
occur than the horizontal direction, as the height of the block is increased
to a large number. The overall trend shows decrease in performance as both
width and height are increased just like the horizontal orientation results.
The results for thread blocks with less than 32 threads are omitted due to
the reason given above (e.g. warp size of 32).
68
5.4 Summary
We have shown two parallel implementations of sound source separation using
MRF and chromatic Gibbs inference. By parallelizing the C implementation
of chromatic Gibbs inference using Pthreads for multi-core and CUDA for
GPUs, we achieved up to 1.9X and 22.9X speedups each, compared to our
baseline chromatic Gibbs inference. By looking at the experimental results
such as portion of memory allocation and transfers within the total runtime,
we can verify that GPU implementation spends less than 0.3% of time on
memory operations and the rest on the actual inference computations. The
different orientations for the block affect the performance on GPU as the
MRF data is stored in certain order. For our implementation the data was
stored in row major order, thus horizontal orientations work better than the
vertical ones. The preferred orientation is dependent on how the MRF data
is stored into the memory and can be altered if we were to modify the order
the MRF data is stored in memory.
Furthermore, the number of threads in a block is a significant factor in
GPU performance. One must know the hardware specifications of the GPU
used for execution to take full advantage of the parallelism. In our case, the
sweet spot lies between 32 and 96 threads per block. Our initial implemen-
tations of 8x8 block and horizontal oriented blocks, i.e. 1x64, 2x32, etc., are
the best performing configurations for sound source separation using MRF
via chromatic Gibbs inference. The separation of graph using multiple par-




Reduced Precision Gibbs Sampling
On the surface, sampling based inference methods seem like they might be
intrinsically somewhat robust to foundational statistical upsets. They get
their final answers by “integrating” over a great many random samples. Per-
haps if the random samples are slightly wrong, it might not be fatal; perhaps
we might just run a more random samples to compensate.
We hypothesize that most of the “action” in a Gibbs sampler happens with
numbers very close to zero. This means there is an interesting opportunity
to explore “lightweight” floating-point reduced precision architectures in this
context instead of conventional method of using fixed-point computes.
We note that there are numerous advantages to small floating-point for-
mats, mainly explored in the past in the context of power reduction, and
approximate computing. The cost of the floating-point unit (FPU), in terms
of both chip area and power, is strongly correlated with the number of bits
used. Most common implementations of the floating-point units are 64-bit
double precision and 32-bit single precision FPUs. However, numerous works
has shown that not all applications require the full 64-bit double precision
and sometimes not even 32-bit single precision. Tong et al. [88] showed that
floating-point arithmetic for real-life applications such as speech recognition,
neural network algorithm, fingerprint classification, and imaging processing
can be done in reduced floating-point implementation accurately computing
the result with up to 60% reduction in energy/operation in the FPU. Fang et
al. [89] implemented inverse discrete cosine transform (IDCT), widely used
in real-world applications such as video coding and speech recognition, to
demonstrate 15-bit floating-point arithmetic using only 10% power of 32-bit
IEEE floating-point implementation. Lee et al. [90] performed bit reduction
down to 16 bits on a face detection algorithm to reduce the size by 50% and
power by 17%. Our current results show that we should be able to scale the
bits down further than the related work mentioned above. All of these ad-
70
vantages may continue to accrue for bit-reduced floating-point in the context
of sampling. However, the key difference in our application is that we hy-
pothesize that because we are dealing with many numerical results very close
to zero, we need floating-point specifically to preserve sufficient accuracy to
have any corrective capability.
6.1 Numerical Profiling for a Gibbs Sampling Engine
As an attempt to get an idea of what the computations that take place
within the Gibbs sampler look like, we have taken stereo sound mixtures
and separated them with the Gibbs sampling inference while observing the
numerical distribution of the numbers.
The Gibbs sampler takes data costs from the MRF model as inputs prior to
adding smoothness costs and performing Gibbs sampling. Figure 6.1 shows
that data costs or the input to the Gibbs sampler have a dynamic range
that spans from 0 to 400, with 89.4% of the data costs being less than 10.
Furthermore, 66.79% of the data costs are less than 5, and 13% less than 1.
Once the total cost is found, it is multiplied by a fixed constant β and
converted to a probability by applying it to a negative exponential function.
The resulting probabilities are plotted in Figure 6.2.
Most of the numbers are extremely small fractions with 52.2% of the prob-
abilities being smaller than 0.2 and 36.9% less than 0.1. Furthermore, 12.4%
of the probabilities were less than 0.01. This leads us to believe that floating-
point arithmetic may be more suitable given the level of precision that can
be achieved for numbers of bits given. As the first step, the number of bits
in floating-point is reduced to see the effect on the label decision error rate,
which is determined by comparing the correct output label and the label
produced by the reduced floating-point Gibbs sampler.
Figure 6.3 shows label decision error rate plotted against different bit-
reduced floating-point formats. When reducing the number of exponent bits,
it is not until the number is reduced to 4 bits or less that it starts having
some effect on the error rate. At 4 exponent bits, error rate of 0.2% is
distributed for all configurations of mantissa (or significand) bits. If we limit
the number of exponent bits to 4 or higher, the number of significand bits
can be reduced to 3 without exceeding 1% error rate. When the number of
71






















Histogram of Input Data Costs
(a)
























Normalized Histogram of Input Data Costs
(b)
Figure 6.1: (a) Histogram of input data cost. (b) Normalized histogram of
input data cost.

















































Normalized Histogram of Probabilities
(b)
Figure 6.2: (a) Histogram of probabilities. (b) Normalized histogram of
probabilities.
72
Figure 6.3: Decision error rate for bit-reduced floating-point
representations. The number of mantissa (or significand) bits from 1 to 23
and the number of exponential bits from 1 to 8 are plotted.
significand bits are reduced down to 2 and 1, it shows error rates around 4%
and 7%, respectably for exponent bits of 4 or larger.
6.2 Bit-Reduced Floating-Point for Sampling
The results of the numerical study of the previous section both confirm our
overall hypothesis about the arithmetic being computed during sampling, and
raise some new interesting opportunities. First, we note that, as predicted,
the Gibbs sampler is computing across a wide dynamic range, i.e., many
numbers very close to zero, but also some large numbers. The initial engine
design uses a 32-bit point format, and for probabilities, these are all fractional
bits. But a surprisingly small custom floating-point version of the arithmetic
appears to work well. At the very least, we can use a very small floating-point
block, we believe.
However, the data reveals other opportunities. The sampling block works
surprisingly well with extremely small floating-point - 8-bits or even less -
which suggests that perhaps we might use only this arithmetic style for the
73
entire engine. This creates another question: How might we do better than
a sampling engine that is already built out of very small floating-point com-
putations? One answer might be: With yet more very small floating-point
units that vote [91, 92, 93, 94, 95]. The problem with conventional modular
redundancy architectures, e.g., triple modular redundancy or higher, is not
that they do not work well, but that it is unreasonably expensive to use 3
or 5 or 7 or more complete copies of the computational data path to enable
appropriate voting. But if the core computational path is itself extremely
small, this opens up some new voting architectures. We are particularly in-
terested in the idea of using really small floating-point - less than 8 total
bits, which means each arithmetic block is very few gates, with rather small
logistical depth - and exploring whether a voting or “median” sort of archi-
tecture may yield good results. This could also be coupled with the study
of asynchronous Gibbs sampling as these voting structures could easily be
converted to asynchronous updates.
6.3 Summary
We discussed how for the sampling architectures, a small floating-point replica
may be a better choice as it deals with many tiny numbers very close to zero.
Our numerical profiling shows that the Gibbs sampling engine deals with
extremely small fractions with more than 50% of the numbers being smaller
than 0.2 and 36.9% less than 0.1. Furthermore, 12.4% of the numbers were
less 0.01. Bit-reduced floating-point implementation of the Gibbs sampling
engine shows that even with 8-bits of floating-point, the decision error rate is
less than 5%. This shows the robustness of the algorithm and also suggests
that we may be able to create more energy-efficient architecture than regu-
lar fixed-point implementations. One of the architectures we envision to be
effective is a modular voting structure that can take advantage of multiple
very small floating-point computes to save power.
74
Chapter 7
Conclusions and Future Work
One prominent area of machine learning for unsupervised and semi-supervised
learning is probabilistic graphical models which are used for probabilistic
modeling of a problem where probabilistic inference is performed to reason
the model created. Unlike a popular supervised learning technique, deep
neural networks, the data does not necessarily come in a nice matrix form
and does not involve large number of matrix multiplications. In this work,
we designed and demonstrated a set of MCMC-based probabilistic inference
accelerators for probabilistic graphical models, namely, Markov random fields
and latent Dirichlet allocation, a undirected and a directed graphical model
respectively. We created a set of Gibbs sampling inference accelerator which
is a Markov chain Monte Carlo (MCMC) based method, one of the most pop-
ular families of inference algorithms used for not only graphical models but
also other areas of machine learning such as restricted Boltzmann machines
and deep belief networks.
We first described our work on a Gibbs sampling accelerator used in a
mobile setting. We took a mobile perceptual application called sound source
separation and modeled the problem using Markov random field. We then
performed maximum a posteriori inference using Gibbs sampling to get the
best set of labels that would separate the sound sources. We also used
expectation-maximization to update the Gaussian parameters used in the
MRF. We implemented all this on both FPGA and virtual ASIC. An FPGA
implementation runs at 150 MHz with 207 kB RAM while achieving SDR
up to 7.021 dB with only 1.6 ms latency, a 22X speed-up; it confirms real-
time streaming feasibility. A 45 nm virtual ASIC design running at 20 MHz
requires fewer than 10 million gates while consuming 70 mW, which is 52X
better than ARM Cortex-A9 software reference design. By using Synopsys
32/28 nm Generic library, we were able to achieve 40 mW while running at
150 MHz. Overall, we believe that our design shows that our MRF source
75
separation architecture is viable for real-time streaming implementation for
the mobile form factor.
We also implemented an enterprise-scale application that runs on the cloud
called topic modeling. We took the most commonly used directed proba-
bilistic graphical model called latent Dirichlet allocation and built the Gibbs
sampling inference accelerator for it.
While the Gibbs sampling algorithm is inherently sequential by nature,
there has been much work on parallelizing it by algorithmic modifications.
For example, a graph coloring scheme is used in chromatic Gibbs sampling
which allows same-colored nodes to be sampled in parallel. We implemented
a multi-threaded C++ version using Pthreads for multi-core and CUDA ver-
sion for GPUs to achieve 1.9X and 22.9X speedups respectively.
Although both our Gibbs sampling inference algorithms for MRF and LDA
have gone through precision analysis to implement mixed precision architec-
tures that adjusts the number of fractional bits to keep all the arithmetics less
than or equal to 32-bit fixed-point even when dealing with really tiny frac-
tional probabilities, we believe there could be room for further optimizations.
Our preliminary results with very small floating-point arithmetics show that
applications solved using probabilistic graphical models are quite robust to
quantization noise in both decision error rate and also the qualitative result
of the final output of the applications. We believe that more aggressive re-
duction and careful design, possibly through using very small floating-point
compute, with or without modular voting architecture, will allow us to com-
press the hidden variables learned throughout the model and reduce power
of the overall system.
We believe that asynchronous Gibbs sampling, where the sequentiality is
broken by some mix of asynchronous updates to the model, is a very promis-
ing direction to move forward as preliminary results in the research commu-
nity have shown. The future architecture can make use of this observation
to drastically parallelize the samplers for huge performance gains.
76
References
[1] C. Charoensak and F. Sattar, “A single-chip FPGA design for real-time
ICA-based blind source separation algorithm,” in 2005 IEEE Interna-
tional Symposium on Circuits and Systems, vol. 6, May 2005, pp. 5822–
5825.
[2] K.-K. Shyu, M.-H. Lee, Y.-T. Wu, and P.-L. Lee, “Implementation of
pipelined FastICA on FPGA for real-time blind source separation,” Neu-
ral Networks, IEEE Transactions on, vol. 19, no. 6, pp. 958 –970, June
2008.
[3] J. Detrey and F. De Dinechin, “A parameterized floating-point expo-
nential function for FPGAs,” in Field-Programmable Technology, 2005.
Proceedings. 2005 IEEE International Conference on. IEEE, 2005, pp.
27–34.
[4] T. E. Tkacik, “A hardware random number generator,” in CHES, vol.
2523. Springer, 2002, pp. 450–453.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classifica-
tion with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems, 2012, pp. 1097–1105.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large
scale visual recognition challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211–252, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Sur-
passing human-level performance on imageNet classification,” in Pro-
ceedings of the IEEE International Conference on Computer Vision,
2015, pp. 1026–1034.
[8] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief
propagation and its generalizations,” Exploring Artificial Intelligence in
the New Millennium, vol. 8, pp. 236–239, 2003.
77
[9] C.-K. Liang, C.-C. Cheng, Y.-C. Lai, L.-G. Chen, and H. H. Chen,
“Hardware-efficient belief propagation,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 21, no. 5, pp. 525–537, 2011.
[10] W. Zhao, H. Fu, G. Yang, and W. Luk, “Patra: Parallel tree-reweighted
message passing architecture,” in 2014 24th International Conference on
Field Programmable Logic and Applications (FPL). IEEE, 2014, pp.
1–6.
[11] J. Choi and R. A. Rutenbar, “Hardware implementation of MRF map
inference on an FPGA platform,” in 2012 22nd International Conference
on Field Programmable Logic and Applications (FPL), Aug 2012, pp.
209–216.
[12] J. Choi and R. A. Rutenbar, “Video-Rate stereo matching using Markov
random field TRW-S Inference on a hybrid CPU+FPGA computing
platform,” IEEE Transactions on Circuits and Systems for Video Tech-
nology, vol. 26, no. 2, pp. 385–398, 2016.
[13] J. Choi and R. A. Rutenbar, “Configurable and scalable belief propaga-
tion accelerator for computer vision,” in 2016 26th International Confer-
ence on Field Programmable Logic and Applications (FPL), Aug 2016,
pp. 1–4.
[14] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy mini-
mization via graph cuts,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[15] T. Gao, J. Choi, S. n. Tsai, and R. A. Rutenbar, “Toward a pixel-
parallel architecture for graph cuts inference on FPGA,” in 2017 27th
International Conference on Field Programmable Logic and Applications
(FPL), Sept 2017, pp. 1–4.
[16] R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo
methods,” Department of Computer Science, University of Toronto
Toronto, Ontario, Canada, Tech. Rep. CRG-TR-93-1, Sep. 1993.
[17] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, “An introduc-
tion to MCMC for machine learning,” Machine learning, vol. 50, no. 1-2,
pp. 5–43, 2003.
[18] M. Bayes and M. Price, “An essay towards solving a problem in the
doctrine of chances. By the late Rev. Mr. Bayes, F.R.S. communicated
by Mr. Price, in a letter to John Canton, A.M.F.R.S.” Philosophical
Transactions (1683-1775), pp. 370–418, 1763.
78
[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Pub-
lishers Inc., 1988.
[20] T. Salimans, D. Kingma, and M. Welling, “Markov chain Monte Carlo
and variational inference: Bridging the gap,” in Proceedings of the 32nd
International Conference on Machine Learning (ICML-15), 2015, pp.
1218–1226.
[21] C. Geyer, “Introduction to Markov chain Monte Carlo,” Handbook of
Markov Chain Monte Carlo, pp. 3–48, 2011.
[22] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions,
and the Bayesian restoration of images,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
[23] IEEE Standard for Floating-Point Arithmetic, IEEE Std. 754-2008, Aug.
2008.
[24] E. C. Cherry, “Some experiments on the recognition of speech, with one
and with two ears,” The Journal of the Acoustical Society of America,
vol. 25, no. 5, pp. 975–979, 1953.
[25] “ITU-T Recommendation G.114: One-way transmission time,”
2003, Dept. Elect. Eng., Technion. [Online]. Available:
http://www.itu.int/rec/T-REC-G.114
[26] H. Holma and A. Toskala, LTE for UMTS: Evolution to LTE-advanced.
John Wiley & Sons, 2011.
[27] M. Kim, P. Smaragdis, G. G. Ko, and R. A. Rutenbar, “Stereophonic
spectrogram segmentation using Markov random fields,” in 2012 IEEE
International Workshop on Machine Learning for Signal Processing.
IEEE, 2012, pp. 1–6.
[28] G. G. Ko and R. A. Rutenbar, “A case study of machine learning
hardware: Real-time source separation using Markov random fields via
sampling-based inference,” in Acoustics, Speech and Signal Processing
(ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp.
2477–2481.
[29] G. G. Ko and R. A. Rutenbar, “Real-time and low-power streaming
source separation using Markov random field,” Journal of Emerging
Technologies in Computing Systems, 2018, to be published.
[30] J.-F. Cardoso, “Blind signal separation: statistical principles,” Proceed-
ings of the IEEE, vol. 86, no. 10, pp. 2009–2025, 1998.
79
[31] N. Roman, D. Wang, and G. J. Brown, “Speech segregation based on
sound localization,” The Journal of the Acoustical Society of America,
vol. 114, no. 4, pp. 2236–2252, 2003.
[32] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-
frequency masking,” IEEE Transactions on signal processing, vol. 52,
no. 7, pp. 1830–1847, 2004.
[33] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms,” International journal of
computer vision, vol. 47, no. 1-3, pp. 7–42, 2002.
[34] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov,
A. Agarwala, M. Tappen, and C. Rother, “A comparative study of en-
ergy minimization methods for Markov random fields with smoothness-
based priors,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 30, no. 6, pp. 1068–1080, 2008.
[35] Y. Y. Boykov and M. P. Jolly, “Interactive graph cuts for optimal
boundary & region segmentation of objects in N-D images,” in Pro-
ceedings Eighth IEEE International Conference on Computer Vision.
ICCV 2001, vol. 1, 2001, pp. 105–112 vol.1.
[36] V. Kolmogorov, “Convergent tree-reweighted message passing for energy
minimization,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 10, pp. 1568–1583, 2006.
[37] D. Koller and N. Friedman, Probabilistic graphical models: principles
and techniques. MIT press, 2009.
[38] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood
from incomplete data via the EM algorithm,” Journal of the Royal Sta-
tistical Society. Series B (methodological), pp. 1–38, 1977.
[39] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal
Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[40] R. M. Neal and G. E. Hinton, “A view of the EM algorithm that jus-
tifies incremental, sparse, and other variants,” in Learning in Graphical
Models. Springer, 1998, pp. 355–368.
[41] T. Minka, “Expectation-Maximization as lower bound maximization,”
1998. [Online]. Available: https://tminka.github.io/papers/minka-em-
tut.pdf
[42] F. Dellaert, “The expectation maximization algorithm,” Georgia Insti-
tute of Technology, Tech. Rep., 2002.
80
[43] S. L. Graham, P. B. Kessler, and M. K. Mckusick, “Gprof: A call graph
execution profiler,” in ACM Sigplan Notices, vol. 17, no. 6. ACM, 1982,
pp. 120–126.
[44] J. N. Mitchell, “Computer multiplication and division using binary loga-
rithms,” IRE Transactions on Electronic Computers, no. 4, pp. 512–517,
1962.
[45] R. Gutierrez and J. Valls, “Low cost hardware implementation of loga-
rithm approximation,” IEEE Transactions on Very Large Scale Integra-
tion (VLSI) Systems, vol. 19, no. 12, pp. 2326–2330, 2011.
[46] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement
in blind audio source separation,” IEEE Transactions on Audio, Speech,
and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[47] J. B. Dennis and D. P. Misunas, “A preliminary architecture for a basic
data-flow processor,” in ACM SIGARCH Computer Architecture News,
vol. 3, no. 4. ACM, 1975, pp. 126–132.
[48] J. R. Gurd, C. C. Kirkham, and I. Watson, “The Manchester prototype
dataflow computer,” Communications of the ACM, vol. 28, no. 1, pp.
34–52, 1985.
[49] D. E. Culler, “Dataflow architectures,” Annual review of computer sci-
ence, vol. 1, no. 1, pp. 225–253, 1986.
[50] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett,
“DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.
NIST speech disc 1-1.1,” NASA STI/Recon Technical Report n, vol. 93,
1993.
[51] D. Campbell, K. Palomaki, and G. Brown, “A Matlab simulation of
“shoebox” room acoustics for use in research and teaching,” Computing
and Information Systems, vol. 9, no. 3, p. 48, 2005.
[52] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5
simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2,
pp. 1–7, 2011.
[53] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in Proceedings of
the 42nd Annual IEEE/ACM International Symposium on Microarchi-
tecture. ACM, 2009, pp. 469–480.
81
[54] J. D. Bakos, “High-performance heterogeneous computing with the Con-
vey HC-1,” Computing in Science & Engineering, vol. 12, no. 6, pp.
80–87, 2010.
[55] T. M. Brewer, “Instruction set innovations for the Convey HC-1 com-
puter,” IEEE Micro, vol. 30, no. 2, pp. 70–79, March 2010.
[56] D. M. Blei and J. D. Lafferty, “Topic models,” Text mining: Classifica-
tion, Clustering, and Applications, vol. 10, no. 71, p. 34, 2009.
[57] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and
R. Harshman, “Indexing by latent semantic analysis,” Journal of the
American Society for Information Science, vol. 41, no. 6, p. 391, 1990.
[58] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala, “Latent
semantic indexing: A probabilistic analysis,” in Proceedings of the sev-
enteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems. ACM, 1998, pp. 159–168.
[59] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of
the Fifteenth Conference on Uncertainty in Artificial Intelligence. Mor-
gan Kaufmann Publishers Inc., 1999, pp. 289–296.
[60] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings
of the 22Nd Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, ser. SIGIR ’99. New York,
NY, USA: ACM, 1999, pp. 50–57.
[61] T. Hofmann, “Unsupervised learning by probabilistic latent semantic
analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, Jan. 2001.
[62] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003.
[63] C. Wang and D. M. Blei, “Collaborative topic modeling for recommend-
ing scientific articles,” in Proceedings of the 17th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining. ACM,
2011, pp. 448–456.
[64] N. Rasiwasia and N. Vasconcelos, “Latent Dirichlet allocation models
for image classification,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 35, no. 11, pp. 2665–2679, 2013.
[65] L. Liu, L. Tang, W. Dong, S. Yao, and W. Zhou, “An overview of topic
modeling and its current applications in bioinformatics,” SpringerPlus,
vol. 5, no. 1, p. 1608, 2016.
82
[66] L. Yao, D. Mimno, and A. McCallum, “Efficient methods for topic model
inference on streaming document collections,” in Proceedings of the 15th
ACM SIGKDD international conference on Knowledge discovery and
data mining. ACM, 2009, pp. 937–946.
[67] A. Smola and S. Narayanamurthy, “An architecture for parallel topic
models,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 703–
710, 2010.
[68] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings
of the National academy of Sciences, vol. 101, no. suppl 1, pp. 5228–
5235, 2004.
[69] F. Yan, N. Xu, and Y. Qi, “Parallel inference for latent dirichlet alloca-
tion on graphics processing units,” in Advances in Neural Information
Processing Systems, 2009, pp. 2134–2142.
[70] H. Zhao, B. Jiang, J. F. Canny, and B. Jaros, “Same but different: Fast
and high quality Gibbs parameter estimation,” in Proceedings of the
21th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, ser. KDD ’15. New York, NY, USA: ACM, 2015, pp.
1495–1502.
[71] J.-B. Tristan, J. Tassarotti, and G. Steele, “Efficient training of LDA on
a GPU by mean-for-mode estimation,” in Proceedings of the 32nd Inter-
national Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France:
PMLR, 07–09 Jul 2015, pp. 59–68.
[72] K. Li, J. Chen, W. Chen, and J. Zhu, “SaberLDA: Sparsity-aware learn-
ing of topic models on GPUs,” in Proceedings of the Twenty-Second In-
ternational Conference on Architectural Support for Programming Lan-
guages and Operating Systems, ASPLOS 2017, Xi’an, China, April 8-
12, 2017, 2017, pp. 497–509.
[73] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., “A
reconfigurable fabric for accelerating large-scale datacenter services,” in
2014 ACM/IEEE 41st International Symposium on Computer Architec-
ture (ISCA). IEEE, 2014, pp. 13–24.
[74] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers,
M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim et al., “A
cloud-scale acceleration architecture,” in Microarchitecture (MICRO),
2016 49th Annual IEEE/ACM International Symposium on. IEEE,
2016, pp. 1–13.
83
[75] D. Clark, “Intel Complete Acquisition of Altera,” December 2015.
[Online]. Available: http://www.wsj.com/articles/intel-completes-
acquisition-of-altera-1451338307
[76] D. Newman, A. Asuncion, P. Smyth, and M. Welling, “Distributed algo-
rithms for topic models,” Journal of Machine Learning Research, vol. 10,
no. Aug, pp. 1801–1828, 2009.
[77] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun, “Plda+: Parallel latent
Dirichlet allocation with data placement and pipeline processing,” ACM
Transactions on Intelligent Systems and Technology (TIST), vol. 2,
no. 3, p. 26, 2011.
[78] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola,
“Scalable inference in latent variable models,” in International Confer-
ence on Web Search and Data Mining (WSDM), 2012, pp. 123–132.
[79] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao,
C. Law, and J. Zeng, “Peacock: Learning long-tail topic features for
industrial applications,” ACM Transactions on Intelligent Systems and
Technology (TIST), vol. 6, no. 4, p. 47, 2015.
[80] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y.
Liu, and W.-Y. Ma, “LightLDA: Big topic models on modest computer
clusters,” in Proceedings of the 24th International Conference on World
Wide Web. ACM, 2015, pp. 1351–1361.
[81] X.-H. Phan and C.-T. Nguyen, “GibbsLDA++: A C/C++ implemen-
tation of latent Dirichlet allocation (LDA),” 2007.
[82] Xilinx, “Xilinx Virtex UltraScale FPGA VCU110 Development Kit,”
2017. [Online]. Available: https://www.xilinx.com/products/boards-
and-kits/dk-u1-vcu110-g.html
[83] C. De Sa, K. Olukotun, and C. Ré, “Ensuring rapid mixing and low bias
for asynchronous Gibbs sampling,” in Proceedings of the 33rd Interna-
tional Conference on Machine Learning - Volume 48. JMLR, 2016, pp.
1567–1576.
[84] D. J. Wilkinson, “Parallel Bayesian computation,” Statistics Textbooks
and Monographs, vol. 184, p. 477, 2006.
[85] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin, “Parallel gibbs sam-
pling: From colored fields to thin junction trees,” in Proceedings of
the Fourteenth International Conference on Artificial Intelligence and
Statistics, 2011, pp. 324–332.
84
[86] M. Greenberger, “Notes on a new pseudo-random number generator,”
J. ACM, vol. 8, no. 2, pp. 163–167, Apr. 1961.
[87] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,
Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed.
New York, NY, USA: Cambridge University Press, 2007.
[88] J. Y. F. Tong, D. Nagle, and R. A. Rutenbar, “Reducing power by
optimizing the necessary precision/range of floating-point arithmetic,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 8, no. 3, pp. 273–286, 2000.
[89] F. Fang, T. Chen, and R. A. Rutenbar, “Lightweight floating-point
arithmetic: Case study of inverse discrete cosine transform,” EURASIP
Journal on Applied Signal Processing, vol. 2002, no. 1, pp. 879–892,
2002.
[90] Y. Lee, Y. Choi, S.-B. Ko, and M. H. Lee, “Performance analysis of bit-
width reduced floating-point arithmetic units in FPGAs: A case study
of neural network-based face detector,” EURASIP Journal on Embedded
Systems, vol. 2009, no. 1, p. 1, 2009.
[91] J. Von Neumann, “Probabilistic logics and the synthesis of reliable or-
ganisms from unreliable components,” Automata Studies, vol. 34, pp.
43–98, 1956.
[92] W. G. Brown, J. Tierney, and R. Wasserman, “Improvement of
electronic-computer reliability through the use of redundancy,” IRE
Transactions on Electronic Computers, no. 3, pp. 407–416, 1961.
[93] I. Koren and S. Y. H. Su, “Reliability analysis of N-modular redundancy
systems with intermittent and permanent faults,” IEEE Transactions on
Computers, no. 7, pp. 514–520, 1979.
[94] N. Vaidya and D. K. Pradhan, “Fault-tolerant design strategies for high
reliability and safety,” IEEE Transactions on Computers, vol. 42, no. 10,
pp. 1195–1206, 1993.
[95] E. P. Kim and N. R. Shanbhag, “Soft N-modular redundancy,” IEEE
Transactions on Computers, vol. 61, no. 3, pp. 323–336, 2012.
85
