Logically Synthesized, Hardware-Accelerated, Restricted Boltzmann
  Machines for Combinatorial Optimization and Integer Factorization by Patel, Saavan et al.
LETTER
Logically Synthesized, Hardware-Accelerated,
Restricted Boltzmann Machines for
Combinatorial Optimization and Integer
Factorization
S. Patel,1 P. Canoza,1 S. Salahuddin1
1Department of Electrical Engineering and Computer Sciences, University of
California, Berkeley, California 94720, USA
The Restricted Boltzmann Machine (RBM) is a stochastic neural
network capable of solving a variety of difficult tasks such as NP-
Hard combinatorial optimization problems and integer factoriza-
tion. The RBM architecture is also very compact; requiring very
few weights and biases. This, along with its simple, paralleliz-
able sampling algorithm for finding the ground state of such prob-
lems, makes the RBM amenable to hardware acceleration. How-
ever, training of the RBM on these problems can pose a significant
challenge, as the training algorithm tends to fail for large problem
sizes and efficient mappings can be hard to find. Here, we pro-
pose a method of combining RBMs together that avoids the need
to train large problems in their full form. We also propose meth-
ods for making the RBM more hardware amenable, allowing the
algorithm to be efficiently mapped to an FPGA-based accelerator.
Using this accelerator, we are able to show hardware accelerated
factorization of 16 bit numbers with high accuracy with a speed
improvement of 10000x and a power improvement of 32x.
As the demand for processing of big data accelerates, new archi-
tectures and computing paradigms are receiving heightened attention.
Many of the most challenging computational problems fall into the cat-
egory of NP-Hard and NP-Complete, where no exact polynomial time
solution exists. For this reason, these problems would benefit the great-
est from acceleration through novel architectures on FPGAs, ASICs,
and new devices. 1, 2 In this context, the Ising Problem has long been
known to be in the class of NP-Hard problems. Because of this, a large
class of combinatorial optimization problems can be reformulated as
Ising problems and solved by finding the ground state of that system 3–5.
The Boltzmann Machine 6 was originally introduced as a constraint sat-
isfaction network based on the Ising model problem, where the weights
would encode some global constraints, and stochastic units were used
to escape local minima. The original Boltzmann Machine found fa-
vor as a method to solve various combinatorial optimization problems
7. However, convergence of the sampling and training schemes on the
Boltzmann Machine has been slow. 8
To this point, the Restricted Boltzmann Machine (RBM), which
is a special form of the more generic Boltzmann Machine, provides a
scalable hardware architecture by eliminating intra-layer connections,
while maintaining the ability to fully approximate a probability distri-
bution over binary variables 8. Nonetheless, even for the RBM, the
convergence of the Markov Chain Monte Carlo (MCMC) algorithm,
which is typically used for sampling, has proved to be challenging 9, 10.
On the other hand, the structure of the RBM lends itself to hardware
acceleration. Therefore, just in the same way the continued advance-
ment of the Moore’s Law has enabled a revolution in the deep learn-
ing community, it is conceivable that hardware acceleration will make
it possible to overcome the sampling problem in the RBM and allow
rapid solutions of combinatorial optimization problems.
In this work, we present an end-to-end RBM implementation that
combines advances in training, model quantization and an efficient
hardware implementation for inference to demonstrate substantial ac-
celeration over standard CPU and GPU implementations. First, we pro-
pose a generative model composed of multiple learned modules that is
able to solve a larger problem than the individually trained parts. This
allows for circumventing the problem of training large modules, thus
minimizing training time and enabling a higher degree of model ac-
curacy. The advancement in training is accomplished through a novel
merging procedure, which is based on the idea of training two models
with overlapping states and then combining them along the intersection
states. This is similar to combining circuits in a digital logic, with the
output of one circuit sharing the same state as the input of the next. We
have combined this method of training large models with algorithmic
improvements that allow for compressed weights and therefore enables
the use of lower precision representation to develop an efficient FPGA
based accelerator. Altogether, the RBM shows approximately a 104X
speed acceleration and 32X power improvement for a 16 bit integer fac-
torization problem, which is equivalent to solving a 232 phase space, in
comparison to implementing the same algorithm in the CPU or GPU.
Algorithm Details The RBM is a binary stochastic neural net-
work, which can be understood as a Markov random field of Bernoulli
random variables divided into a bipartite graph structure, with the two
layers called the visible states and the hidden states, graphically demon-
strated in Figure 1 A). We denote v as the visible state, h as the hidden
states, and E(v, h) as the energy associated with those states. The
probability assigned to a given state p(v, h) = 1
Z
e−E(v,h) where
Z =
∑
v,h e
−E(v,h) is the normalizing constant of the distribution.
The weight matrix W and biases a and b create connections between
the hidden and visible layers and create the probability distribution.
The bipartite structure means the visible and hidden state can be
factored as p(v|h) =∏i p(vi|h) and p(h|vi) =∏j p(hj |v) due to the
conditional independence of states within the same layer. Sampling on
the RBM is performed via Block Gibbs Sampling 8, 11, where the units
in each layer are sampled in parallel. In Figure 1 B) we show how the
RBM preferentially move to higher probability states and stochastically
moves through the state space. Each unit has a sigmoidal activation
1
ar
X
iv
:2
00
7.
13
48
9v
1 
 [c
s.L
G]
  1
6 J
un
 20
20
with activation probability p(vi = 1|h) = σ(wTi h + bi) and σ(x) =
(1 + e−x)−1.
Merging RBMs The difficulty of training large RBMs means that new
innovations are necessary at the training step. In this work, we pro-
pose merging smaller models, that are already trained, to form an ini-
tial condition for larger models as a way of improving the training
of large RBMs. This methodology is inspired from the digital logic
where larger functions can be constructed by combining small func-
tional blocks 12. Notably, all NP-hard problems can be formulated
through the Boolean Satisfiability problem. Therefore, constructing the
RBM in the aforementioned way provides a natural approach to solv-
ing hard optimization problems. An example of this is shown in Figure
1 F) where we solve a toy example of a 3SAT Boolean Satisfiability
problem. This shows that this method has the representational power
to solve a wide variety of NP-Hard problems.
Merging is performed by combining RBMs along their common
visible neuron connection. We show the mechanics of the merging
process in Figure 1 C) and D) where we combine digital logic gates
together in this manner. Merging across the visible neurons like this
retains the bi-partite, product of experts nature of the RBM while giving
the expected distribution if we were combining gates to perform logical
synthesis. Using this as inspiration, we construct adders and multipliers
as shown in Figure 1 E) to combine trained n-bit adder and multiplier
units into 2n-bit multiplier units. Detailed information of this merging
protocol has been provided in the Methods section.
After smaller models have been trained and merged, we retrain the
larger models to fine-tune them. As shown in Methods, Equation 11,
the models are good approximations for the correct distribution we are
interested in, and provide a good initial conditions for training of the
final model. Importantly, merging in this way retains the bi-directional
nature of the network. The same RBM can be queried to solve what the
output is for a given input (the “forward” direction), and queried what
outputs caused a given input (the “reverse” direction). In figure 2 A),
B), and C) we see the consequence of the bi-directional nature where
the same model can perform multiplication, division and factorization.
We additionally show in those figures that by merging and retraining
we get significantly better performance than training alone or merging
alone.
FPGA AccelerationThe massive parallelism present in the RBM
algorithm makes it especially efficient on the FPGA. The RBM algo-
rithm also doesn’t contain any branches or explicit memory accesses
while sampling, removing expensive branch misprediction cycles and
DRAM fetch cycles. Furthermore, unlike other deep neural network
accelerators, this algorithm is not memory bandwidth limited for any
of its operation 13 as can be seen by the FPGA utilization table (Table
1 in Supplementary Section), further increasing the algorithmic perfor-
mance on hardware. The bipartite nature means that many neurons can
be sampled in parallel on the FPGA, allowing us to perform each neu-
ron activation probability in parallel. There has been much work on
accelerating RBM training through FPGA implementations 14–16, but
by focusing on inference only, we reduce the necessary hardware re-
quirements to the essential components, fully unlocking the inherent
parallelism in the network architecture.
Model Quantization In addition to taking advantage of the inher-
ent parallelism, to fit larger models on the FPGA we have performed
model quantization to be able to lower precision during the inference.
There has been much work on model quantization for deep neural net-
works, however most of them focus on Convolutional Neural Networks
and Multi-Layer Perceptrons 17–19, which cannot be directly used for
RBMs. We developed a scheme for model quantization, explicitly for
RBMs, which is accomplished via 2 additional training steps. The first
is adding a constraint to the maximum value of the weight and retrain-
ing with this constraint. This makes sure that the weight magnitude
cannot overflow the fixed point representation that we are trying to ac-
complish. As demonstrated in Figure 2 D), this does not change the
overall accuracy of the model, as the retraining causes more weights of
smaller magnitude to compensate for a single weight of large magni-
tude. The second retraining step adds an extra quantization loss term
to the training step (see Eqn. 12 in Methods). In Figure 2 E) we show
that we can quantize from 32 bit floating point to 6 bit fixed point with-
out large loss in error. Figure 2 F) shows the final weights distribution
after these two quantization training steps, demonstrating how the re-
training preferentially pushes weights towards their post-quantization
values. The exact training steps are detailed in the Methods section.
Matrix Multiplication with Mask With the use of a fixed point rep-
resentation of the model parameters and our restriction of binary node
values, the matrix multiplication step of the algorithm is greatly sim-
plified. Instead of instantiating many binary multiplication circuits, we
can perform the computation by passing the weight values through a bi-
nary mask, or a series of 2-to-1 muxes, created by the node values. We
combine this with the use of fixed point rather than floating point cal-
culations, and we see a large decrease in area cost and FPGA resource
usage. The estimated area cost of 32 bit floating point multiplication
is 27x that of an 8 bit multiplication, and an 8 bit multiplication is 8x
costlier than an 8 bit adder. This shows that the benefits of getting rid
of the multiplications is very large, allowing more of the calculation to
happen on the FPGA in one cycle 20.
Sigmoid Approximation Exact calculation of the sigmoidal activation
function f(x) = 1
1+e−x is computationally expensive. To accomplish
direct calculation, at least 3 extra hardware instructions are needed, ex-
ponentiation, addition, and division, which all incur a large hardware
cost both in terms of latency and area. Instead, binary sigmoid values
are precomputed and enumerated in a look up table (LUT) for use in
the FPGA. This implementation allows for fast evaluation of the ac-
tivation function without expensive hardware resources. After matrix
multiplication and bias addition, the computed value is passed through
the LUT based activation function to approximate the sigmoid.
Pseudo-Random Number Generation We have found that the qual-
ity of the random numbers is not important to the algorithm working
effectively. The only necessary condition was that the random numbers
were not correlated between individual neurons. To accomplish this
we use the relatively cheap 32 bit length Linear Feedback Shift Reg-
ister (LFSR) pseudo random number generator to create high quality
random numbers. Each neuron has its own LFSR and is seeded with a
different value to minimize the possibility of correlation. Longer sam-
pling runs (> 108 samples) would cause the neuron LFSRs to start
looping, and correlate samples at the start of the run with later samples
but this can be simply mitigated by increasing the length of the LSFR.
Results and Discussion
Trained Model Performance By merging together small RBMs using
the principles of logic synthesis, many complex and NP-Hard problems
can be solved using the bi-directional nature of RBMs. In Figure 1
D) we see the mechanics of this procedure, while Figure 1 E) shows
how it can apply to integer factorization and Figure 1 F) shows how
it can apply to 3SAT and boolean satisfiability. These problems are at
the heart of many computationallly difficult problems. We additionally
show how this type of training can outperform directly trained models.
Figure 2 A), B) and C) show that merging smaller models and retraining
them drastically outperforms directly training the model, with errors
close to 3-10x less for problems of interest.
To take advantage of the parallel resources of the FPGA and un-
lock hardware performance, we have shown generalizable methods for
quantization that can be applied to all RBM problem instances. This
is shown in figure 2 D), E) and F) where we show that adding a max
2
weight constraint followed by quantization loss while training can lead
to better performance once quantized on the FPGA. By showing sig-
nificant model performance down to 6 bit integer representations, we
demonstrate that this approach allows for hardware efficient model rep-
resentations for RBMs.
Scaling We have demonstrated scaling of the factorization algorithm
up to 16 bit numbers. Markov Chain Monte Carlo based sampling
methods for optimization problems fall into the class of "Stochastic
Local Search" and are expected to have exponential scaling with prob-
lem size 21. This exponential scaling dependence is shown in Figures
3 C) and D). Although this is the case, we see a 104 constant fac-
tor speed increase when the algorithm is implemented on the FPGA
in Figure 3 B) in 16 bit factorization and Figure 3 E) across all prob-
lem instances. This massive speed increase across the whole spectrum
of bit sizes has real world consequences, as it implies that other algo-
rithms mapped onto this general framework can become very efficient
in finding ground state solutions which would otherwise be difficult to
obtain.
The FPGA implementation of our sampling algorithm has shown
a 104 speed increase and a 32x power decrease on the sampling algo-
rithm. This is compared to a dual CPU machine, running the industry
standard PyTorch machine learning framework, which is highly opti-
mized. Although our implementation takes up much of the resources
of the FPGA, there are many possible areas where our design could
be modified to scale for performance. Our goal was not to create the
most optimized hardware design, but to demonstrate that parallel hard-
ware running our very hardware friendly algorithm had the potential for
drastic improvement. With focused effort on the improvement of the
hardware architecture 22–24 the speed and performance improvement is
expected to get much larger.
Although we demonstrate sampling speed increase for inference,
using the same FPGA accelerated sampling algorithm can also work
for decreasing training time. The major bottleneck in the training step
is creating a series of uncorrelated samples, which takes a large number
of samples for a highly correlated sampler. Using FPGA acceleration of
the sampling algorithm could give a lower variance estimate of model
probabilities in a much faster and energy efficient manner, than those
provided by a CPU or GPU.
Time Domain Analysis Markov Chain Samplers (and specifically the
Gibbs Sampler we are using) have been proven to converge in a ge-
ometric rate with the number of samples. In addition, the distance in
total variation between the sampled distribution and the model distribu-
tion strictly decreases with each Gibbs sampling step 25. This implies
that the quality of our solution should increase as the sampled distribu-
tion approaches equilibrium. This is especially useful for optimization
problems where run time can be traded directly for better solutions.
The time domain sampler also shows a large difference between
time taken to model the full distribution (the mixing time) and the time
taken to sample the correct factors once (the hitting time)(see Fig. 4 A
and B in comparison to Figure 3 C and D). A demonstrative example
of this phenomenon is shown directly in Figure 4 C), with the differ-
ence in times being close to 4x. By adding in a sample verification
methodology, the time taken to identify correct factors can be drasti-
cally decreased. For NP-Hard problems, the solution can be verified in
polynomial time but finding a given solution is done within exponential
time, a feature which is also present in this factorization problem. This
means the relative overhead of checking of factors is relatively low, and
can be done on a regular basis to reduce the sampling time significantly.
This method of adding heuristics to the sampler is present in many dif-
ferent problem types, and must be done differently for each type of
problem. Here we demonstrate both that the sampler approaches the
correct distribution, and that effective heuristics using the hitting time
are available to reduce the time to solution.
ConclusionsOur choice of looking at an integer factorization prob-
lem stems from two motivations: (i) as we are exploring a new training
methodology, the factorization problem gives access to large amount of
training data without worrying about quality and (ii) at the same time
integer factorization is also an example of a hard problem. It should
be noted, nonetheless, that many combinatorial optimization problems
can be broken down into associated sub-problems, and solved using
a greedy approach (i.e. using the nearest neighbor approach in the
Travelling Salesman Problem, or multiplying single digits in a larger
multiplication, or evaluating one Boolean logic statement in a Boolean
satisfiability problem as shown in Figure 1). Using a greedy approach
(such as the one used in the travelling salesman problem with nearest
neighbors) can produce non-optimal results, and evaluating all permu-
tations of possible solutions to find the most optimal one can be com-
putationally intractable in a large problem space. By combining these
sub-problems using the method proposed here, we bypass the prob-
lems associated with those two approaches. We encode possible solu-
tions as a probability indicating its local optimality, and combine these
sub problems by merging the visible units of their RBMs. This com-
bination mechanism multiplies the probabilities such that the solution
with global optimality is encoded as the mode of the distribution mod-
eled by the larger RBM. In addition, as the phase space of the problem
space is 2v where v is the number of visible units, we can also encode
a large problem space using minimal units and decrease the hardware
cost of the approach. As the Boolean satisfiability problem (shown by
the 3SAT example) can be mapped to solving a variety of NP-Hard
problems 26, 27, it is expected that variants of this RBM framework can
be applied to a large variety of graph based optimization tasks such as
MAX-CUT, the Travelling Salesman Problem, Quadratic Assignment
problem and any problem that can be mapped to a series of binary vari-
ables.
Much work has been done to develop accelerators for the Ising
Model problem, such as specialized ASICs 28–30, FPGA designs 31, 32,
Memristor based accelerators 33, 34, Quantum Mechanical Accelerators
based on quantum adiabatic processes 35, Optical Parametric Oscilla-
tors 36, 37, Magnetic Tunnel Junction 12, 38, 39 and many others. Our work
is distinct from these existing results in that, rather than focusing on
a specific emerging hardware, we demonstrated that, by combining a
novel methodology to construct large models from smaller building
blocks with a hardware implementation that exploits the intrinsically
parallel architecture and sparsity of the RBM, orders of magnitude im-
provement in the speed and power is achievable for a sufficiently large
problem. At the same time, our approach could also benefit from the
emerging hardware proposed in the aforementioned reports for further
improvement in speed and energy. Further scaling can also be accom-
plished by moving to Deep Boltzmann machines 40, 41, which may be
easier to train. While we demonstrated an inference problem, the un-
derlying method is equally applicable and expected to accelerate train-
ing problems 20, 24, 42. Substantial acceleration of generative models
such as the RBM could lead to unsupervised, life-long, learning ma-
chines.
3
v1 v2 v3 v4 v5
h1 h2 h3 h4
5
4
4x5
A) B)
C)
AND
OROR OR OR
Out
h1 h2
A A
v1 v2 v3
A AA
h2
h1
v1
v2
v3
A
A
A
A
A h2
h1
v1
v2
v3
O
O
O
O
O
h1 h2
O O
v1 v2 v3
O OO
h1 h2
A A
v1 v2 v3
A AA
h2
h1
v1
v2
v3
A
A
A
A
A +
h1 h2
O O
v1 v2 v3
O OO
h2
h1
v1
v2
v3
O
O
O
O
O
+
=
=
h2
h1
v2
v3
O
O
O
O
h2
h1
v1
v2
A
A
A
A v3/1
A/O
h1 h2
A A
v1 v2 v3/1
A A/OA
h1 h2
O O
v2 v3
OO
D)
E)
F)
x
+A
dd
M
ul
t
Mult
n n
2n
nn
Mult Mult
Mult Mult
Add Add
Add AddAdd
Figure 1: . Demonstration of RBM structure and sampling algorithm
(A) Structure of the RBM neural network. The Restricted Boltzmann Machine is a binary neural network structured in a bipartite graph structure.
This bipartite structure makes sampling parallelizable, as all the visible nodes can be sampled in parallel given the hidden node states (and vice
versa) (B) The RBM maps out the non-convex state space of a probability distribution. Low energy states map to high probability states which
the network identifies through a Markov Chain Monte Carlo (MCMC) algorithm. These low energy states correspond to correct answers for the
problem being mapped onto the RBM. (B) A graphical mapping of RBMs to gate level digital circuits. The visible nodes correspond to the
inputs and outputs of the logic gate, and the hidden nodes are the internal representation of the logic gate. (D) Graphical Demonstration of the
merging procedure, showing how two RBMs which represent an AND gate and an OR gate can be merged together to form a connection, similar
to what would be done when synthesizing digital logic. (E) Using this idea, we can create arbitrary adders and multipliers by merging together
smaller units to create the logical equivalent of larger units. We use the same methods developed in digital circuits to put together these larger
units. The leftmost image shows how we create a 2n bit multiplier using n bit multiplications and n bit additions. The color coding shows how the
partial products are broken apart amongst the adders and multipliers. To the right of that we show how we perform 4, n-bit input 2n-bit output
multiplications, and then accumulate the result. This method uses 4 multipliers and 5 adders and scales polynomially with the number of bits on
the output. (F) Using this strategy of merging logical units to solve a simple 3SAT, Combinatorial Optimization problem. We note that the logic
gates formed in part (c) can be operated in reverse to find solutions to the given satisfiability problem with 3SAT. This is the canonical example
of an NP-Hard problem, which maps directly to all other NP-Hard problems26, 27. This shows that these networks have the representational power
to solve a variety of NP-Hard problems.
4
100 101 102 103 104 105 106 107
# Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
16 Bit Multiplication Problem
Trained 16 bit
Merged 8 bit
Retrained 8 bit
100 101 102 103 104 105 106 107
# Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
16 Bit Division Problem
Trained 16 bit
Merged 8 bit
Retrained 8 bit
100 101 102 103 104 105 106 107
# Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
16 Bit Factorization Problem
Trained 16 bit
Merged 8 bit
Retrained 8 bit
100 101 102 103 104 105 106 107
# Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
Factorization Performance after 
 Retraining with Max Weight Constraint
No Retraining
w/Max Weight Constraint
100 101 102 103 104 105 106 107
# Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
Factorization Performance after 
 Retraining with Quantization Loss
32 bit Floating Point
6 bit, No Retraining
6 bit, with Retraining
4 2 0 2 4 6
Weight Value
100
101
102
103
104
Fr
eq
ue
nc
y
Weights distribution before and 
 after Quantization Retraining
Before
After
0 50 100 150 200 250
Factor
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Pr
ob
ab
ilit
y
CPU Factorization of 191 x 223 = 42593
0 50 100 150 200 250
Factor
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Pr
ob
ab
ilit
y
FPGA Factorization of 181 x 241 = 43621
A) B) C)
D) E) F)
G) H)
Figure 2: Performance on 16 Bit Multiplication, Division and Factorization
(A), (B), (C) Showing the performance on multiplication, division, and factorization. The same model trained to do multiplication can operate
in reverse and perform division and factorization tasks, as it learns the full joint distribution over variables. In these tasks, the model must get
the exact solution as the mode of the sampled distribution. We sample for the given number of samples and check whether the mode of the
sampled distribution corresponds to the correct answer to the problem of interest. (D) Showing effect of retraining with a maximum weight
constraint. Here we see no performance degradation due to retraining the module by adding an extra constraint on the maximum value of any
weight. (E) Retraining the network with added L1 quantization loss. We see that by taking the maximum weight constraint network and naively
applying 6 bit quantization without retraining, the network performs very poorly. However, if we retrain for quantization, then only we see a
large performance increase. This provides an avenue to effectively quantize RBM weights to a lower bit representation. (F) A histogram of the
RBM weights before and after retraining for quantization. We see the network is strongly clustered around the 6 bit values, which is caused by
the extra quantization loss term. This extra term also allows the network to fix the error due to quantization with contrastive divergence training.
(G) An example of a factorized distribution showing factorization of a 16 bit number into its two prime factors. The sampled distribution shows
clear peaks at the two correct answers to the factorization problem. The problem shown here, the factorization of a semi-prime number into
its two co-primes, is at the heart of the RSA cryptography algorithm and is the basis of most of modern encryption systems. (H) An example
probability mass function showing what the sampling procedure would return after running on the FPGA for 107 samples. Although this shows
that there is greater error in the incorrect factors as compared to part (G), there are still two clear peaks in the distribution indicating the correct
factors. The error is primarly caused by quantization and other approximations performed on the FPGA. This increase in error is accompanied
by massive increases in speed and power efficiency.
5
101 103 105 107
Samples
0.0
0.2
0.4
0.6
0.8
1.0
p c
or
re
ct
FPGA vs. CPU Sample Efficiency
Accelerator
FPGA
CPU
10 7 10 5 10 3 10 1 101 103
Time (s)
FPGA vs. CPU Time Efficiency
Accelerator
FPGA
CPU
4 6 8 10 12 14 16
Factorization Bits
100
101
102
103
104
105
106
Sa
m
pl
es
 to
 g
iv
en
 a
cc
ur
ac
y
CPU Scaling of Sampling Algorithm
Accuracy
0.1
0.3
0.5
0.7
4 6 8 10 12 14 16
Factorization Bits
FPGA Scaling of Sampling Algorithm
Accuracy
0.1
0.3
0.5
0.7
4 6 8 10 12 14 16
Factorization Bits
10 6
10 5
10 4
10 3
10 2
10 1
100
101
102
103
Ti
m
e 
(s
) t
o 
70
%
 a
cc
ur
ac
y
Time Scaling of FPGA compared to CPU
Accelerator
FPGA
CPU
A) B)
C) D)
E)
Figure 3: Performance of the FPGA implementation vs the CPU implementation on factorization
The sampling algorithm scales approximately exponentially with the bit size (and approximately linearly with the phase space). Stochastic Local
Search and Markov Chain Monte Carlo algorithms are known to have this exponential scaling dependency. We see a 104 speed improvement
across all model sizes. (A) The sample efficiency of the FPGA implementation is similar to the CPU implementation, even after quantizing to
8 bit weights and biases and using the various approximation schemes detailed. (B) When the time taken to reach a solution is scaled for the
FPGA vs. the CPU, the FPGA outperforms the CPU by a factor of 104, showing the massive speed improvements possible. This is done at a
70Mhz clock speed, compared to the much faster 2.4Ghz of the CPU. (C) The scaling of the algorithm when measured at various accuracy levels
on the CPU. The RBM for each bit number is run until it hits the given accuracy on a set of random factorization problems. (D) Scaling of the
sampling algorithm when run on the FPGA. The difference in sample number from part (C) is due to approximations necessary to efficiently port
the model onto the FPGA. In general, the FPGA takes a constant factor more samples than the CPU at a given bit count. (E) Time scaling of the
factorization problem measured at the 70% accuracy level. We see that the FPGA performs 4 orders of magnitude faster across all bit counts for
the outlined sampling algorithm.
6
0 200000 400000 600000 800000 1000000
Samples
0
5
10
15
20
Nu
m
be
r o
f h
its
Factorization of 179*211=37769, 
 Hitting Time (1000 instances)
0 20000 40000 60000 80000 100000
Samples
0
2
4
6
8
Nu
m
be
r o
f h
its
0 200000 400000 600000 800000 1000000
Samples
0
200
400
600
800
Cu
m
ul
at
iv
e 
Nu
m
be
r o
f h
its
Factorization of 179*211=37769, 
 Cumulative Hitting Time (1000 instances)
0
100
200
Fa
ct
or
 1
     First Hit of 
 Correct Factors
Mode Correctly 
    IdentifiedTime Series Analysis 
 Factorization of 179 * 211 = 37769
0 200000 400000 600000 800000 1000000
Sample Number
0
100
200
Fa
ct
or
 2
A) B)
C)
Figure 4: Time Domain Analysis of 16 Bit Factorization Algorithm
(A) Hitting time histogram for the factorization of a 16 bit number. The experiment is repeated 1000 times, and aggregate results are shown for
this. This shows that most hits happen fairly early, with a long tail to the distribution. On the bottom panel, when zooming into the hits that
occur in the first 100000 samples, we see that there are a large number of hits that occur in the first bin. (B) Histogram showing the cumulative
distribution for the hitting time on the 16 bit factorization task. The smoothness of this distribution can allow us to fit an empirical model for
the samplers and effectively parallelize by starting many samplers. (C) A trace plot showing the time domain behavior of the sampler. The
sampler stochastically explores the state space, but preferentially remains in the minimal energy state. This means that the first hit of the correct
factors occurs much quicker than the minimal energy/maximal probability state is correctly identified. This can lead to effective early stopping
heuristics if checking of the solution can be done quickly compared to the cost of producing further samples.
7
METHODS
Model Training In this paper we trained the RBMs by contrastive diver-
gence as described by 8. Each of the models in this paper were validated
by checking their performance on the problem they were trying to solve
(i.e addition and subtraction for an adder, multiplication and factorization
for a multiplier). This method was also used to assess model complexity
(i.e number of hidden units) and evaluate learning parameters (learning rate,
batch size, etc.). The final results for RBM sizes and approximate training
times are shown in Table 2. Models were trained using the PyTorch library
in Python, which is one of the standard libaries used in Machine Learning
research and production code.
Training was conducted on a computer with 2 Intel Xeon E5-2620 pro-
cessors, and 2 Nvidia Titan V GPUs. Each RBM was trained until the test
error stopped decreasing, and the model had converged to a best solution.
Training rate was varied across models and tuned as a hyperparameter. The
training set was copies of all possible multiplication or addition problems,
encompassing the full state space. In the case that the state space was too
large to train on all of it (such as the 16 bit and 32 bit adders), a random sam-
ple of the training set was used, and a new random sample was reinitialized
every epoch of training.
The training time tends to increase with the number of bits in the adder
or multiplier due to both the size of the data set (which increases exponen-
tially with the number of bits) and the number of hidden and visible units
(which both increase approximately linearly).
At the 16 bit adder level, the size of the data set was so large that the
entire data set could not be used for training (≈ 8 billion data points) and
a randomized sample of the set had to be taken. As generalization is not
perfect, we can attribute the decrease in their performance to this fact. For
the 32 bit adder, this problem was exacerbated, and the 32 bit adder was
outperformed by most units even after training for a full week.
For multipliers, the 8 bit multiplier has a tractable amount of data, but a
good joint density model could not be formed even after a large training time.
We believe this is due to an inherent difficult in the multiplication problem
that is not present in the addition problem. The multiplication model has
a higher number of hidden units as there is not as distinct of a correlation
between higher level bits in the 8 bit multiplication problem as there is in
the addition problem, first level correlations (as an RBM with 1 layer of
hidden units would find) are more difficult to find. We believe that using
deep boltzmann machines might help fix the problem of training in large
multipliers.
Merging RBMs Given two RBMs we wish to merge along a common con-
nection (see Figure 1) with the following parameters: WA ∈ Rn×r and
WB ∈ Rm×s, visible biases bA ∈ Rn and bB ∈ Rm, and hidden biases
aA ∈ Rr and aB ∈ Rs. The energies and probabilities of these are as
follows:
EA(v, h) = −vTWAh− aTAh− bTAv; pA(v, h) =
1
ZA
e−EA(v,h)
EB(v, h) = −vTWBh− aTBh− bTBv; pB(v, h) =
1
ZB
e−EB(v,h)
(1)
We can write the weight matrices as a series of row vectors correspond-
ing to one visible unit’s connections to a set of hidden units.
WA =
 w
A
1
...
wAn
 , WB =
 w
B
1
...
wBm
 , (2)
With this definition, the merge operation is shown below. If unit k
of RBM A is merged with unit l of RBM B the associated weight ma-
trix WA+B ∈ R(n+m−1)×(r+s) , visible bias vA+B ∈ Rn+m−1
and hidden bias hA+B ∈ Rr+s dictate the probabilities and energies for
the merged RBM. Merging multiple units between these two RBMs corre-
sponds to moving multiple row vectors from WB to WA, which creates
the associated decrease in dimensionality of WA+B and bA+B (where
WA+B ∈ R(n+m−d)×(r+s) and vA+B ∈ Rn+m−d where d is the
number of merged units.
WA+B =
WA
0
wBl
0
0 WB\l
 =

wA1
...
wAk
...
wAn
0
wBl
0
0
wB1
...
wBl−1
wBl+1
...
wBm

(3)
bA+B =

bA1
...
bAk + b
B
l
...
bAn
bB1
...
bBl−1
bBl+1
...
bBm

aA+B =

aA1
...
aAr
aB1
...
aBs

(4)
Below, we show how this relates to the original energies and proba-
bilities. The vectors v and h correspond to the visible vector put into the
combined RBM, while vA, vB , hA and hB correspond to the equivalent
state vectors that would be inputted into the single RBMs. Using these equa-
tions, we can see that the combined RBM energy factorizes into a sum of
the original RBM energies and the probability is the product of the original
probabilities.
v =
 v1...
vn+m−1
 h =
 h1...
hr+s
 (5)
vA =

v1
...
vl
...
vn

, vB =

vn+1
...
vn+l−1
vl
vn+l
...
vn+m−1

, (6)
hA =
h1...
hr
hB =
hr+1...
hr+s
 (7)
EA+B(v, h) = EA(vA, hA) + EB(vB , hB), (8)
pA+B(v, h) =
1
ZA+B
e−EA+B(v,h) ∝ pA(vA, hA)pB(vB , hB) (9)
Because of the probabilities approximately multiplying (Eq. 10), we can
also say that if each of the distributions differed from the “ideal” distribu-
tion (denoted here by q), then we can expect the error (as measured by the
8
KL divergence eqn. 11) to increase approximately linearly with the number
of distributions summed together. As contrastive divergence learning ap-
proximately follows the gradient of the KL divergence, the merged model
represents a good initial condition for training of the larger model 43. This
means that only small corrections in CD training are needed on the merged
model to create a good trained model for the larger network. Training the
merged model is possible as intermediate nodes are represented as extra visi-
ble units, and can be calculated based on the input and outputs of the dataset.
The data for the merged models can be calculated as if we were propagat-
ing values through a digital circuit and keeping track of intermediate values,
which become the data for the merged model to be trained on.
p = pA(vA, hA)pB(vB , hB), q = qA(vA, hA)qB(vB , hB) (10)
DKL(p‖q) ≈ DKL(pA‖qA) +DKL(pB‖qB); (11)
Retraining for Quantization To train for quantization, the loss function
that is optimized for (L(W,d)) is modified so that in addition to having the
regular Contrastive Divergence loss between the weights W and the data d
denoted by CD(W,d), we have a loss term that pushes weights to be closer
to their quantized value. The hyperparameter λ is slowly increased during
training to force the weights progressively closer to their quantized value.
This method allows the contrastive divergence term to fix the errors created
by quantizing the weights slowly while training. Although taking the exact
gradient of this loss term is not possible (as Q(W ) is not a smooth function
of W , see Eqn. 13), we find that by assuming the quantization gradient
∂Q(W )
∂W
≈ 0, we obtain a sufficiently good performance.
L(W,d) = CD(W,d)− λ||W −Q(W )||1 (12)
∂L(W,d)
∂W
= 
∂CD(W,d)
∂W
− λsign(W −Q(W ))(1− ∂Q(W )
∂W
)
(13)
FPGA Programming At the heart of the FPGA is the RBM computing
core which performs the Gibbs sampling algorithm. All programming was
done using the Xilinx Vivado suite on the on the Xilinx Virtex UltraScale+
XCVU9P-L2FLGA2104. The core was designed to output the most sam-
ples for the RBM sizes we had. The weights and biases are stored in on-chip
SRAM to decrease access time. The values are broadcast to the node up-
date modules each cycle, which performs the necessary operations for the
sampling and take up the bulk of the computation resources. There are no
pipeline or data hazards, removing the need for any complex timing schemes.
Thus, if we instantiate a node update module for every node register, there is
a new sample taken from the visible node registers every clock cycle, taking
full advantage of the RBM’s parallelism.
Each node update module contains the logic to perform a matrix-row
multiplications, a sigmoid function, and a comparison with a random num-
ber. The matrix-row multiplication is performed via a binary mask and an
adder module that accumulates the surviving weight values. Once each
weight value is masked appropriately, they are passed into an adder tree
which accumulates the results of each multiplication. The accumulators take
the majority of LUT resources on the FPGA, and represent the bottleneck
in the computation. In our implementation, we use single cycle accumula-
tion, but multi-cycle accumulation is possible to save on hardware resources
while scaling up for larger RBM sizes. A fixed point sigmoid function is
implemented as a LUT, which allows for speed without expensive hardware
operations. A Python script generates the LUT Verilog code in order to test
different bit lengths and fixed point locations. Finally, a linear feedback
shift register (LFSR) generates a pseudo-random number. The number is
compared to the output of the sigmoid LUT and the relevant node is updated
with the boolean result.
Results from the bank of visible node registers is buffered to an IO con-
troller with a FIFO. The IO controller also uses a memory-mapped interface
that can program the weights, clamps, and biases. The controller commu-
nicates to our desktop via PCIe. A simple PCIe link is provided through
a Xillybus IP Core 44 which provides up to 800 MB/s data transfer rate.
This is a sufficient speed to get all of the sample data off of the FPGA, but
faster speeds are possible for future implementations. To handle the large
data stream from the bus, a C backend was created to serve data to existing
Python code for RBM analysis. This C backend is used for analysis of data
and to judge the quality of the solution on the FPGA. This FPGA pipeline
provides an efficient method for solving the problems of interest, where the
limiting factor in computation speed can become the FPGA sampling speed.
9
References
1. Colwell, R. The chip design game at the end of Moore’s law. In 2013
IEEE Hot Chips 25 Symposium, HCS 2013 (Institute of Electrical and
Electronics Engineers Inc., 2013).
2. Waldrop, M. M. More Than Moore. Nature 530, 144–147 (2016).
3. Barahona, F. On the computational complexity of Ising spin glass models.
Journal of Physics A: Mathematical and General 15, 3241 (1982).
4. Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated
annealing. Science 220, 671–680 (1983).
5. Lucas, A. Ising formulations of many NP problems. Frontiers in Physics
2, 1–14 (2014).
6. Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for
boltzmann machines. In Cognitive Science, vol. 9, 147–169 (Morgan
Kaufmann, San Francisco (CA), 1985).
7. Korst, J. H. & Aarts, E. H. Combinatorial optimization on a Boltzmann ma-
chine. Journal of Parallel and Distributed Computing 6, 331–357 (1989).
8. Hinton, G. E. Training Products of Experts by Minimizing Contrastive
Divergence. Neural Computation 14, 1771–1800 (2002).
9. Tieleman, T. Training restricted boltzmann machines using approxima-
tions to the likelihood gradient. In Proceedings of the 25th International
Conference on Machine Learning, 1064–1071 (ACM, 2008).
10. Tieleman, T. & Hinton, G. Using fast weights to improve persistent con-
trastive divergence. In ACM International Conference Proceeding Series,
vol. 382, 1033–1040 (ACM Press, 2009).
11. Geman, S. & Geman, D. Stochastic Relaxation, Gibbs Distributions, and
the Bayesian Restoration of Images. Readings in Computer Vision 564–
584 (1987).
12. Camsari, K. Y., Faria, R., Sutton, B. M. & Datta, S. Stochastic p -Bits for
Invertible Logic. Physical Review X 7, 031014 (2017).
13. Jouppi, N. P. et al. In-Datacenter Performance Analysis of a Tensor Pro-
cessing Unit. 44th International Symposium on Computer Architecture
(ISCA), (2017).
14. Ly, D. L. & Chow, P. A high-performance FPGA architecture for Restricted
Boltzmann Machines. In Proceedings of the 7th ACM SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, FPGA’09, 73–82
(2009).
15. Kim, S. K., McAfee, L. C., McMahon, P. L. & Olukotun, K. A highly scal-
able restricted Boltzmann machine FPGA implementation. In FPL 09:
19th International Conference on Field Programmable Logic and Appli-
cations, 367–372 (2009).
16. Kim, S. K., McMahon, P. L. & Olukotun, K. A large-scale architecture for
restricted Boltzmann machines. In Proceedings - IEEE Symposium on
Field-Programmable Custom Computing Machines, FCCM 2010, 201–
208 (2010).
17. Han, S., Mao, H. & Dally, W. J. Deep compression: Compressing deep
neural networks with pruning, trained quantization and Huffman coding.
4th International Conference on Learning Representations, ICLR 2016 -
Conference Track Proceedings (2016).
18. Ullrich, K., Welling, M. & Meeds, E. Soft weight-sharing for neural net-
work compression. In 5th International Conference on Learning Rep-
resentations, ICLR 2017 - Conference Track Proceedings (International
Conference on Learning Representations, ICLR, 2019).
19. Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q. & Chen, Y. Com-
pressing neural networks with the hashing trick. In 32nd International
Conference on Machine Learning, ICML 2015, vol. 3, 2275–2284 (2015).
20. Dally, W. High-Performance Hardware for Machine Learning. In Nips
2015 (2015).
21. Hoos, H. H. & Stützle, T. Stochastic Local Search (Elsevier, 2005).
22. Li, B., Najafi, M. H. & Lilja, D. J. An FPGA implementation of a Restricted
Boltzmann Machine classifier using stochastic bit streams. In Proceed-
ings of the International Conference on Application-Specific Systems, Ar-
chitectures and Processors, vol. 2015-Septe, 68–69 (2015).
23. Ly, D. L. & Chow, P. A multi-FPGA architecture for stochastic Restricted
Boltzmann Machines. FPL 09: 19th International Conference on Field
Programmable Logic and Applications 168–173 (2009).
24. Lo, C. & Chow, P. Building a multi-FPGA virtualized Restricted Boltz-
mann Machine architecture using embedded MPI. In ACM/SIGDA Inter-
national Symposium on Field Programmable Gate Arrays - FPGA, 189–
198 (2011).
25. Brémaud, P. Gibbs Fields and Monte Carlo Simulation. In Markov Chains,
253–322 (Springer New York, New York, NY, 1999).
26. Cook, S. A. & A., S. The complexity of theorem-proving procedures. In
Proceedings of the third annual ACM symposium on Theory of computing
- STOC ’71, 151–158 (ACM Press, New York, New York, USA, 1971).
27. Karp, R. M. Reducibility among Combinatorial Problems. In Complexity
of Computer Computations, 85–103 (Springer US, 1972).
28. Yamaoka, M. et al. A 20k-Spin Ising Chip to Solve Combinatorial Opti-
mization Problems With CMOS Annealing. IEEE Journal of Solid-State
Circuits 51, 303–309 (2016).
29. Boyd, J. Silicon chip delivers quantum speeds [News]. IEEE Spectrum
55, 10–11 (2018).
30. Schneider, C. R. & Card, H. C. Analog CMOS Deterministic Boltzmann
Circuits. IEEE Journal of Solid-State Circuits 28, 907–914 (1993).
31. Belletti, F. et al. Janus: An FPGA-based system for high-performance
scientific computing. Computing in Science and Engineering 11, 48–58
(2009).
32. Ko, G. G., Chai, Y., Rutenbar, R. A., Brooks, D. & Wei, G. Y.
Flexgibbs: Reconfigurable parallel gibbs sampling accelerator for struc-
tured graphs. In Proceedings - 27th IEEE International Symposium on
Field-Programmable Custom Computing Machines, FCCM 2019, 334
(Institute of Electrical and Electronics Engineers Inc., 2019).
33. Bojnordi, M. N. & Ipek, E. Memristive boltzmann machine: A hardware
accelerator for combinatorial optimization and deep learning. In IEEE
International Symposium on High Performance Computer Architecture,
1–13 (2016).
34. Wan, W. et al. A 74 TMACS/W CMOS-RRAM Neurosynaptic Core with
Dynamically Reconfigurable Dataflow and In-situ Transposable Weights
for Probabilistic Graphical Models. In Digest of Technical Papers - IEEE
International Solid-State Circuits Conference, vol. 2020-February, 498–
500 (Institute of Electrical and Electronics Engineers Inc., 2020).
35. Dridi, R. & Alghassi, H. Prime factorization using quantum annealing and
computational algebraic geometry. Scientific Reports 7, 43048 (2017).
36. Wang, Z., Marandi, A., Wen, K., Byer, R. L. & Yamamoto, Y. Coher-
ent Ising machine based on degenerate optical parametric oscillators.
Physical Review A - Atomic, Molecular, and Optical Physics 88, 063853
(2013).
37. McMahon, P. L. et al. A fully programmable 100-spin coherent Ising ma-
chine with all-to-all connections. Science (New York, N.Y.) 354, 614–617
(2016).
38. Camsari, K. Y., Salahuddin, S. & Datta, S. Implementing p-bits with Em-
bedded MTJ. IEEE Electron Device Letters PP, 1–1 (2017).
39. Borders, W. A. et al. Integer factorization using stochastic magnetic tun-
nel junctions. Nature 573, 390–393 (2019).
40. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. Journal of
Machine Learning Research 5, 448–455 (2009).
41. Salakhutdinov, R. & Larochelle, H. Efficient learning of Deep Boltzmann
Machines. In Journal of Machine Learning Research, vol. 9, 693–700
(2010).
42. Savich, A. W. & Moussa, M. Resource efficient arithmetic effects on
RBM neural network solution quality using MNIST. Proceedings - 2011
International Conference on Reconfigurable Computing and FPGAs, Re-
ConFig 2011 35–40 (2011).
43. Carreira-Perpiñán, M. Ã. & Hinton, G. E. On contrastive divergence learn-
ing. Tech. Rep., University of Toronto (2005).
44. Preußer, T. B. & Spallek, R. G. Ready PCIe data streaming solutions for
FPGAs. In Conference Digest - 24th International Conference on Field
Programmable Logic and Applications, FPL 2014 (Institute of Electrical
and Electronics Engineers Inc., 2014).
10
Acknowledgements This work was supported by ASCENT, one of six cen-
ters in JUMP, a Semiconductor Research Corporation (SRC) program spon-
sored by DARPA.
Author Contributions Model Synthesis and Analysis was performed by
S.P.; FPGA Programming was performed by S.P and P.C.; Manuscript was
co-wrote by S.P, P.C and S.S; S.S supervised the research. All authors con-
tributed to discussions and commented on the manuscript.
Competing Interests The authors declare that they have no competing fi-
nancial interests.
Correspondence Correspondence and requests for materials can be ad-
dressed to either S.P.
(saavan@berkeley.edu) or S.S. (sayeef@berkeley.edu).
11
Node update
connections
Weight
Array
Bias
Array
Memory
Controller
I/O Controller
PCIe bus
data_out
RBM
DESKTOP: C backend
FPGA
Hidden Node Registers
a) b)
Visible Node Registers
Adder Module
PRNG Sigmoid LUT
vn
<
hM
bn
Wn1
h1
0 10
WnM
0 10
Figure 5: (Supplementary) FPGA Architecture (A) Memory and compute hierarchy. The RBM consists of memory to hold the weight, bias,
and clamp values, registers to hold the node values, and circuitry to perform the node updates, which take up the bulk of the resources. The output
is buffered to the IO controller that communicates results to the PCIe bus. A C backend reads in the data stream from PCIe, and can program
the weights and biases from the memory controller. (B) Example of a node update connection. Given M hidden nodes, the figure depicts the
circuitry to update the nth visible node. The hidden nodes binary mask the nth row of the weight matrix. The results are accumulated in the
adder module and added to the nth visible bias. It is then passed through a sigmoid LUT and compared to the output of a PRNG to update the
value of the visible node.
12
RBM Size
(Vis x Hid)
LUT Usage
(Absolute)
LUT Usage
(%)
FF Usage
(Absolute)
FF Usage
(%)
Power
(W)
8x32 14811 1.25 15437 0.65 5.2
16x64 33564 2.84 23717 1.00 5.2
32x128 117931 9.98 52582 2.22 5.5
64x256 418694 35.42 159428 6.74 5.6
64x512 736800 62.32 270182 11.43 5.9
80x600 990417 83.77 431544 18.25 6.2
Table 1: (Supplementary) FPGA Utilization Utilization numbers for FPGA and various RBM sizes All usage numbers reported are for 8 bit
weights and biases . The usage shows that the FPGA is not memory limited for the problem sizes we are interested in, but compute limited, as
the LUT usage goes up much faster than the FF usage as the problem size grows. All weights and biases fit in on chip SRAM, allowing for fast
access and data reuse.
13
Model Visible Units Hidden Units
Training Time
(minutes)
1 bit Adder 5 6 1
2 bit Adder 8 28 13.5
4 bit Adder 14 64 133
8 bit Adder 26 96 201
16 bit Adder 50 128 321
32 bit Adder 98 192 13000 (approx.)
4 bit Multiplier 8 16 46
6 bit Multiplier 12 48 79
8 bit Multiplier 16 64 655
10 bit Multiplier 20 144 4063
12 bit Multiplier (Merged) 60 352 512
16 bit Multiplier 32 512 3611
16 bit Multiplier (Merged) 78 576 5156
Table 2: (Supplementary) Model sizes and training times for various RBMs As the size of the RBM grows, both the training time for model
convergence and the number of hidden units needed to model the distribution both increase. We find that the largest multiplier model that can be
fit on our version of the FPGA design is the 16 Bit Multiplier/Factorizer. Model analysis and training for models larger than the 16 bit multiplier
took too long to train, and are not displayed here.
14
