A GPU implementation of the Simulated Annealing Heuristic for the
  Quadratic Assignment Problem by Paul, Gerald
ar
X
iv
:1
20
8.
26
75
v1
  [
cs
.D
C]
  1
3 A
ug
 20
12
A GPU implementation of the Simulated Annealing Heuristic for the
Quadratic Assignment Problem
Gerald Paul
Center for Polymer Studies and Dept. of Physics, Boston University
Boston, Massachusetts 02215
Abstract
The quadratic assignment problem (QAP) is one of the most difficult combinatorial optimization problems.
An effective heuristic for obtaining approximate solutions to the QAP is simulated annealing (SA). Here
we describe an SA implementation for the QAP which runs on a graphics processing unit (GPU). GPUs
are composed of low cost commodity graphics chips which in combination provide a powerful platform for
general purpose parallel computing. For SA runs with large numbers of iterations, we find performance
≈ 50 − 100 times better than that of a recent non-parallel but very efficient implementation of SA for the
QAP.
Keywords: quadratic assignment problem, simulated annealing, graphics processing unit
1. Introduction
Recently, relatively inexpensive commodity
graphics chips produced in large volumes to pro-
vide basic graphics support on personal computers
have been combined into powerful graphics process-
ing units (GPUs) which can be used for general pur-
pose parallel processing [9]. Here we describe the
implementation of the simulated annealing heuris-
tic on such a GPU.
Originally formulated by Koopmans and Beck-
mann [12], the QAP is NP-hard and is considered
to be one of the most difficult problems to be solved
optimally. It was defined in the following context:
A set ofN facilities are to be located atN locations.
The quantity of materials which flow between facil-
ities i and j is Aij and the distance between loca-
tions i and j is Bij . The problem is to assign to
each location a single facility so as to minimize the
cost
C =
N∑
i=1
N∑
j=1
Ai,jBp(i),p(j), (1)
where p(i) represents the location to which i is as-
signed.
Email address: gerryp@bu.edu (Gerald Paul )
There is an extensive literature that addresses the
QAP and which is reviewed in Refs. [1, 6, 10, 13, 15,
4]. The QAP is exceedingly hard to solve optimally.
With the exception of specially constructed cases,
optimal algorithms have solved only relatively small
instances with N ≤ 36. Therefore, various heuris-
tic approaches have been developed and applied to
problems typically of size N ≈ 100 or less. In con-
trast, a travelling salesman problem consisting of
almost 25,000 towns in Sweden has been solved ex-
actly [2].
2. Background
2.1. Simulated Annealing
The simulated annealing heuristic was first ap-
plied to the QAP by Burkhard and Rendl [5] and
was refined by Connolly [7]. The heuristic consists
of swapping locations of two facilities. Proposed
swaps can either be determined randomly or se-
lected according to some sequential enumeration of
all possible swaps. For each proposed swap, the
change in cost, δ, for the potential swap is calcu-
lated. The swap is made if
δ < 0 or
e−δ/T > r, (2)
Preprint submitted to Elsevier August 14, 2012
where T is an analog of temperature in physical sys-
tems that is slowly decreased according to a spec-
ified cooling schedule after each iteration and r is
a uniformly distributed random variable between 0
and 1. Randomly making moves which increase the
cost is done to help escape from local minima.
In traditional implementations of the heuristic,
the cost of making a swap is calculated from scratch
when the swap is considered in order to determine
if the swap should be made.
2.2. Recent efficient non-parallel implementation
Recently an implementation of SA for the QAP
was developed which improves performance over
the traditional implementation by factors of up to
100 for large numbers of iterations on large prob-
lem instances [17]. The approach taken is to at-
tempt to reduce the complexity by maintaining a
matrix ∆ of costs of each possible swap. This ap-
proach is motivated by the work of Taillard [18]),
who applied a similar approach in the application
of another heuristic, tabu search, to the QAP. The
∆-matrix approach is as follows:
(a) Initialize by creating a matrix ∆i,j containing
the cost of swapping i and j for all i and j,
given a current assignment p. Set iteration
number to zero.
(b) Increment the iteration number. Retrieve the
cost ∆r,s of the next possible swap (r, s).
(c) If the cost of the possible swap does not meet
the criteria in Eq. (2), go to (b).
(d) If the proposed swap meets the swap criteria,
update p to reflect the swap, update ∆i,j with
the new swap costs given the new permutation,
and go to (b).
(e) End after I iterations.
The number of iterations in which a swap is per-
formed divided by the total number of iterations is
known as the acceptance rate a(I). The lower the
acceptance range, the more efficient the ∆-matrix
approach.
In addition to providing higher performance,
even in a non-parallel environment, than the tra-
ditional SA implementation, the ∆-matrix imple-
mentation lends itself to GPU processing because,
as explained in Section 3, the costs of all swaps are
calculated together at the same time.
2.3. Need for parallelization
Ideally if many processors are available to run a
heuristic, the most efficient use of these processors
is to ran copies of the heuristic independently on
each processor. However, as shown in [17], statis-
tically, the larger the number of SA iterations, the
higher the quality of the solution found. Thus if
high quality solutions are desired, an SA run with
a high number of iterations can not be replaced
by many independent SA runs each with a smaller
number of iterations; a parallel implementation is
therefore required to allow timely completion of SA
runs with a high number of iterations.
2.4. Previous work on parallel heuristics for the
QAP
James et al. [11] propose a parallel tabu search
algorithm for the QAP and review previously pro-
posed parallel heuristics for the QAP most of which
involve the tabu search heuristic. Recently it was
shown that if high quality solutions are desired, sim-
ulated annealing performs better than tabu search
[16]. We are not aware of any implementations of
the simulated annealing heuristic for the QAP on
a GPU. Simulated annealing presents a complexity
not present in tabu search. Instead of calculating
all possible swaps and then choosing one, a swap is
made when the first swap meeting the requirement
of Eq. (2) is found.
Crainic and Toulous [8] classify parallel heuristic
methods into three categories. Our implementation
falls into their Type 1 category — the paralleliza-
tion of computationally expensive low level opera-
tions.
2.5. Characteristics of graphics processing unit
We have implemented the SA heuristics on the
Nvidia C2070 GPU. The important features of the
NVIDIA GPU architecture are [14]:
• A graphics processing unit contains one or
more symmetric multiprocessors(SMPs). Each
SMP contains multiple processing elements
each of which executes a sequential thread.
Threads are organized into warps of 32 threads
each. One or more warps are further grouped
into a block.
• Threads within a block can be synchronized
with a barrier synchronization GPU primitive.
2
• Code which runs on the GPU is contained in
a kernel. There is a relatively large overhead,
on the order of microseconds, associated with
launching a kernel.
• Each SMP contains a small amount of shared
memory which all of the processing elements
can access with relatively low latency. For the
GPU on which we implemented our program
shared memory is 48 KB.
• All SMPs can access a large device memory but
with high latency on the order of 100 times
longer than access to shared memory; for the
GPU on which we implemented our program
device memory is 1.3 TB. Data must be trans-
ferred from the host CPU memory to device
memory before it can be processed by an SMP.
• When a warp executes an instruction that ac-
cesses device memory, it coalesces the mem-
ory accesses within the warp into one or more
memory transactions for fixed length aligned
segments(128 bytes for the C2070) in device
memory. The greater the coalescence (the
fewer the memory transactions) the better the
performance.
3. Approach
In the ∆-matrix approach, on which we base our
GPU implementation, the bulk of the processing
time is spent (i) updating the matrix ∆ and (ii)
passing through the matrix elements to identify a
swap which meets the criteria of Eq. (2). On the
GPU we implement these steps in the algorithm as
follows:
(i) We update ∆ using multiple threads in mul-
tiple blocks. Each thread calculates elements ∆i,j
with a given i and a range of j. If the range of
j is too small, the setup overhead to process these
elements will be relatively large. For this reason,
we limit the number of threads that are used to
evaluate ∆ with the constraint that each thread
processes at least 16 matrix elements.
(ii) The search for the next potential swap which
satisfies Eq. (2) is also performed with multiple
threads. Starting at the next possible swap after
the last swap performed, multiple threads evalu-
ate a subset of all elements of the ∆-matrix. The
size of the subset is an adjustable parameter. If
one or more valid swaps are found in the subset,
the swap which would have been found first if the
swaps were evaluated sequentially is chosen. If no
valid swap is found, the next subset is treated. The
process is repeated until a valid move is found. If
the subset size is too small, before a valid swap is
found many subsets must be processed with some
attendant overhead. If the subset is too large, more
swaps than need be are evaluated since many valid
swaps will be found. We found good results when
the subset size is determined by the requirement
that each thread processes 16 matrix elements.
In order to implement the algorithm efficiently on
the Nvidia GPU the following must be taken into
account:
(i) With the exception of small problem instances
(N . 100. The matrices A,B and ∆ are too
large to be stored in shared memory and must
be stored in device memory. As noted above
in Sec. 2.5, efficient access to device mem-
ory is achieved when data in contiguous blocks
of 128 bytes is accessed simultaneously by all
threads in a warp. In updating the ∆-matrix,
terms of the form Ai,jBp(i),p(j) must be evalu-
ated. Because it is not possible to control the
permutation p, it is not possible to meet the
requirement for efficient access when updating
the ∆-matrix without a modification in the ap-
proach. For this reason, we maintain a matrix
B′, the elements of which reflect swaps which
have been performed. In conjunction with a
swap of facilities r and s, after updating p we
update the matrix B′
B′i,j ← B
′
p(i),p(j) (3)
for columns and rows of B′ with index r or
s. This is equivalent to swapping rows r and
s and swapping columns r and s. The oper-
ation is of complexity O(N) and is performed
efficiently using N threads making an insignif-
icant contribution to the processing time. Us-
ing B′ instead of B in the updating of ∆, the
terms to be evaluated are of the form Ai,jB
′
i,j
and the requirement for efficient data access
can be met. The insight that the calculation
can be transformed in this way is one of the
keys to good performance.
(ii) Because all elements of ∆ are updated at
the same time, we are able to use the lim-
ited amount of shared memory accessible by
the SMPs to improve performance of updat-
ing ∆ as follows. After a swap of facilities
3
r and s, but before updating ∆, we load
Ar,k, As,k, B
′
r,k, and B
′
s,k for all k into shared
memory requiring only O(N) bytes of shared
memory. No other elements of A and B′ are
are required for the update of elements ∆i,j for
i and j not equal to r or s; other elements of
∆ do require some A and B′ device memory
access.
(iii) There are points in the processing (e.g. af-
ter the identification of the next swap, after
the updating of B′, and after the updating
of ∆) at which a given thread must access a
data element updated by another thread (pos-
sibly in another block). In order to ensure
that the data updating is complete before the
data is used, a synchronization mechanism is
needed. Threads within the same block can
be synchronized with the hardware provided
barrier synchronization mechanism. No such
mechanism is provided for synchronization be-
tween threads in different blocks. For this rea-
son, we have implemented the synchronization
mechanism for threads in all blocks proposed in
[19]. When synchronizing among blocks, there
is a potential for deadlock situations. Deadlock
can occur when a block is queued for process-
ing on an SMP while another block is process-
ing on that SMP. For this reason, we limit the
number of blocks in a kernel to be no greater
than the number of SMPs. The alternative of
completing all GPU processing when synchro-
nization is required and then re-launching the
GPU kernel is not feasible because of the asso-
ciated kernel launch overhead.
(iv) The highest performance is achieved when as
many threads as possible are executing simul-
taneously. Using the ∆-matrix approach all
the swaps costs are updated at the same place
in the implementation, providing a straightfor-
ward opportunity to use many threads.
4. Numerical Results
We perform numerical experiments on a family of
random QAP instances created in the same manner
as the Taixxa instances [18] in QAPLIB [3]: the A
and B matrices are symmetric with zero diagonal;
the matrix elements are chosen from independent
uniform distributions. These same instances were
analyzed in Ref. [17]. Because the GPU version im-
plements the identical heuristic as the non-parallel
implementation, the performance improvement is
not dependent on the nature of the instances used
to measure performance and it is sufficient to com-
pare performance on one family of instances.
The non-parallel version, written in C++ and de-
veloped in Ref. [16], was run on an Intel Xeon 2.4
GHz processor. The GPU version was run on a
Nvidia Tesla C2070 GPU with processing elements
with clock speed of 1.15 GHz. The GPU contains
14 SMPs each of which contains 32 processing ele-
ments. The GPU version is written in C++ with
Nvidia CUDA language extensions [14].
In Fig. 1, for various problem sizes, we plot
run time versus iterations per run, I, for the ef-
ficient non-parallel implementation and our paral-
lel GPU based implementation. We note that the
GPU based implementation becomes more efficient
than the non-parallel implementation for I ≈ 104.
Until the iteration number becomes ≈ 106, the
non-parallel method does not use the ∆-matrix ap-
proach because the acceptance rate of swaps is high
(see [17] where the role of acceptance rate is dis-
cussed in detail). Our parallel implementation uses
the delta matrix for all values of I and therefore suf-
fers when the acceptance rate is high as is the case
for low values of I. This effect is also seen in Fig.
2 in which we plot the performance improvement
P =
tnon−parallel
tparallel
(4)
versus I where tnon−parallel and tparallel are the run
times for the non-parallel and parallel implementa-
tions respectively. For all problem sizes studied, P
increases with increasing number of iterations until
about I ≈ 107 at which point P reaches a plateau;
the additional time contribution from the initial 106
iterations in the parallel implementation then be-
comes negligible.
5. Discussion and Summary
Because
• our parallel implementation provides perfor-
mance improvements of up to a factor of 100
versus the non-parallel implementation of the
∆-matrix approach, and
• the non-parallel ∆-matrix approach provides a
performance improvement of up a factor of 100
versus the traditional implementation [17],
4
the current parallel implementation provides a re-
markable performance improvement of up to 10,000
versus the traditional implementation which has
been in use essentially unchanged for over two
decades.
We emphasize that we make no change to the
original SA heuristic to accommodate the GPU en-
vironment — only to its implementation. Thus our
GPU implementation of the heuristic provides ex-
actly the same results, statistically, as does the tra-
ditionally implemented version.
For more details on the GPU implementation we
refer the reader to C++/CUDA code provided as
supplementary material.
Acknowledgments
We thank Dan Kamalic for technical assistance
in using Boston University’s eng-grid GPU enabled
computer systems and the Defense Threat Reduc-
tion Agency (DTRA) for support.
References
References
[1] Anstreicher K. Recent advances in the solution
of quadratic assignment problems. Math. Program
2003;97: 27–42.
[2] Applegate DL, Bixby RE, Chvata V, Cook WJ. The
Traveling Salesman Problem: A Computational Study
Princeton University Press (2007).
[3] Burkard RE, Cela E, Karish SE, Rendl F. QAPLIB — a
quadratic assignment problem library. J. Global Optim.
1997;10: 391403. http://www.seas.upenn.edu/qaplib/.
[4] Burkard RE, Dell’ Amico M, Martello S. Assignment
Problems SIAM Philadelphia (PA). Chapters 7-9.
[5] Burkhard RE, Rendl FA. thermodynamically motivated
simulation procedure for combinatorial optimization
problems. EJOR 1984;17:169-174.
[6] C¸ela E. The Quadratic Assignment Problem: Theory
and Algorithms (Kluwer, Boston, 1998).
[7] Connolly DT. An improved annealing scheme for the
QAP, European J. Oper. Res. 1990;46:93100.
[8] Crainic TG, Toulouse M. Parallel strategies for meta-
heuristics,Handbook of Metaheuristics, F. Glover and
G. Kochenberger, eds., Kluwer Academic Publishers.
(2003).
[9] Fatahalian K, Houston M. GPUs: A Closer Look, ACM
Queue - GPU Computing. 2008;6:18-29.
[10] James T, Rego C, Glover F. Multistart tabu search and
diversification strategies for the quadratic assignment
problem. IEEE Tran. on Systems, Man, and Cybernet-
ics Part A: Systems and Humans 2009;39:579–596.
[11] James T, Rego C, Glover F. A cooperative parallel tabu
search algorithm for the quadratic assignment problem,
European Journal of Operational Research 2009;195:
810-826.
[12] Koopmans T, Beckmann M. Assignment problems
and the location of economic activities. Econometrica.
1957;25:53–76.
[13] Loiola EM, et al. A survey for the quadratic assignment
problem. European Journal of Operational Research.
2007;176:657–690.
[14] Nvidia CUDA C Programming Guide Version 4.0,
Nvidia Corp, (2011); www.nvidia.com.
[15] Pardalos PM, Rendl F, Wolkowicz H. The quadratic as-
signment problem: A survey and recent developments.
In Pardalos PM, Wolkowicz H. eds. Quadratic Assign-
ment and Related Problems: DIMACS Series on Dis-
crete Mathematics and Theoretical Computer Science,
16 (Amer. Math. Soc., Baltimore, MD, 1994),1–42.
[16] Paul G. Comparative performance of tabu search
and simulated annealing heuristics for the quadratic
assignment problem. Operations Research Letters
2010;38:577-581.
[17] Paul G. An efficient implementation of the sim-
ulated annealing heuristic for the quadratic as-
signment problem, submitted for publication;
http://arxiv.org/abs/1111.1353.
[18] Taillard, E`. Robust taboo search for the quadratic as-
signment problem. Parallel Comput.1991;17:443–455.
[19] Xiao S, Feng W. Inter-block GPU communication
via fast barrier synchronization. IEEE International
Symposium on Parallel & Distributed Processing
(IPDPS),2010:1-12.
5
Figure 1: Run time versus number of iterations per run for
random instances for non-GPU implementation (circles) and
GPU implementation (squares).
6
Figure 2: Performance improvement versus number of itera-
tions per run
7
