Abstract Discrete optimization is a central problem in artificial intelligence. The optimization of the aggregated cost of a network of cost functions arises in a variety of problems including Weighted Constraint Programs (WCSPs), Distributed Constraint Optimization (DCOP), as well as optimization in stochastic variants such as the tasks of finding the most probable explanation (MPE) in belief networks. Inference-based algorithms are powerful techniques for solving discrete optimization problems, which can be used independently or in combination with other techniques. However, their applicability is often limited by their compute intensive nature and their space requirements.
requiring only polynomial space. However, the practical efficiency of these methods relies on their ability to prune redundant or sub-optimal subtrees. Inference-based methods are inspired from dynamic programming (DP) techniques. These methods apply a sequence of transformations to reduce the problem size at each step while preserving its semantics. A well known inference-based approach is Bucket Elimination (BE) [15] . BE iterates over the variables of the problem, reducing the size of the problem at each step, by replacing a variable and its related cost functions with a single new function, derived by optimizing over the possible values of the eliminated variable. The Dynamic Programming Optimization Protocol (DPOP) [53] is one of the most efficient inference-based DCOP solvers, and it can be seen as a distributed version of BE, where agents exchange newly introduced cost functions via messages.
The importance of inference-based approaches arises in several optimization fields including constraint programming [4, 56] . For example, several propagators adopt DP-based techniques to establish constraint consistency. For instance, (1) the knapsack constraint propagator proposed by Trick applies DP techniques to establish arc consistency on the constraint [65] ; (2) the propagator for the regular constraint establishes arc consistency using a specific digraph representation of the DFA, which has similarities to dynamic programming [51] ; (3) the context free grammar constraint makes use of a propagator based on the CYK parser that uses DP to enforce generalized arc consistency [54] .
The main drawback of inference-based methods, including BE and DPOP, is that each transformation may introduce cost functions with large arities, requiring exponential time and space in a key structural parameter of a problem, called induced width. While inference-based approaches may not always be appropriate to solve discrete optimization problems, as their time and space requirements may be prohibitive, they may be very effective in problems with particular structures, such as problems where their underlying primal graphs have small induced widths or distributed problems where the number of messages is crucial for performance, despite the size of the messages. Additionally, approximated inference methods can be effectively used to derive lower bounds, which are important components of branch and bound algorithms, as they can be used to prune parts of the search space by detecting dominated solutions-i.e., solutions whose cost can provably not be lower than the best cost found so far. Mini-Bucket Elimination (MBE) is an approximated variant of BE that can be used for this purpose.
Recent developments on external-memory algorithms have shown that the use of large secondary data storage can be effective to extend the applicability of memory intensive approaches [22, 44, 37, 63] . However, the computational solving runtime remains a bottleneck.
To contrast this background, we note that the structure exploited by inferencebased approaches in constructing solutions makes it suitable to exploit a novel class of massively parallel platforms that are based on the Single Instruction Multiple Thread (SIMT) paradigm-where multiple threads may concurrently operate on different data, but are all executing the same instruction at the same time. The SIMT-based paradigm is widely used in modern Graphical Processing Units (GPUs) for general purpose parallel computing. Several libraries and programming environments (e.g., the Compute Unified Device Architecture (CUDA)) have been made available to allow programmers to exploit the parallel computing power of GPUs.
In this paper, we propose the design and implementation of both an exact and an approximated inference-based algorithm that exploits parallel computation using GPUs to solve WCSPs and DCOPs. Our proposal aims at employing GPU hardware to speed up the inference process, thus providing an alternative way to enhance the performance of inference-based discrete optimization approaches.
This paper makes the following contributions: (1) We propose a novel design and implementation of a centralized and a distributed exact inference-based algorithm, inspired by BE and DPOP, to optimally solve WCSPs and DCOPs, which harnesses the computational power offered by parallel platforms based on GPUs; (2) We introduce an approximated version of the GPU-based inference-based algorithm, inspired by MBE; (3) We report an extensive empirical analysis that shows significant improvements in performance with respect to the sequential CPU-based algorithms, reporting an average speedup of two order of magnitude; and (4) We show the generality of our approach through empirical evaluations on three different GPU architectures, all providing significant speedups. While the description of the techniques proposed in this paper focus on discrete optimization tasks, they also applies to other key problems in probabilistic graphical models, such as, MPE.
Background: Weighted Constraint Satisfaction Problems
A weighted constraint satisfaction problem (WCSP) [40, 61] is a tuple X, D, C , where X = {x 1 , . . . , x n } is a finite set of variables, D = {D x1 , . . . , D xn } is a set of finite domains for the variables in X, with D xi being the set of possible values for the variable x i , and C is a set of weighted constraints (or cost functions). A weighted constraint f i ∈ C is a function that maps tuples defined on the set of variables relevant to f i into R + ∪ {∞}, where ∞ is a special value denoting that a given combination of values is not allowed. The set of variables relevant to f i is referred to as the scope of f i , and denoted as x i ⊆ X. Formally, f i : xj ∈x i D xj → R + ∪ {∞}. 1 A solution is a value assignment for a subset ρ of variables from X that is consistent with their respective domains; i.e., it is a partial function θ : X → n i=1 D xi such that, for each x j ∈ X, if θ(x j ) is defined (i.e., x j ∈ ρ), then θ(x j ) ∈ D xj . The cost of an assignment ρ is the sum of the evaluation of the constraints involving all the variables in ρ. A solution is complete if it assigns a value to each variable in X and has finite cost (i.e., different from ∞). We will use the notation σ to denote a complete solution, and, for a set of variables V = {x i1 , . . . , x i h } ⊆ X, σ V = σ(x i1 ), . . . , σ(x i h ) is the projection of σ to the variables in V, where i 1 < · · · < i h . The goal of a WCSP is to find a complete solution σ * with minimal cost, i.e., σ * = argmin σ∈Σ fi∈C
where Σ is the state space, defined as the set of all possible complete solutions. 1 For simplicity, we assume that tuples of variables are built according to a predefined ordering.
Given a WCSP P , G P = (X, E C ) is the primal graph of P , where {x, y} ∈ E C iff ∃f i ∈ C such that {x, y} ⊆ x i . Given an ordering o on X, we say that a variable x i has lower priority w.r.t. a variable x j , denoted x i ≺ o x j , if x i precedes x j in o.
Definition 1 (Induced Graph, Induced Width [16] ) Given the primal graph G P and an ordering o on its nodes, the induced graph G * P on o is the graph obtained from G P by connecting nodes, processed in descending order of priority, to all their preceding neighbors. Processing a node x i results in the addition of edges connecting pairs of preceding neighbors of x i . Given a graph and an ordering of its nodes, the width of a node is the number of edges connecting it to its preceding nodes in the ordering. The induced width w * o of G P is maximum width over all nodes of G * P along the ordering o. Example 1 Fig. 1(a) illustrates the primal graph of a simple WCSP instance with 4 binary variables, x 1 , x 2 , x 3 , and x 4 , and 5 constraints, f (x 1 , x 2 ), f (x 1 , x 4 ), f (x 2 , x 3 ), f (x 2 , x 4 ), f (x 3 , x 4 ). Fig. 1(b) illustrates the constraints costs of the WCSP, which associate a cost value for each combination of values for the variables in the scope of the constraints. Fig. 1(c) shows the induced graph G * P obtained along the ordering o = x 1 , x 2 , x 3 , x 4 . Its induced width is 3.
Definition 2 (Pseudo-tree [17] ) Given a primal graph G P and an ordering o on its nodes, a DFS pseudo-tree arrangement for G P is a rooted directed tree T = X, E T of G P such that if f i ∈ C and {x, y} ⊆ x i , then x and y appear in the same branch of T . The root of T is the node associated to the variable with lower priority in o. Edges of G P that are in (resp. out of) E T are called tree edges (resp. backedges). The tree edges connect parent-child nodes, while backedges connect a node with its pseudo-parents and its pseudo-children. Fig. 1(d) shows one possible pseudo-tree T = X, E T associated to the primal graph shown in Fig. 1(a) , with E T = {f (x 1 , x 2 ), f (x 2 , x 3 ), f (x 3 , x 4 )}, and order o = x 1 , x 2 , x 3 , x 4 . The node labeled x 1 is the root node; it has a pseudochild, node x 4 . The node labeled x 4 has two pseudo-parents nodes: x 2 and x 1 . The solid lines describe tree edges, while the dotted lines represent backedges.
Example 2

Definition 3 (Projection)
The projection of a cost function f i on a set of variables
In other words, f i|V is constructed from the tuples of f i , removing the values of the variable that do not appear in V and removing duplicate values by keeping the minimum cost of the original tuples in f i .
Definition 4 (Concatenation) Let us consider two assignments θ , defined for variables V , and θ , defined for variables W , such that for each x ∈ V ∩ W we have that θ (x) = θ (x). Their concatenation is an assignment θ · θ defined for V ∪ W , such as for each x ∈ V (resp. x ∈ W ) we have that θ · θ (x) = θ (x) (resp. θ · θ (x) = θ (x)).
We define two operations on cost functions:
• The aggregation of two functions f i and f j , is a function
• The elimination of a variable x j ∈ x i from a function f i , denoted as π −xj (f i ), produces a new function with scope x i \ {x j }, and defined as the projection of f i on x i \ {x j }, i.e.,
While the aggregation and elimination operators are defined on summation and minimization, respectively, for discrete optimization problems, several tasks in belief networks can be solved by using variants of the aggregation and elimination operators [16] .
Bucket Elimination
Bucket Elimination (BE) [15, 16] is a complete inference algorithm that can be used to find the optimal solutions of a WCSP. Algorithm 1 illustrates its pseudocode. BE operates in the following two phases: • Variable Elimination Phase. BE operates from the highest to lowest priority variable. When operating on variable x i , it creates a bucket B i , which is the set of all cost functions that involve x i as the highest priority variable in their scope (line 2).
The algorithm then computes a new cost functionf i by aggregating the functions in B i and eliminating x i (line 3). Thus, x i can be removed from the set of variables X to be processed (line 4) and the new functionf i replaces in C all the cost functions that appear in B i (line 5). Thus, a bucket B i contains both the original WCSP functions as well as the functions placed in it during the variable elimination process. In this work, we refer to thef i functions as the bucket functions.
• Value Assignment Phase. Once the variable with the lowest priority has been processed, the algorithm considers variables in increasing order of priority. For each variable x i , it generates an optimal assignment by selecting a value d i ∈ D xi that minimizes the cost of the functions in B i given the assignments of all the other variables appearing in the scope of the functions in B i . As a byproduct, and without additional overhead, BE can compute the number of consistent solutions of the problem (see [16] , for details). The time and space complexity of BE is exponential on the induced width of the underlying primal graph, which captures the maximum arity of thef i functions (line 3).
Example 3 In our WCSP example of Fig. 1 , during the Variable Elimination Phase, BE operates, in order, on the variables x 4 , x 3 , x 2 , and x 1 . When x 4 is processed, the bucket B 4 = {f (x 1 , x 4 ), f (x 2 , x 4 ), f (x 3 , x 4 )} is generated, and highlighted in Fig. 2 (a)(top) by red edges. The resulting bucket functionf 4 is shown in Fig. 2(a)(bottom) , where the rightmost column shows the values for x 4 after its elimination. BE, hence, updates the sets X = {x 1 , x 2 , x 3 } and C = {f (x 1 , x 2 ), f (x 2 , x 3 ),f 4 }, as shown in the primal graph of Fig. 2(b)(top) , where the functionf 4 is displayed as a dotted line. When x 3 is processed, B 3 = {f (x 2 , x 3 ),f 4 }, andf 3 is shown in Fig. 2(b)(bottom) . Thus, X = {x 1 , x 2 } and C = {f (x 1 , x 2 ),f 3 }, as shown in Fig. 2 (c)(top). Next, x 2 is processed, and B 2 = {f (x 1 , x 2 ),f 3 }, and f 2 is illustrated in Fig. 2(c)(bottom) . Thus, X = {x 1 } and C = {f 2 }, as shown in Fig. 2(d)(top) . Lastly, the algorithm processes x 1 , sets B 1 = {f 2 }, andf 1 is minimized when x 1 = 0, as shown in Fig. 2(d) 
and d i is the best extension of x 1 , . . . , x i−1 w.r.t. B i 10 returnf 1 ment Phase, which operates, in order, on the variables x 1 , x 2 , x 3 , and x 4 . First, it selects the value that minimizesf 1 , (x 1 = 0). Thus, it processes x 2 , and selects the value x 2 = 1, as it minimizesf 2 when x 1 = 0, as illustrated in Fig. 2(c)(bottom) . Similarly, when BE processes x 3 , it selects the value x 3 = 0, as it minimizesf 3 when x 1 = 0 and x 2 = 1, illustrated in Fig. 2(b)(bottom) . Finally, BE processes the last variable x 4 and assigns it the value 1, since it minimizesf 4 when x 1 = 0, x 2 = 1, and x 3 = 0, illustrated in Fig. 2(a)(bottom) . Thus, σ * = 0, 1, 0, 1 is an optimal solution to the problem, with cost 4.
Mini-Buckets
The memory complexity and time complexity of BE depend on the arity of the functionsf produced during the variable elimination step. Such requirements can quickly become infeasible for problems with large induced widths. To overcome this limitation, Dechter and Rish proposed an incomplete version of the Bucket Elimination [19] . The Mini-Bucket Elimination (MBE) is an approximated version of the BE that allows one to bound the arity of the functionsf i generated during the Variable Elimination Phase. Its pseudocode is illustrated in Algorithm 2. Similarly to BE, MBE operates in the following two phases:
• Variable Elimination Phase. As in BE, during the variable elimination phase, MBE operates on the problem variables in decreasing order of priority. However, rather than creating a single bucket functionf i whose scope is the union of the scope of each function in the bucket B i , it partitions B i in a set of m "mini"-buckets {B i1 , . . . , B im }, such that the size of the scope of the bucket functionf i k , obtained by aggregating the functions in B i k , is bounded by a parameter z, for each k ∈ {1, . . . , m} (line 3). Thus, MBE considers each mini-bucket independently, and computes m new bucket functionsf i k , by aggregating the functions in B i k and eliminating x i (line 5). These functions replace in C all the functions that appear in B i k (line 6), and the set of variables is updated as in BE (line 7).
• Value Assignment Phase. This phase is analogous to that of BE (lines [8] [9] . Consider the elimination step for a variable x i ∈ X. Since:
eliminating x i using mini-buckets produces a lower bound on the optimal cost for the bucket B i . Thus, MBE produces a lower bound on the optimal solution cost. Running the Value Assignment Phase might hence return a sub-optimal solution, whose evaluation will be an upper bound on the optimal solution cost.
The time and space complexity of MBE is exponential on the maximal arity of the aggregated functions in the mini-buckets (line 13), and thus it is bounded by the parameter z.
Example 4 Consider the WCSP of Fig. 1 solved with MBE using z = 1. As in BE, during the Variable Elimination Phase MBE operates, in order, on the variables x 4 , x 3 , x 2 , and x 1 . When x 4 is processed, the bucket B 4 = {f (x 1 , x 4 ), f (x 2 , x 4 ), f (x 3 , x 4 )}-illustrated by the red edges in Fig. 2(a) topwould result in aggregated bucket function whose arity is 3, and thus exceeds the maximal arity allowed. Thus, MBE creates a partition {B 41 , B 42 , B 43 } for B 4 , whose sets consists of the functions, respectively, f (x 1 , x 4 ), f (x 2 , x 4 ), and f (x 3 , x 4 ). The resulting functionsf 41 ,f 42 , andf 43 have arity 1, as illustrated in Fig. 2(a) bottom. Then, MBE updates the sets X to {x 1 , x 2 , x 3 } and C to {f (x 1 , x 2 ), f (x 2 , x 3 ),f 41 ,f 42 ,f 43 }, as shown in the primal graph of Fig. 2(b) top. When x 3 is processed, B 3 = {f (x 2 , x 3 ),f 43 }, marked red in Fig. 2(b) top, and the mini-bucket B 31 = B 3 . The resulting bucket functionf 31 is shown in Fig. 2 Fig. 2 (c) bottom. Thus, X = {x 1 } and C = {f 41 ,f 21 }. Lastly, the algorithm processes x 1 , sets B 1 = B 11 = {f 41 ,f 21 }, andf 11 is minimized when x 1 = 1, as shown in Fig. 2(d) bottom. The Value Assignment Phase is analogous to the process carried by BE, except that when processing variable x 4 MBE assigns it the value 1, since it minimizesf 41 +f 42 +f 43 when x 1 = 0, x 2 = 1, and x 3 = 0 ( Fig. 2(a) bottom) . Thus, σ * = 0, 1, 0, 1 is the reported solution to the problem, with a lower bound cost of 2.
Background: Belief Networks and Most Probable Explanation
A belief network (BN) [50] is a tuple X, D, P , where X = {x 1 , . . . , x n } is a set of ordered variables defined over finite domains D = {D x1 , . . . , D xn }, with o an ordering of the variables in X, and P is a set of conditional probability tables (CPTs).
A CPT f i = {Pr(x i |pa i )} of P denotes the join probability of x i with respect to the variables in pa i , and pa i ⊆ {x j ∈ X|x i ≺ o x j } is the set of variables with higher priority of x i in the ordering o, also called parent variables of x i . A BN B represents the probability distribution over the variables in X:
where σ is a complete assignment for the variables in X. The scope of a CPT f i ∈ P is the set x i = {x i } ∪ pa i . An evidence set σ E is an assigned subset of variables E ⊆ X.
A BN B is represented through a directed acyclic graph G B = (X, E P ), where (y, x i ) ∈ E P iff ∃f i ∈ P such that y ∈ pa i . In other words, E P is the set of all directed arcs from each parent variable of x i to x i , for every x i ∈ X. The primal graph of a BN is called moral graph and it connects any two variables appearing in the same CPT. Example 5 Fig. 4 (a) illustrates a simple belief network with 4 binary variables: x 1 , x 2 , x 3 and x 4 , representing respectively the observations for a patient to have flu, allergy, fever, and sinus infection, and 4 CPTs, each associated to a node and its parent nodes. For example, the CPT table illustrated in Fig. 4 (b) (top), describes the probability the patient has fever given that she/he does has have flu. The conditional probability for x 4 = 0 is implied since the probabilities need to sum up to 1. Similarly, the CPT table of Fig. 4 (b) (bottom) describes the probabilities the patient has sinus infection for each combination of outcomes for flu and allergy. The BN represents the joint probability distribution:
Its moral graph is illustrated in Fig. 4 (c).
One of the main tasks posed over belief networks is that of finding the maximum probably explanation (MPE). Given a BN B, and an evidence set E, finding the MPE correspond to finding a complete assignment for the variables of B that has the maximal probability given the evidence E. More formally, the goal of the MPE is that of finding a complete assignment σ * such that:
where σ pa i is the projection of σ to the variables in pa i . This problems can be solved with small variations of the Bucket Elimination algorithm presented in section 2.1. Bucket Elimination can be adapted to solve MPE tasks on belief networks where the min operator in the projection within the elimination operation is substituted by max operator and the summation in the aggregation operator is substituted by the product (for more details we refer the reader to [16] ).
Background: Distributed Constraint Optimization Problems (DCOPs)
In a Distributed Constraint Optimization Problem (DCOP) [47, 53, 67] , the variables, domains, and cost functions of a WCSP are distributed among a collection of agents. A DCOP is defined as X, D, C, A, α , where X, D, and C are defined as in a WCSP, A = {a 1 , . . . , a p } is a set of agents, and α : X → A maps each variable to one agent. Following common conventions, we assume that α is a bijection: Each agent controls exactly one variable. Thus, we will use the terms "variable" and "agent" interchangeably and assume that α(x i ) = a i . In DCOPs, solutions are defined as for WCSPs, and many solution approaches emulate those proposed in the WCSPs literature. For example, ADOPT [47] is a distributed version of Iterative Deepening Depth First Search, and DPOP [53] is a distributed version of BE. The main difference is in the way the information is shared among agents. Typically, a DCOP agent knows exclusively its domain and the functions involving its variable. It can communicate exclusively with its neighbors (i.e., agents directly connected to it in the primal graph 3 ) , and the exchange of information takes the form of messages. Given a DCOP P , and a DFS pseudo-tree T for the primal graph G P , we use N (a i ) = {a j ∈ A | {x i , x j } ∈ E C } to denote the neighbors of agent a i ; and sep(a i ) to denote the separator of agent a i , which is the set of ancestor agents that are constrained (i.e., they are linked in G P ) with agent a i or with one of its descendant agents in the pseudo-tree T . Fig. 1(a-b) illustrate an example of a DCOP instance with 4 agents, a i (i ∈ {1 . . . , 4}), each controlling one variable, x i . The problem variables, domains and constraints are analogous to those of the WCSP of Example 1. Fig. 1(d) shows one possible pseudo-tree for the DCOP instance, where the agents a 1 and a 2 have one pseudo-child: a 4 . The dotted lines represent backedges.
Example 6
Dynamic Programming Optimization Protocol (DPOP)
DPOP [53] is a dynamic programming based DCOP algorithm that is composed of three phases:
• Pseudo-tree Generation Phase. In this phase the agents coordinate the construction of a pseudo-tree, realized through existing distributed pseudo-tree construction algorithms [35] .
• UTIL Propagation Phase. Each agent, starting from the leaves of the pseudo-tree, computes the optimal sum of costs in its subtree for each value combination of variables in its separator set. The agent does so by aggregating the costs of its functions with the variables in its separator and the costs in the UTIL messages received from its child agents, and then eliminating its own variable.
• VALUE Propagation Phase: Each agent, starting from the root of the pseudo-tree, determines the optimal value for its variable. The root agent does so by choosing the value of its variable from its UTIL computations-selecting the value with the minimal cost. It sends the selected value to its children in a VALUE message. Each agent, upon receiving a VALUE message, determines the value for its variable that results in the minimum cost given the variable assignments (of the agents in its separator) indicated in the VALUE message. Once determined, such assignment is further propagated to the children via VALUE messages.
Example 7 In our example problem, after coordinating to construct the pseudo-tree ( Fig. 1(d) ), agent a 4 , being the leaf of the pseudo-tree, starts the UTIL propagation phase, by computing the optimal cost for each value combination of variables x 1 , x 2 , and x 3 ( Fig. 2 (a)(bottom)), and sending the costs to its parent agent a 3 in a UTIL message. Upon receiving the UTIL messages from each of its children, agents a 3 and a 2 follow an analogous process. When the root agent a 1 receives the UTIL message from each of its children, it computes the minimum cost of the entire problem, and starts the VALUE propagation phase. It selects the value for its variable that minimizes the problem cost ( Fig. 2 (d)(bottom)) and sends this value down to the pseudo-tree to its child, a 3 , in a VALUE message. Upon receiving a VALUE message from its parent, each agents follows the same process.
The time and the space complexities of DPOP are dominated by the UTIL Propagation Phase, which is exponential in the size of the largest separator set sep(a i ) for all a i ∈ A. The other two phases require a polynomial number of linear sized messages (in the number of variables of the problem), and the complexity of the local operations is at most linear in the size of the domain [53] .
Observe that the UTIL Propagation Phase of DPOP emulates the Variable Elimination Phase of BE in a distributed context [10] . Given a pseudo-tree and its ordering o, the UTIL message generated by each DPOP agent a i is equivalent to the aggregated and projected functionf i in BE when x i is processed according to the ordering o.
Approximate Distributed Pseudotree Optimization
Analogously to how DPOP emulates BE in the distributed context, the Approximate Distributed Pseudotree Optimization (ADPOP) algorithm emulates MBE to solve DCOPs [52] . ADPOP has the same three phases as DPOP, and given a pseudo-tree and its ordering o, the content of the UTIL messages generated by each ADPOP agent a i is equivalent to the bucket functionsf ij (j ∈ {1, . . . , i m }) in MBE when x i is processed according to the ordering o.
The complexity of ADPOP is exponential in the input parameter z, while its VALUE Propagation Phase has the same order complexity of the VALUE Propagation Phase in DPOP. Modern Graphics Processing Units (GPUs) are massive parallel architectures, offering thousands of computing cores and a rich memory hierarchy to support graphical processing (e.g., DirectX and OpenGL APIs). NVIDIA's Compute Unified Device Architecture (CUDA) [58] aims at enabling the use of the multiple cores of a graphic card to accelerate general purpose (non-graphical) applications by providing programming models and APIs that enable the full programmability of the GPU. The computational model supported by CUDA is Single-Instruction Multiple-Threads (SIMT), where multiple threads perform the same operation on multiple data points simultaneously.
A GPU is constituted by a series of Streaming MultiProcessors (SMs), whose number depends on the specific characteristics of each class of GPU. For example, the Fermi architecture provides 16 SMs, as illustrated in Fig. 5 (left). Each SM contains a number of computing cores, each of which incorporate an ALU and a floating-point processing unit. Fig. 5 (right) shows a typical CUDA logical architecture. A CUDA program is a C/C++ program that includes parts meant for execution on the CPU (referred to as the host) and parts meant for parallel execution on the GPU (referred as the device). A parallel computation is described by a collection of GPU kernels, where each kernel is a function to be executed by several threads. When mapping a kernel to a specific GPU, CUDA schedules groups of threads (blocks) on the SMs. In turn, each SM partitions the threads within a block in warps 4 for execution, which represents the smallest work unit on the device. Each thread instantiated by a kernel can be identified by a unique, sequential, identifier (T id ), which allows to differentiate both the data read by each thread and code to be executed. 
T id
Wrap of threads . . .
Global Memory
Fig. 6: Coalesced (left) and scattered (right) data access patterns.
Memory Organization
GPU and CPU are, in general, separate hardware units with physically distinct memory types connected by a system bus. Thus, in order for the device to execute some computation invoked by the host and to return the results back to the caller, a data flow needs to be enforced from the host memory to the device memory and vice versa. The device memory architecture is quite different from that of the host, in that it is organized in several levels differing to each other for both physical and logical characteristics.
Each thread can utilize a small number of registers, 5 which have thread lifetime and visibility. Threads in a block can communicate by reading and writing a common area of memory, called shared memory. The total amount of shared memory per block is typically 48KB. Communication between blocks and communication between the blocks and the host is realized through a large global memory. The data stored in the global memory has global visibility and lifetime. Thus, it is visible to all threads within the application (including the host), and lasts for the duration of the host allocation.
Apart from lifetime and visibility, different memory types have also different dimensions, bandwidths, and access times. Registers have the fastest access memory, typically consuming zero clock cycles per instruction, while the global memory is the slowest but largest memory accessible by the device, with access times ranging from 300 to 600 clock cycles. The shared memory is partitioned into 32 logical banks, each serving exactly one request per cycle. Shared memory has an extremely small access latency, provided that multiple thread memory accesses are mapped to different memory banks.
Bottlenecks and Common Optimization Practices
While it is relatively simple to develop correct GPU programs (e.g., by incrementally modifying an existing sequential program), it is nevertheless challenging to design an efficient solution. Several factors are critical in gaining performance. In this section, we discuss a few common practice that are important for the design of efficient CUDA programs.
Memory bandwidth is widely considered to be an important bottleneck for the performance of GPU applications. Accessing global memory is relatively slow compared to accessing shared memory in a CUDA kernel. However, even if not cached, global accesses covering a contiguous 128 Bytes data are fetched at once. Thus, most of the global memory access latency can be hidden if the GPU kernel employs a coalesced memory access pattern. Fig. 6(left) illustrates an example of coalesced memory access pattern, in which aligned threads in a warp accesses aligned entries in a memory segment, which results in a single transaction. Thus, coalesced memory accesses allow the device to reduce the number of fetches to global memory for every thread in a warp. In contrast, when threads adopt a scattered data accesses (Fig. 6(right) ), the device serializes the memory transaction, drastically increasing its access latency.
Data transfers between the host and device memory is performed through a system bus, which translates to slow transactions. Thus, in general, it is convenient to store the data onto the device memory. Additionally, batching small memory transfers into a large one will reduce most of the per-transfer processing overhead [58] .
The organization of the data in data structures and data access patterns play a fundamental role in the efficiency of the GPU computations. Due to the computational model employed by the GPU, it is important that each thread in a warp executes the same branch of execution. When this condition is not satisfied (e.g., two threads execute different branches of a conditional construct), the degree of concurrency typically decreases, as the execution of threads performing separate control flows can be serialized. This is referred to as branch divergence, a phenomenon that has been intensely analyzed within the High Performance Computing (HPC) community [36, 14, 20] .
GPU-based (Distributed) Bucket Elimination (GPU-(D)BE)
Our GPU-based (Distributed) Bucket Elimination (GpuBE) framework, extends BE and MBE (DPOP and ADPOP, respectively) by exploiting GPU parallelism within the aggregation and elimination operations. These operations are responsible for the creation of the functionsf i in BE andf i k in MBE (lines 3 and 5 of Algorithms 1 and 2, respectively) and the UTIL tables in DPOP and ADPOP (UTIL Propagation Phase), and they dominate the complexity of the algorithms. Thus, we focus on the details of the design and the implementation relevant to such operations. The key observation that allows us to parallelize these operations is that the computation of the cost for each value combination in a bucket function is independent of the computation in the other combinations. The use of a GPU architecture allows us to exploit such independence, by concurrently exploring several value combinations of the bucket function, computed by the aggregation operator, as well as concurrently eliminating out variables.
Due to the equivalence of BE (resp. MBE) and DPOP (resp. ADPOP), we will refer to the bucket functionsf and UTIL tables resulted by the aggregation and elimination operations of Algorithms 1 and 2, as well as variables and agents, interchangeably.
GPU Data Structures
In order to fully capitalize on the parallel computational power of GPUs, the data structures need to be designed in such a way to limit the amount of information exchanged between the CPU host and the GPU device, minimizing the accesses to the (slow) device global memory, while ensuring that the data access pattern enforced is coalesced. To do so, we store into the device global memory exclusively the minimal information required to compute the bucket functions, which are communicated to the GPU once at the beginning of the computation of each bucket or mini-bucket. This allows the GPU kernels to communicate with the CPU host exclusively to exchange the results of the aggregation and elimination processes.
We introduce the following concept:
Definition 5 (Bucket-table) A bucket-table is a 4-tuple, T = S, R, χ, ≺ , where:
• S ⊆ X, is a list of variables denoting the scope of T .
• R is a list of tuples of values, each tuple having length |S|. Each element in this list (called row of T ) specifies an assignment of values for the variables in S that is consistent with their domains. We denote with R[i] the tuple of values corresponding to the i-th row in R, for i = {1, . . . , |R|}.
• χ is a list of length |R| of cost values corresponding to the costs of the assignments in R. In particular, the element χ[i] represents the cost of the assignment R[i] for the variables in S, with i = {1, . . . , |R|}.
• ≺ denotes an ordering relation used to sort the variables in the list S. In turn, the value assignments, and cost values, in each row of R and χ, respectively, obey to the same ordering.
As a technical note, a bucket table T is mapped onto the GPU device to store exclusively the cost values χ, not the associated variables values. We assume that the rows of R are sorted in lexicographic order-thus, the i-th entry χ[i] is associated with the i-th permutation R[i] of the variable values in S, in lexicographic order. This strategy allows us to employ a simple perfect hashing to efficiently associate row numbers with variables' values. We will elaborate on this topic in Section 6.3. Additionally, all the data stored on the GPU global memory is organized in mono-dimensional arrays, so as to facilitate coalesced memory accesses.
Algorithm Overview
Algorithm 3 illustrates the pseudocode of GpuBE, where z is an input parameter denoting the maximal mini-bucket size to be processed. We use the following notations: -Starred line numbers denote those instructions that are executed concurrently by both the CPU and the GPU. -The symbols ← and ⇔ denote sequential and parallel (i.e., multiple GPU threads) operations, respectively. -If a parallel operation requires a copy from host (device) to device (host), we write D←H ⇔ ( H←D ⇔ ). Host to device (device to host) memory transfers are performed immediately before (after) the execution of the GPU kernel.
GpuBE is composed of three phases: (1) Variable Ordering, (2) Variable Elimination, and (3) Variable Assignment. Let us consider N (x i ) = {x j ∈ X | {x i , x j } ∈ E C }, defined analogously as for the agents' case. During the first phase (line 1), the problem variables are sorted according to a pseudo-tree ordering relation; in particular, we apply the following heuristics in the construction of the pseudo-tree:
For the distributed case, this phase is identical to that of (A)DPOP, where the agents coordinate the construction of a pseudo-tree, using an off-the-shelf message-passing algorithm [35] .
In the second phase, the algorithm processes each variable, in descending order, according to the relation ≺ T , and proceeds as in (M)BE:
• The function CPU::CONSTRUCTBUCKET constructs the bucket B i as illustrated in Algorithm 1, line 2. The algorithm proceeds in creating a partition of this bucket, if required (i.e., if z < w * ). This phase differs slightly in the distributed case, where each agent, upon receiving a new bucket function from its descendant agents, inserts it into its bucket set B i .
• For each mini-bucket B i k (k = 1, . . . , m), GpuBE determines and reserves the amount of global memory to be assigned to each associated bucket-table T i k (line 6). After the GPU::RESERVE function is invoked, a space sufficient to store the bucket-table is allocated, and its cost values χ i k are initialized to 0.
• Thus, GpuBE aggregates the bucket-table T j associated to each function f j in the mini-bucket with the bucket-table T i k (lines 7-9). To do so, it first creates a buckettable T j that encodes the cost values of the bucket function f j , reordering them, if necessary, according to the order on its scope specified by the pseudo-tree relation ≺ T (line 8). This procedure requires a memory transfer from the CPU host to the GPU device global memory. Then, it adds the values χ j of the aggregating buckettable T j into the corresponding entries of the bucket-table T i k (line 9). We will further discuss the details of this function, as well as the other kernel functions, in the next sections.
• Finally, the algorithm invokes a GPU call to eliminate the variable x i from the bucket-table T i k , thereby constructing the bucket functionf i k , which is, finally, copied back to the CPU host memory (line 10). In the distributed case, each agent processes lines 3-6 in parallel without prior coordination. Starting from the leaves of the pseudo-tree, the agents build their UTIL messages containing the bucket functions (lines 5-10), and send them to their parents. Thus, each agent waits to receive the UTIL messages from all of its children before performing the aggregation and elimination operations (lines 7-9 and line 10, respectively) for each mini-bucket. By the end of this phase (line 10), the root agent knows the overall cost for each values of its variable x i . Thus, it chooses the value that results in the minimum cost, and it starts the third phase by sending to each child agent the value of its variable x i .
In the centralized case, when space is not a concern, there is no need of copying the bucket tables back to the host, after the variable elimination step (line 10). Thus, two memory transfer transactions are avoided for each variable being processed.
In the third phase, the algorithm proceeds analogously to as done in (M)BE. For the distributed case, the agents select the values for their variables that minimize their bucket functions costs, given the assignments of their ancestor agents, and send them in VALUE messages to their children. These operations are repeated by every agent receiving a VALUE message until the leaf agents are reached.
While we described the case in which the underlying problem primal graph is connected, our implementation allows us to handle disconnected graphs. This is done by solving the sub-problems in each connected subgraph independently from other subproblems, and retrieving the problem cost by aggregating the costs stored in the root of each pseudo-tree associated to the connected graphs.
GPU-based Constraint Aggregation
We now describe the implementation of the constraint aggregation GPU kernel. This operation, takes as input two bucket-tables: T i k and T j , and aggregates the cost values in χ j to those of χ i k for all the corresponding assignments of the shared variables in the scope of the two bucket-tables. We refer to T i k and T j as to the output and input bucket-tables, respectively.
Consider the example in Fig. 7 , the cost values χ j of the input bucket-table T j (right) are aggregated to the cost values χ i k of the output bucket-table T i k (left)-which where initialized to 0. The rows of the two tables with identical value assignments for the shared variables x 2 and x 3 are shaded with the same color.
To optimize performance of the GPU operations and to avoid unnecessary data transfer to/from the GPU global memory, we only transfer the list of cost values χ for each bucket-table that need to be aggregated, and employ a simple perfect hashing function to efficiently associate row numbers with variables' values. This allows us to compute the indices of the cost vector of the input bucket-table relying exclusively [5]
[6] [7] x2 x3 on the information of the thread ID and, thus, avoiding accessing the scope S and assignment vectors R of the input and output bucket-tables. We now discuss how this process can be efficiently handled on the GPU kernels. Let T out = S out , R out , χ out , ≺ out be the output bucket-table, whose scope is -table χ out , the corresponding row index r in associated to the input bucket-table cost array χ in is given by:
Each term in the summation of Equation (3) (1), respectively, for each k = {1, . . . , s − 1}, and copied onto the GPU global memory with one copy transaction-we allocate them as a single mono-dimensional array.
In order to exploit the highest degree of parallelism offered by the GPU device, we (1) map one GPU thread T id to one element of the output bucket-table r out and (2) adopt the ordering relation ≺ T for each input and output bucket-table processed.
Adopting such techniques allows each thread to be responsible of performing exactly two reads and one write from/to the GPU global memory. Additionally, the ordering relation enforced on the bucket-tables allows us to exploit the locality of data and to encourage coalesced data accesses. As illustrated in Fig. 7 , this paradigm allows threads (whose IDs are identified in red by their T id 's) to operate on contiguous chunks of data and, thus, minimizes the number of actual read (from the input bucket-table, on the right) and write (onto the output bucket-table, on the left) operations from/to the global memory performed by a group of threads with a single data transaction. 6 The constraint aggregation GPU kernel is described in Procedure Gpu::Aggregate, which is computed in parallel by a number of threads equal to the number of rows of the output bucket-table. Each thread identifies its row index r i k within the output bucket-table cost values array χ ir based on its thread ID (line 1), and it initializes a variable that will contain the input bucket-table row index to 0 (line 2). It then copies into the shared memory the static entities mul , div , and mod associated to the aggregation of the the bucket-tables being processed (line 4). A further inspection to the Gpu::Aggregate procedure reveals how it makes use of the auxiliary data structures above to efficiently implement the hash function of equation (3), and retrieve the entry index of the input bucket-table associated to the variables value permutation of the output bucket- Note that this algorithm highly fits the SIMT paradigm adopted by GPUs; the thread ID and the auxiliary mul , div , and mod arrays are used to retrieve and update all the data necessary to compute the output bucket-table. Additionally, the accesses to the global memory are minimized, as the auxiliary arrays are copied into the shared memory.
We illustrate the above process in the following example.
Example 8 Consider the operation of aggregating the input bucket-table T j with the bucket-table T i k of Fig. 7 corresponding, respectively, to the bucket-table representing the constraint f 23 and the bucket-tablef 3 (before eliminating the variable x 3 ) in Fig. 2(b) . With the Equation (3) notation, s = 2, m = 3 and, thus, the index k of the summation ranges from 1 to s − 1 = 1. Therefore:
Therefore, the mapping from the thread IDs (or, equivalently, the output bucket-table row indices r i k ) to the input bucket-table row indices r j is:
As a technical detail, the bucket-tables are created and processed so that the variables in their scope are sorted according to the order ≺ T . This means that the variables with the highest priority appear first in the scope list, while the variable to be eliminated always appear last. We will see, in the next section, that such detail allows us to efficiently encode the elimination operation on the GPU. To fully capitalize on the use of the GPU, we exploit an additional level of parallelism, achieved by running GPU kernels and CPU computations concurrently (lines 8-9 of Algorithm 3). This is possible when the T j bucket-tables can be partitioned in multiple chunks. Fig. 8 illustrates this operation. After transferring the first bucket-table chunk (T j #1 ) into the device memory, the process starts the execution of the Gpu::Aggregate() kernel, which operates on this portion of the bucket table (called Kernel #1 in Fig. 8) . Thus, the control immediately returns to the CPU host, which enforces the next data transfer onto the device memory, through a call to a GPU::RESERVE(T j #2 ). A host-device synchronization point is imposed after each memory transfer (except the first one), to ensure that no overlapping Gpu::Aggregate() GPU kernels are enforced. 
GPU-based Variable Elimination
We now describe the implementation of the variable elimination GPU kernel. This operation takes as input a bucket-table T i k and a variable x i ∈ S i k and removes this variable from the bucket-table's scope, optimizing over its cost rows. As a result, the output bucket-table rows list the unique assignments for the value combinations of S i k \ {x i } in the input bucket-table R i k which minimizes the costs values for each d ∈ D xi . Fig. 9 illustrates this process, where the variable x 3 is eliminated from the buckettable T i k . The column being eliminated is highlighted yellow in the input buckettable. The different row colors identify the unique assignments for the remaining variables x 1 , x 2 , and exposes the high degree of parallelization that is associated to such operation. To exploit this level of parallelization, we adopt a paradigm similar to that employed in the aggregation operation on GPU, where each thread is responsible of the computation of a single output element.
Procedure GPU::ELIMINATE(T i k , x i ) 1 r i k ← the thread's entry ID (T id ) 2 r j ← r i k · |Dx i | / * holds the value of the index entry of
The variable elimination GPU kernel is described in Procedure Gpu::Eliminate, which is computed in parallel by a number of threads equal to the number of rows of the output bucket- Fig. 9 ). Thus, the GPU kernel evaluates the input bucket-table cost values associated to each element in the domain of x i , by incrementing the row index r j , |D xi | − 1 times, and chooses the minimum cost value (lines 4-5). At last, it saves to the associated output row the best cost found (line 6). Note that each thread reads |D xi | adjacent values of the vector χ i k , and writes one value in the same vector. Thus, this algorithm (1) perfectly fits the SIMT paradigm, (2) minimizes the accesses to the global memory as it encourages a coalesced data access pattern, and (3) uses a relatively small amount of global memory, as it recycles the memory area allocated for the input bucket-table, to output the cost values for the output bucket-table.
The ordering ≺ T adopted by the bucket-tables makes this procedure effective, by forcing the variables to be eliminated to be always listed as last. Additionally, we note that reordering the bucket-tables scope may be necessary exclusively when constructing the bucket-table associated to the constraints in C. Indeed, the buckettables constructed by the algorithm preserve this ordering over their scope, since all the problem variables are processed according to the same ordering relation ≺ T , guaranteeing that the variables being eliminated are those with lower priority with respect to ≺ T . Therefore, no reordering will be required in the bucket functions during the process.
Finally, to reduce the memory transfer time, in addition to the technique described in the previous section, we unrolled the for-loop in lines 7-9 of Algorithm 3. Doing so allows us to process all the bucket-tables within a mini-bucket in a single GPU kernel and to copy them to the device using a single transaction.
Theoretical Analysis
We report below a theoretical analysis on the runtime and memory complexity of our GpuBE(z) algorithms. For the distributed case, we report results on the network load and messages size complexity provided by the proposed algorithms. The network load and messages size are defined, respectively, as the total number of messages exchanged by the agents and as the size of the largest message exchanged by the agents during problem resolution. Since our algorithms rely on an inference-based procedure, the agent's complexity (i.e., the maximal number of operations performed by the agents while solving the problem) is equivalent to the size of the largest message exchanged. In turn, the latter corresponds to the memory complexity of the algorithm. We use GpuBE(w * ) and GpuDBE(w * ) to denote our GPU versions of BE and DPOP, respectively, and GpuBE(z) and GpuDBE(z) to denote our GPU versions of MBE and ADPOP, respectively, with mini-bucket size z.
Theorem 1 For a problem P , given an ordering ≺ T on the primal graph G P , the (mini-)bucket tables (resp. UTIL messages) constructed by GpuBE(z) (resp. the GpuDBE(z) agents) are identical to those constructed by (M)BE (resp. the (A)DPOP agents), for z ≤ w * .
Proof The proof follows from the observation that GpuBE(z) and (M)BE are executed on the same induced graph G * P . Thus, the problem variables are processed in the same order by both versions of the algorithms-lines 1 and 6 of Algorithm 1 (1 and 8 of Algorithm 2) for (M)BE, and lines 2 and 11 of Algorithm 3 for GpuBE(z). Analogously, in GpuDBE(z) and (A)DPOP, agents operate on the same pseudo-tree ordering.
For the centralized case, during the Variable Elimination Phase, the bucket construction and mini-bucket partitioning operations of GpuBE(z) (lines 3-4 of Algorithm 3) are identical to those of MBE (lines 2-3 of Algorithm 2). For each minibucket B ij in MBE, the operations to create the bucket functionf i k are identical in both algorithms: the effect of invoking the Gpu::Aggregate(T i k , T j ) routine, in GpuBE, for each bucket-table T i k , corresponding to the bucket function f i k (lines 7-9 of Algorithm 3), is analogous to the aggregation operations performed in MBE: F = fj ∈Bi k f j (line 5 of Algorithm 2, in parenthesis), and the effect of the Gpu::Eliminate(T i k , x i ) routine, which projects the variable x i onto the scope of T ij , produces the bucket functionf i k , which in turn correspond to the elimination operation performed by MBE: π −xi (F ) (line 5 of Algorithm 2). For the distributed cases, both ADPOP and GpuDBE(z) agents perform the same operations described above-during the UTIL Propagation Phase-and populate the UTIL messages they send to their parent. The equivalence between the Variable Elimination and UTIL Propagation Phases of BE and DPOP, with the respective phases in GpuBE(w * ) and GpuDBE(w * ), respectively, follows from the process described above differing exclusively in that partitioning B i produces a single bucket with the same functions as those listed in B i .
The operations performed during the Variable Assignment Phases for (M)BE and GpuBE(z) (lines 5-7, Algorithm 1, for BE , lines 8-9, Algorithm 2, for MBE, and lines 11-12, Algorithm 3, for GpuBE(z)) are identical. Additionally, the variables are processed in the same order in both algorithms. Thus, the solution assignment for the problem variables returned by (M)BE and GpuBE are identical. Similarly, for the distributed case, (A)DPOP and GpuDBE(z) agents perform the same VALUE Propagation phase. Proof This result follows from the equivalence of the Variable Elimination Phases of (M)BE and GpuBE(z), and of the UTIL Propagation Phases of (A)DPOP and GpuDBE(z). During these phases, the construction of the (mini)-buckets requires to save, in the worst case, all possible combinations for the value assignments of the bucket-function with bounded arity z. Thus, they require O(d z ) space. Similarly, for the distributed case, due to the equivalence of (A)DPOP and GpuDBE(z), the largest message exchanged by the agents has size O(d z ). Additionally, the total amount of operations (or, equivalently, bucket-tables rows) that can be processed in parallel during the GPU-based Constraint Aggregation and GPU-based Variable Elimination steps, is bounded by a constant value which de-pends on the GPU card characteristic. Thus, the time complexity of GpuDBE(z) is in O(d z+1 ).
Corollary 2 The network load required for GpuDBE(z) is equivalent to the network load required by (A)DPOP.
Proof This result follow from the equivalence of DPOP with GpuDBE(w * ) and ADPOP(z) with GpuBE(z) (Theorem 1). Since (A)DPOP requires each agent to send one UTIL message to its parent and one VALUE message to each of its children, there are a total of n − 1 UTIL/VALUE messages exchanged-one through each tree-edge of the pseudo-tree T P . Thus, the network load required by (A)DPOP and GpuDBE(z) is in O(n).
Corollary 3 Gpu(D)BE is correct and complete.
Proof The correctness and completeness of GPU-(D)BE(w * ) follow from the correctness and completeness of BE [15] and DPOP [53] , and Theorem 1.
Experimental Results
In this section, we evaluate our GPU implementations of BE and MBE (GpuBE) as well as our GPU implementations of DPOP and ADPOP (GpuDBE) and compare them with their CPU counterparts. 7 Experiments for GpuDBE and (A)DPOP are conducted using a multi-agent DCOP simulator that simulates the concurrent activities of multiple agents, whose actions are activated upon receipt of a message. All algorithms use the same variable ordering in the centralized case and pseudo-tree in the distributed case. Performance of the centralized algorithms are evaluated using the algorithms' wallclock runtime, while the performance of distributed algorithms are evaluated using the simulated runtime metric [64] . The main focus of the evaluation is on runtime and speedup achieved by the GPU implementations with respect to their CPU counterparts. Additionally, to compare the quality of the solution bounds reported by the incomplete algorithms, we also report the best solution quality found within the given time limits by toulbar2 [3] , an optimized, exact centralized solver for WCSPs. Toulbar2 is a state-of-the-art solver that uses a depth-first branch-and-bound process to identify a minimum cost assignment and employs the notion of soft local consistency to prune the search space using the problem lower bound.
Our experiments are conducted on an AMD Opteron 6276 with a 2.3GHz CPU and is equipped with a GPU device GeForce GTX TITAN with 14 multiprocessors, 2688 cores, with a clock rate of 837MHz, and 6GB of global memory.
We performed our experiments on both randomly generated instances on different networks topologies and on standard WCSP benchmarks. 8 We first analyze the runtimes of the CPU and GPU versions of BE and DPOP on randomly generated instances, where we report the runtimes and lower bounds of the GPU and CPU versions of MBE and ADPOP at varying of the bucket size z. Then, to ensure that the speedups are not due to a specific GPU device configuration, we compare the CPU and GPU speedups achieved on 3 distinct GPU architectures, characterized by different clock rates, number of SMs, and memory sizes. Finally, we report the solving time and lower bounds of our GpuBE on an extensive set of WCSP benchmarks to verify the generality of the speedups across different domains. Each solver has 1-hour timeout of wallclock time in the centralized case and a 1-hour timeout of simulated time in the distributed case. Additionally, they have a memory limit of 32GB to solve each problem instance. Results are averaged over all instances. If a solver fails to solve an instance is due to either memory limits (labeled oom) or timeout (labeled oot).
Binary Random Networks
The instances for each binary network topology are generated as follows:
• Random: We create an n-node network, whose density p 1 produces n (n−1) p 1 edges in total. We do not bound the tree-width, which is based on the underlying graph and randomly generated.
• Scale-free: We create an n-node network based on the Barabasi-Albert model [6] .
Starting from a connected 2-node network, we repeatedly add a new node, randomly connecting it to two existing nodes. In turn, these two nodes are selected with probabilities that are proportional to the numbers of their connected edges. The total number of edges is 2 (n − 2) + 1.
• Grid: We create an n-node network arranged as a rectangular grid, where each internal node is connected to four neighboring nodes, while nodes on the grid perimeter are connected to three neighboring nodes unless they are at the corner of the grid, in which case they are connected to two neighboring nodes. We generate 50 instances for each topology, ensuring that the underlying graph is connected. The cost functions are generated using random integer costs in [0, 100], and the constraint tightness (i.e., ratio of entries in the cost table that have a cost of ∞) p 2 is set to 0.5 for all experiments. We set the following as default parameters: For the random and scale-free topology, n = 10, d = max Di∈D |D i | = 10, and p 1 = 0.3, and for the grid topology, √ n = 10. Tables 1-3 report the runtime, in seconds, for random, scale-free, and grid topologies, respectively, varying the number of variables (resp. agents) for the centralized (resp. distributed) algorithms, the size of the variables domains, and the constraint tightness of the primal graph. The first four (three) columns of Table 1, (2 and 3) describe the problem setting adopted for each experiment. The induced width w * is averaged across all instances. All other columns report the average runtime and GPU vs. CPU speedup in parenthesis. We make the following observations:
• The GPU-based inference-algorithms are consistently faster that their CPU counterparts, with speedups of up to 307x. Only two exceptions arise for the random networks, where in the small instances with n = 10, d = 5, p 1 = 0.3, and n = 10, d = 10, p 1 = 0.2, the GPU versions of the algorithms are slower than their CPU counterparts.
• The speedup increases with the problem size. In particular, the speedup increases with increasing induced width and with increasing domain size of the problem variables. Both these factors influence the size of the bucket-tables to be processed. 9 This observation corroborates the effectiveness of the GPU parallelism exploited in the construction of these tables.
• As expected, the inference-based algorithms are unable to process instances characterized by large induced widths or large domain sizes, as the size of the buckettables become intractable with the memory limitations. This is evident in the scalefree and grid networks, where the solvers run out of memory for instances with n ≥ 16 and d ≥ 50, respectively.
• The simulated runtimes of the DCOP algorithms are consistently smaller than the wallclock runtimes of the WCSPs ones. This is due to the fact that agents in different branches of the pseudo-tree can compute their bucket-tables independently from each other.
• Finally, the speedup trends of the distributed algorithms are similar to those of the centralized algorithms. Next, we analyze the performance of the individual kernels that implement the constraint aggregation and the variable elimination processes described, respectively, in Sections 6.3 and 6.4. Figure 10 illustrates the average speedup obtained by the GPU-based constraint aggregation, and the GPU-based variable elimination with respect to their CPU-based counterparts when considering the largest bucket processed in each instance of the random, scale-free, and grid network instances. The reported average speedup for the constraint aggregation operations range from 363x (in scale free networks) to 613x (in random networks). The variable elimination operations achieve an even higher speedup, ranging from 830x (for grid networks) to 911x (for scale free networks). This is due to the high locality of data exploited by the GPUbased variable elimination kernel, which encourages coalesced data accesses, and through memory reuse, where we overwrite the input bucket-table of the variable elimination process with the resulting bucket-table from the same process. Next, we compare our centralized and distributed versions of GpuBE with MBE [15] and ADPOP [53] at varying of the mini-bucket size z ∈ {2, . . . , 10}, on binary constraint networks with random, scale-free, and grid topologies, using the same settings described in the previous section. The instances for each topology are generated as described above. Fig. 11(a-c) illustrate the speedup of the CPU and GPU versions of MBE, respectively, on random networks with n = 20, d = 25, p 1 = 0.3, on scale-free networks with n = 20, d = 25, and on grid networks with √ n = 10, d = 25. The intensity of the color illustrates the solution quality of the bound returned (darker color denotes better solution quality). We make the following observations:
• The speedup obtained by the GPU vs. CPU solvers increases as the size of the minibuckets increases. This observation is consistent with the previous observation that the speedup increases with increasing induced widths.
• The speedup saturates when z = 7 in all benchmarks, reporting maximal speedups of 235x, 274x, and 156x, for random, scale-free, and grid networks, respectively. This phenomena occurs when the maximum concurrent number of GPU threads are scheduled and executed simultaneously by all the GPU SMs-i.e., when there is enough work to saturate the GPU maximal occupancy.
• As for the previous experiment, the speedup trends of the distributed algorithms are similar to those of the centralized algorithms. The correlation 10 of the CPU 10 We use the Pearson product-moment correlation coefficient. vs. GPU speedup between the centralized and the distributed solutions are 0.93, 0.95, and 0.99, respectively for the grid, random, and scale-free network topologies. Table 5 illustrates a comparison of the speedups obtained with three different GPU hardware configurations: TESLA M2075, GeForce GTX Titan and GeForce GTX Titan X, whose specifics are summarized in Table 4 .
TESLA M2075
GeForce GTX Titan GeForce GTX Titan X due to the fact that this card can schedule the smallest number of cores per each SMs (32) . Since each core can run concurrently a wrap (32 threads), its maximal level of concurrency is 14 × 32 × 32 = 14, 336 threads (and is thus the maximum number of parallel aggregation operations). In contrast, the speedup obtained by our GpuBE is the highest on the GeForce GTX Titan X-obtaining a maximal speedup of 646.9x-and saturates when z = 8 in all networks. The maximum number of threads that can run concurrently on this card is 24 × 128 × 32 = 98, 304. The speedups obtained by our solver on the GeForce GTX Titan, used in the rest for the experiments in this paper, are larger than those obtained on the TESLA but smaller than those obtained on the GeForce GTX Titan X. This card can run up to 86, 016 threads. In addition to the number of threads than can run concurrently, the GPU clock rate and L2 cache size play a substantial role in the GPU performance. Finally, Fig. 11(d) illustrates the time spent by the GPU devices while executing the kernel functions (in white) in contrast to the time used for memory transfers and allocations (in blue), at varying mini-bucket size z = {4, 6, 8}. These times are averaged among all instances for the three network topologies examined and are normalized with the respect to the wallclock runtime. The results show that the time spent by the device in performing actual computations increases, with the respect to the memory transfer time, as the mini-bucket size increase. Allocations and memory transfers on the Titan device are slower than on the TESLA and the GTX Titan X. Finally, these times account for the 36% to 55%, 18% to 34%, and 8% to 18% of the total time, respectively for the mini-bucket sizes 4, 6, and 8.
WCSPs Benchmarks
We now report the evaluation of our GpuBE on the following standard WCSPs benchmarks:
• Celar: Radio link frequency assignment problems.
• Coloring: Graph coloring instances cast into minimum coloring instances.
• Iscas89: WCSPs derived from digital circuits.
• Pedigree: Instances from the genetic linkage analysis domain that is associated with the task of haplotyping.
• Spot: Instances of the daily photograph scheduling problem of Earth observation satellites. Tables 6-9 , tabulate the results for the above benchmarks. In each table and for each instance, we report, in order, the instance name-as appearing in the original benchmark-the number of variables n of the problem, the maximum size of their domains d, the number of constraints c, the graph density p 1 , and the induced width w * of the underlying primal graph. In each table, the top row shows the runtimes in seconds of GpuBE(z) at varying bucket size z and GpuBE. The bottom row shows the returned solutions' qualities, where for GpuBE(z), we report the lower bound it returned. When GpuBE failed to report a solution (due to memory limits), we report the solution quality found by toulbar2 (shown in parenthesis) or a dash symbol, if toulbar2 did not terminate within the time limit. The speedup of GpuBE(z) and GpuBE w.r.t. their CPU counterparts are shown in parentheses. For each instance, we vary the bucket size z from 2 to 20, and report the minimum bucket size z min , which is the largest constraint arity of the instance, the maximum bucket size z max = min{w z , 20}, where w z is defined as the maximal bucket size that can be processed within the hardware memory limits, and the intermediate bucket sizes
Consistent with our previous observations, the algorithms' speedups and solution qualities increase as the bucket size increases. Additionally, for several large problems instances (e.g., scen06-24reduc-scen06reduc in the Celar benchmark), our GPU implementation of MBE can report good lower bounds quickly (within a few seconds), whereas solving the entire problem with the most competitive soft consistency technique in toulbar2 requires from 6 to 48 minutes. For other large instances (e.g., in the Spot benchmark), we observe that toulbar2 ran out of time for the majority of the instances, while our GpuBE(z) can quickly find lower bounds, which could be used in a AND-and-OR search type as proposed by Marinescu and Dechter [46] .
Related Work
The use of GPUs to solve difficult combinatorial problems has been explored by several proposals in different areas of constraint optimization. For instance, Meyer et al. [39] proposed a multi-GPU implementation of the simplex tableau algorithm that relies on a vertical problem decomposition to reduce communication between GPUs. In constraint programming, Arbelaez and Codognet [5] proposed a GPU-based version of the Adaptive Search algorithm, which explores several large neighborhoods Table 6 : Celar Benchmark: Runtime (in seconds) of GpuBE, at varying of the bucket size z and GpuBE(w * ) (top), and solution quality (bottom). The speedup of GpuBE(z) and GpuBE(w * ) w.r.t. their CPU counterparts are shown in parenthesis.
in parallel, resulting in a speedup factor of 17. Campeotto et al. [12] proposed a GPUbased framework that exploits both parallel propagation and parallel exploration of several large neighborhoods using local search techniques, leading to a speedup factor of up to 38. The combination of GPUs with dynamic programming has also been explored to solve different combinatorial optimization problems. For instance, Boyer et al. [9] proposed the use of GPUs to compute the classical DP recursion step for the knapsack problem, which led to a speedup factor of 26. sented a DP-based solution for the coalition structure formation problem on GPUs, reporting up to two orders of magnitude of speedup. In a recent work, Bistaffa et al. [7] study the parallelization of an inference-based algorithm to solve COPs using GPUs, albeit exclusively in the centralized case. Silberstein et al.
[62] study a GPUbased kernel for the sum-product operations that arise in marginalize a product of functions (MPF) problems. The authors report an average speedup factor of 15 for random benchmarks and Bayesian networks and higher average speedups (up to two orders of magnitude) for log domains due to the difference in performance of the log2f and exp2f functions on the CPU and GPU. In the distributed constraint optimization context, GPU parallelism has been applied to speed up several DCOP solving techniques. Fioretto et al. [28] proposed a multi-variable agent decomposition strategy to solve general DCOPs with complex local subproblems, which makes use of GPUs to implement a search-based and a Table 10 : Spot Benchmark: Runtime (in seconds) of GpuBE, at varying of the bucket size z and GpuBE(w * ) (top), and solution quality (bottom). The speedup of GpuBE(z) and GpuBE(w * ) w.r.t. their CPU counterparts are shown in parenthesis.
sampling-based algorithm to speed up the agents' local subproblems resolution. Le et al. [42] studied a GPU accelerated algorithm in the context of stochastic DCOPsDCOPs where the values of the cost tables are stochastic. The authors used SIMTstyle parallelism on a DP-based approach, which resulted in a speedup of up to two orders of magnitude. Recently, a combination of GPUs with Markov Chain Monte Carlo (MCMC) sampling algorithms has been proposed in the context of solving DCOPs [27] , where the authors adopted GPUs to accelerate the computation of the normalization constants used in the MCMC sampling process as well as to compute several samples in parallel, resulting in a speedup of up to one order of magnitude.
Finally, a parallelization of the AND/OR Branch-and-Bound search with minibucket heuristic has been presented in [41] .
Differently from other proposals, our approach aims at using GPUs to exploit SIMT-style parallelism from DP-based methods to solve general, exact and approximated, WCSPs and DCOPs.
Conclusions and Discussions
Inference-based algorithms are powerful tools for solving discrete optimization problems. However, their applicability is limited by their high time and space requirements. Motivated by the increasing availability of GPUs, in this paper, we proposed a scheme to speed up the resolution of inference-based methods for centralized and distributed constraint optimization by exploiting SIMT-style parallelism. We introduced an exact algorithm and an approximated algorithm that are inspired by BE and MBE for WCSPs and tasks over belief networks (e.g., MPE), and by DPOP and AD-POP for DCOPs. These procedures make use of multiple threads in the GPU cards to parallelize the aggregation and elimination procedures, which are responsible for the high complexity in the inference-based approaches. Additionally, we detailed the design of the data structures adopted to process cost functions with GPUs, and of the mapping adopted to associate GPU threads to cost functions' entries, which allows us to efficiently exploit the data parallelism (SIMT) supported by GPUs.
Finally, we reported an extensive experimental evaluation of our inference-based GPU implementations on both centralized and distributed benchmarks. We showed that the use of GPUs provides significant advantages in terms of runtime and scalability, achieving speedups of up to two order of magnitude, showing a considerable reduction in runtime (up to 345 times faster) with respect to the serialized version, and that the speedups increase with the induced width of the problem and with the size of the domain of the problem's variables.
The proposed results are significant-the wide availability of GPUs provides access to parallel computing solutions that can be used to improve efficiency of WCSPs and DCOP solvers. Furthermore, GPUs are renowned for their complex architectures (multiple memory levels with very different size and speed characteristics; relatively slow cores), which often create challenges to the effective exploitation of parallelism from irregular applications. The strong experimental results indicate that the proposed algorithms are well-suited to GPU architectures.
These results hint that our approach could be exploited in the the context of a dynamic search that makes use of the mini-bucket elimination method as an heuristic to infer bounds on the solution quality, potentially allowing dynamic variable orderings. Indeed, the main drawback of this type of search (and various look-ahead methods) is that, since the heuristic is (re)computed in various nodes during search process, the time invested in the heuristic computation may not be cost-effective. Leveraging the use of GPUs to infer bounds faster during the dynamic search, may therefore produce dramatic speedup to the whole search process.
While this paper describes the applicability of our approach to (M)BE and (A)DPOP, analogous techniques can be derived and applied to other inference-based approaches to solve discrete optimization problems (e.g., to implement the logic of inference-based propagators) and optimization on probabilistic graphical models-(e.g., in finding maximum probability explanation (MPE) in belief networks). Additionally, our work can be extended to solve sum-product problems, also known as weighted counting, partition function, or probability of evidence. Due to the difficulty of these problems the value of accelerated versions of the bucket elimination algorithms is especially important. We plan to work in this direction as future work.
We also envision this technology could open the door to efficiently enforcing higher forms of consistencies than domain consistency (e.g., path consistency [48] , adaptive consistency [18] , or the more recently proposed branch consistency for DCOPs [25] ), especially when the cost functions need to be represented explicitly.
