Abstract-Bucket elimination (BE) is a framework that encompasses several algorithms, including belief propagation (BP) and variable elimination for constraint optimization problems (COPs). BE has significant computational requirements that can be addressed by using graphics processing units (GPUs) to parallelize its fundamental operations, i.e., composition and marginalization, which operate on functions represented by large tables. We propose a novel approach to parallelize these operations with GPUs, which optimizes the table layout so as to achieve better performance in terms of speedup and scalability. Our approach allows us to process incomplete tables (i.e., tables with some missing variable assignments), which often occur in several practical applications (such as the ones we consider in our dataset). Finally, we can process tables that are larger than the GPU memory. Our approach outperforms the state-of-the-art technique to parallelize BP on GPUs, achieving better speedups (up to +466% with respect to such parallel technique). We test our method on a publicly available COP dataset, measuring a speedup of up to 696.02× with respect to the sequential version. The ability of our technique to process large tables is crucial in this scenario, in which most of the instances generate tables larger than the GPU memory and hence cannot be solved with previous GPU techniques related to BE.
BE generalizes several DP-based algorithms, including DP for combinatorial optimization [7], variable elimination for constraint optimization problems (COPs) [2], and distributed COPs [8]. More precisely, BE operates by propagating messages over a series of buckets, i.e., sets of constraints (or functions) that depend on a given variable. Such messages, encoded by functions in tabular form, are constructed by means of two fundamental operators, ⊕ and ⇓, often referred to as composition and marginalization. One of the main advantages of BE is its generality, as the algorithm can be applied to solve problems in several different fields by changing the fundamental mathematical operations used by these two operators.
While the number of exchanged messages is usually rather small (i.e., equal to the number of input variables minus one), the computation of the messages themselves represents the most computationally intensive task of the entire algorithm, whose complexity can be precisely predicted and is exponential with respect to a single parameter, the induced width of the graph representing the initial problem [2]. For this reason, having a very efficient algorithm to perform such a computation is of utmost importance when dealing with realistic problems, which often require BE to exchange very large messages (i.e., up to 60 GB in our experiments).
In recent years, graphics processing units (GPUs) have been successfully used to speed up the computation in different cybernetic applications that feature a high level of parallelism, achieving performance improvements of several orders of magnitude [10] in fields including computer vision [11], human-computer interaction [12], and artificial intelligence [13]. In this paper, we are interested in developing a high-performance GPU framework that allows us to deal with the computational effort inherent in the message passing phase of several BE-based algorithms, so as to favor its integration in the development of cybernetic solutions, especially in fields like BP, decision making, and scheduling [14]. To this end, our main objective is to devise a solution that fulfills three key requirements. First, since BE is a general algorithm that can be applied to several problems, our framework should be likewise general to allow a wide adoption among different domains, i.e., BP and COPs. Second, our approach should be able to achieve a high computational throughput, by means of optimized memory accesses to avoid bandwidth bottlenecks, a careful load balancing to fully exploit the available computational power, and the adoption of well-known parallel primitives [15], [16] to reduce the CPU workload to a minimum. Third, our solution should tackle large-scale real-world problems and, hence, it should not be limited by the amount of GPU global memory.
In this context, the work of Zheng et al. [17] (later improved by Zheng and Mengshoel [18] ) represents, to the best of our knowledge, the only approach that tackles the high degree of parallelism of the BE composition and marginalization operators (specifically devised for BP) using GPUs. Specifically, they exploit the fact that such operators execute the same mathematical operation over a large amount of input data and, therefore, they are suitable for the single instruction multiple data model adopted by these architectures [10] . However, this approach has several drawbacks that hinder its applicability in general real-world scenarios. In fact, it accesses input data in a nonregular way by means of an indexing table, which introduces additional overhead and causes the lack of coalescence and data locality in memory transfers. Furthermore, as a consequence of this lack of regularity, input tables must fit entirely into the GPU global memory, since threads may need to access data that could be anywhere in such tables. Therefore, this method is not applicable to instances whose tables exceed this limitation. Finally, since this approach has been specifically devised for BP, where tables are complete (i.e., they include a row for every possible assignment of the variables in their scope), it is unable to cope with real-world problems with incomplete tables, e.g., COPs, weighted CSPs (WCSPs) [14] and graph-constrained coalition formation [19] , where some variable assignments are unfeasible, i.e., they violate some sort of hard constraint. Notice that, even if it is possible to model incomplete tables as complete ones by representing unfeasible assignments as explicit rows with special values (i.e., −∞ for a maximization task), this approach is not practical since it results in tables whose size is not manageable.
Against this background, we propose a novel GPU parallel approach for the BE message passing phase that achieves the objectives set above. In more detail, we advance the state-of-the-art in the following ways. 1) We propose an algorithm to preprocess input tables by organizing the columns in a specified order, thus achieving fully coalesced memory accesses in the message passing phases. Such a preprocessing phase can cope with both complete and incomplete tables, and can be realized using fully parallelizable operations. We formally analyze this algorithm, proving its correctness and giving its worst-case computational complexity. 2) We propose an implementation of the GPU kernel that exploits the table layout specified before. We show that such an arrangement enables pipelined data transfers from the host to the GPU (hence optimizing the transfer time) and allows the use of highly efficient routines for crucial parts of the BE message passing phase (i.e., the ⊕ and ⇓ operators). Our method is not limited by the amount of GPU memory, as our data layout allows us to process large tables by splitting them into manageable chunks that meet the memory capabilities of the GPU.
3) We empirically evaluate our approach on two particular BE realizations, i.e., BP and COP, adopting standard datasets in both cases. On the one hand, we compare our approach against the one proposed by Zheng and Mengshoel [18] on the same dataset. Our results show significant improvements, achieving speedups at least 56% higher than the alternative method (reaching peaks of +466%). On the other hand, we employ our method to solve a standard, publicly available WCSP dataset [14], by modeling the instances as COPs and then adopting the BE algorithm to compute their solution. The tests show that our method results in a speedup of up to 696× with respect to the serial approach. We demonstrate the importance of handling tables that exceed the GPU global memory in this scenario, since the instances would otherwise be unsolvable given their dimension.
II. BACKGROUND
The purpose of this section is threefold. First, in Section II-A, we define the theoretical concepts and the algorithms related to BE and BP on junction trees (JTs). An in-depth discussion of such algorithms can be found, respectively, in [2] and [6]. Second, Section II-B outlines the features of GPU architectures, used to implement our highly parallel approach. Finally, Section II-C discusses previous works related to BE on GPUs.
A. Bucket Elimination
BE [2] is a unifying algorithmic framework, which generalizes DP to accommodate algorithms for many complex problem-solving and reasoning techniques. BE usually accepts inputs in the form of a knowledge-base theory and a query encoded by several functions or relations over subsets of variables (e.g., clauses for propositional satisfiability, constraints, or conditional probability matrices for belief networks). In this paper, we focus on the resolution of constraint networks (CNs), which consist of a set X = {x_1, . . . , x_n} of n variables such that x_1 ∈ D_1, . . . , x_n ∈ D_n, where D_i represents the domain of the variable x_i, together with a set of m constraints {C_1, . . . , C_m}, denoting the variables' simultaneous legal assignments. Non-legal assignments are denoted as unfeasible. In this paper, we are interested in the version of BE that computes the optimal solution for COPs (Algorithm 1) [2]. COPs can model several realistic problems [8] such as WCSPs [14].
Definition 1: A COP is a CN augmented with a set of functions. Let F_1, . . . , F_l be l real-valued functional components defined over the scopes Q_1, . . . , Q_l, Q_i ⊆ X, and let a = (a_1, . . . , a_n) be an assignment of the variables, where a_i ∈ D_i. The global cost function F is defined as F(a) = Σ_{i=1}^{l} F_i(a), where F_i(a) denotes the application of F_i to the restriction of a to the scope Q_i. Solving the COP amounts to finding an assignment a* that maximizes the global cost function.
In Algorithm 1, the ⊕ operator is instantiated in lines 11 and 13 with the relational join and the join sum operation, respectively. Moreover, these lines instantiate the ⇓ operator with the project and the maximization operations, respectively. This difference is due to the fact that constraints are relations [2], as they contain one row per legal assignment of their input variables (while excluding the unfeasible ones). On the other hand, a cost function F_i also specifies a value for each row, representing the cost of that particular variable assignment. In the context of the scope of this paper, it is crucial to note that the presence of constraints in all these problems inherently makes some variable assignments unfeasible. As a consequence, all the rows in the exchanged messages corresponding to such assignments can be dropped, greatly reducing the memory requirements of the algorithm. This is not a mere performance optimization, but it is often necessary to achieve a manageable size of the tables. In fact, the size of such tables is exponential in the induced width of the COP [2], which usually makes this approach impractical if all the rows are explicitly represented in the tables (e.g., assigning −∞ to the unfeasible rows). In order to better understand our contribution to the GPU parallelization of BE, in the following sections we provide a brief description of how composition and marginalization are realized in Algorithm 1.
1) Composition:
We now discuss how the ⊕ composition operator is implemented in Algorithm 1 by the join sum, an operation closely related to the inner join of relational algebra. For the remainder of this paper, tables are represented according to Definition 2, i.e., a table T_i = ⟨Q_i, d_i, R_i, φ_i⟩ comprises a tuple Q_i of variables (its scope), a tuple d_i of the corresponding domain sizes, a tuple R_i of rows (each representing a feasible variable assignment), and a tuple φ_i of values, where φ_i[k] is the value associated to the variable assignment represented by R_i[k]. Moreover, if L is a tuple of elements, we refer to its kth element with L[k], adopting the zero-based numbering convention, i.e., tuples start at index 0. Suppose we want to compute the join sum between T_1 and T_2 (shown in Fig. 1), respectively associated to two tuples of variables Q_1 and Q_2, with Q_12 = Q_1 ∩ Q_2 = ⟨x_1, x_3⟩ representing the shared variables between T_1 and T_2. Notice that, as previously stated, some variable assignments are missing in T_1 and T_2, i.e., the unfeasible assignments.
A row in T 1 matches a row in T 2 if all the shared variables have the same values in both the rows (matching rows have been highlighted with the same color in Fig. 1 ). It is important to note that this is a many-to-many relationship, because multiple rows in the first table can match multiple rows in the second table.
For instance, in Fig. 1 the second row of T_1 matches the first and the fourth rows of T_2, because they all have x_1 = 1 and x_3 = 0. Thus, the result table will have a row for each couple of matching rows in the input tables. In the above example, the corresponding rows in the result table T_1 ⊕ T_2 will be as follows.
These resulting rows are obtained by combining the second row of T_1 with, respectively, the first and the fourth rows of T_2. They both have the same values for the shared variables (x_1 = 1 and x_3 = 0). The values of the nonshared variables (i.e., x_5 and x_8 for T_1, and x_2, x_4, x_6, and x_10 for T_2) are copied from the corresponding matching rows. Hence, in the above example, x_5 = 0 and x_8 = 1 for both the resulting rows (since there is only one matching row in T_1), and x_2 = 0, x_4 = 1, x_6 = 1, and x_10 = 0 for the first resulting row (since it results from the match with the first matching row in T_2), and so on. Thus, the variable set of the resulting table is the union of the variable sets of the input tables. Finally, the values of the resulting rows are obtained by summing the values of the corresponding matching rows, i.e., α_1 + β_0 and α_1 + β_3. It is easy to see that if n rows in T_1 match m rows in T_2, they will result in n · m rows in the resulting table (Fig. 2).
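To make the semantics of ⊕ concrete, the following host-side C++ sketch computes the join sum of two tables by brute force. The table representation, field names, and the quadratic matching loop are illustrative assumptions chosen for exposition only; they are not the parallel implementation described later in the paper.

```cpp
#include <cstdint>
#include <vector>

// Illustrative table representation: each row stores the values of the
// variables in the scope, plus the cost of that (feasible) assignment.
struct Table {
    std::vector<int> scope;                  // variable ids, e.g., {1, 3, 5, 8}
    std::vector<std::vector<uint8_t>> rows;  // one feasible assignment per row
    std::vector<double> values;              // cost of each assignment
};

// Brute-force reference join sum: for every pair of rows that agree on the
// shared variables, emit one output row whose value is the sum of the inputs.
Table joinSum(const Table& t1, const Table& t2) {
    // Map each column of t2 to the matching column of t1 (-1 if not shared).
    std::vector<int> sharedCol(t2.scope.size(), -1);
    for (size_t j = 0; j < t2.scope.size(); ++j)
        for (size_t i = 0; i < t1.scope.size(); ++i)
            if (t2.scope[j] == t1.scope[i]) sharedCol[j] = static_cast<int>(i);

    Table out;
    out.scope = t1.scope;                    // output scope: Q1 followed by Q2 \ Q1
    for (size_t j = 0; j < t2.scope.size(); ++j)
        if (sharedCol[j] < 0) out.scope.push_back(t2.scope[j]);

    for (size_t a = 0; a < t1.rows.size(); ++a) {
        for (size_t b = 0; b < t2.rows.size(); ++b) {
            bool match = true;
            for (size_t j = 0; j < t2.scope.size(); ++j)
                if (sharedCol[j] >= 0 &&
                    t2.rows[b][j] != t1.rows[a][sharedCol[j]]) { match = false; break; }
            if (!match) continue;            // unfeasible combinations are skipped

            std::vector<uint8_t> r = t1.rows[a];
            for (size_t j = 0; j < t2.scope.size(); ++j)
                if (sharedCol[j] < 0) r.push_back(t2.rows[b][j]);
            out.rows.push_back(r);
            out.values.push_back(t1.values[a] + t2.values[b]);  // e.g., alpha + beta
        }
    }
    return out;
}
```

Note that if n rows of the first table and m rows of the second table agree on the shared variables, the inner loops emit exactly n · m output rows, as discussed above.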
2) Marginalization: The second fundamental operation, which implements the ⇓ marginalization operator in Algorithm 1, is the maximization. Suppose that, as a result of the join sum operation at line 13 of Algorithm 1, we obtain the table T in Fig. 3. Now, suppose that x_p = x_8. Then, Algorithm 1 requires maximizing such table while marginalizing out x_8, i.e., removing the column corresponding to x_8. As a result of this removal, some rows may now be equal considering the remaining columns (e.g., R[1] and R[2] both contain ⟨0, 0, 0⟩ in the first three columns, as well as R[3] and R[4], which contain ⟨1, 0, 0⟩). Since one cannot have duplicate rows, the maximization operation computes a single row that, as a value, stores the maximum of the original values (if we marginalize out x_i, we maximize over up to d_i rows). The final result of the maximization of T can be seen in Fig. 4.
Composition and marginalization operators are also employed by more modern versions of BE, e.g., bucket-tree elimination (BTE) proposed by Kask et al. [20] , hence our contributions are valuable also in the context of these newer algorithms. Here we focus on BE since it was the first version of these message-passing techniques to tackle constrained optimization, and its performance is generally comparable with BTE, which, in turn, is optimized for some specific problems, i.e., singleton-optimality problems.
Notice that DP (and in particular BE) is not the only approach to solve COPs, which can also be tackled with DFS-based approaches [21] . While they represent important techniques in the context of constrained optimization, we do not focus on these algorithms in the context of this paper, since DFS is known to be difficult to parallelize [22] .
We now discuss BP on JTs, which is a close variation of BE [2] that is also based on composition and marginalization operators (i.e., scattering and reduction, respectively).
3) Belief Propagation on Junction Trees: BP on JTs [6] is used to perform inference on a Bayesian network (BN), a representation of a joint distribution over a set of n random variables X, structured as a directed acyclic graph whose vertices are the random variables and whose directed edges represent dependency relationships among the random variables. The propagation of beliefs (or posteriors) runs over a JT, generated from a BN by means of moralization and triangulation [6].
Every vertex N_i of the JT contains a set Q_i ⊆ X of random variables that forms a maximal clique in the moralized and triangulated BN, and it is associated to a potential table T_i, which stores a potential value for every possible assignment of the variables in Q_i. Moreover, each pair of adjacent vertices N_i and N_j is associated to a separator table Sep_ij, defined over the shared variables Q_i ∩ Q_j.
BP on JTs is then invoked whenever we receive new evidence for a particular set of variables E ⊆ X, updating the potential tables associated to the BN in order to reflect this new information. To this end, a two-phase procedure is employed: first, in the evidence collection phase, messages are collected from each vertex N_i, starting from the leaves all the way up to a designated root vertex. Then, during evidence distribution, messages are sent from the root to the leaves. In both phases, each recursive call comprises a MESSAGEPASS procedure, which implements the propagation of beliefs between the potential tables T_i and T_j in two steps. First, in the reduction step, the separator table Sep_ij is updated by marginalizing (i.e., summing) T_i over the variables not shared with T_j. Reduction implements the marginalization with the sum operation, and it corresponds to the ⇓ operator of BE. Then, in the scattering step, each row of T_j is multiplied by the ratio between the new and the old value of the corresponding row of Sep_ij; following [18], we assume that 0/0 = 0. Scattering implements the composition with the product operation, and it corresponds to the ⊕ operator of BE. In both BE and BP on JTs, it is easy to see that the message passing phase requires several independent computations spanning over multiple rows of the tables, suggesting a multithreaded algorithm in which such degree of parallelism can be exploited by means of GPUs.
B. GPU Architecture
GPUs are designed for compute-intensive, highly parallel computations. To this end, more transistors are devoted to data processing rather than data caching and flow control. These architectures are especially well-suited for problems that can be expressed as data-parallel computations where data elements are mapped to parallel processing threads. The GPU (device) is mainly employed to implement compute-intensive parts of an application, while control-dominant computations are performed by the CPU (host). In our approach, the GPU is programmed using the CUDA programming framework [10] , which requires the definition of a kernel, a particular routine executed in parallel by thousands of threads on different inputs. Threads are organized in thread blocks, sharing fast forms of storage and synchronization primitives. On the other hand, physical and design constraints limit the number of threads per block, since all the threads of a block are expected to reside on the same streaming multiprocessor (SM) and must share limited memory resources. Memory management is a crucial aspect in the design of efficient GPU algorithms, since memory accesses are particularly expensive and have a significant impact on performance. As a consequence, memory accesses must be carefully devised to achieve high computational throughput (see Section V-A). In what follows, we discuss previous approaches for BE-related algorithms that can be found in the parallel computing literature.
C. Related Work
BP on JTs represents a well-known inference algorithm, which has received significant attention in the parallel computing literature due to its high computational demands. In particular, Xia and Prasanna [23] proposed a distributed approach that addresses these demands by decomposing the initial JT into a set of subtrees and then performing the evidence propagation in the subtrees in parallel on a cluster. In this paper, we also focus on exploiting parallel architectures for BP on JTs, but, in contrast, we aim at parallelizing the single propagation operation, which is the most computationally intensive task of the entire algorithm.
The most recent work addressing the parallelization of BP on GPUs is presented by Zheng et al. [17]. In particular, the authors pursue the same goal tackled by our approach discussed in Section III, i.e., parallelizing the atomic operations of propagation so that they can be embedded in different algorithms. On the other hand, such an approach proposes a different parallelization strategy, devising a 2-D parallelism, in which a higher-level element-wise parallelism is stacked on top of a lower-level arithmetic parallelism, to better exploit the massive computational power provided by modern GPUs. In particular, element-wise parallelism is achieved by computing each of the |R_ij| reduction-and-scattering operations in parallel, which requires |R_ij| mapping tables (one per row of Sep_ij) to allow each concurrent task to correctly locate its input data in the corresponding potential tables. On the other hand, arithmetic parallelism represents the multithreaded computation of each reduction-and-scattering operation, by means of well-known parallel algorithms that can be found in [10].
Although this approach represents a significant contribution to the state-of-the-art, there are some drawbacks that hinder its applicability. In particular, the proposed memory layout is not optimized for GPUs, for three main reasons.
1) Threads need to access data in sparse and discontinuous memory locations using an additional indexing table, breaking coalescence and drastically reducing the throughput of memory transfers (Fig. 8).
Coalescence is crucial and it should be exploited in order to reduce accesses to the global memory, improving the compute-to-memory ratio and achieving a greater computational throughput. 2) Since input data is organized in a discontinuous pattern rather than in continuous chunks, it is mandatory to transfer the entire potential tables to the global memory of the GPU before starting the computation of the BP algorithm, which prevents two desirable properties: a) the approach is not applicable to potential tables that do not fit into global memory, since the sparsity of the data prevents any possibility of splitting them into smaller parts, and b) since the computation cannot be started before the entire input data has been copied to the GPU, the cost of memory transfers cannot be amortized by means of technologies like NVIDIA CUDA streams. Moreover, the authors devise this technique for BP, where tables are complete (i.e., they include a row for every possible assignment of the variables in their scope). Thus, this approach cannot be applied to problems in which tables are incomplete, e.g., COPs and WCSPs.
In the context of COPs, the only work that specifically focuses on the implementation of the BE algorithm for many-core architectures is the one by Fioretto et al. [24], in which the authors devise an algorithm to realize the join sum and the maximization operations (referred to as aggregate and project) on GPUs, by exploiting the high degree of parallelism inherent in these operations. Although this approach represents a significant contribution to the state-of-the-art, there are some drawbacks that hinder its applicability. First, the indexing of the tables is executed by using a minimal perfect hash function, i.e., a hash function that maps n keys to n consecutive integers, which can be easily adopted as the indices of such keys. Although minimal perfect hash functions can be used in parallel by different threads to index the input, their construction is inherently sequential, since the index of a key depends on the indices assigned to the previously considered keys [25]. This aspect reduces the efficiency of this approach especially on big instances, as shown by our experiments in Section VI-B.
To overcome these limitations, we propose a better way to tackle the GPU computation of the message-passing phase of BE. In particular, we first present a technique that improves the parallelization of BP with complete tables (already published in [26] ), and then we extend it in order to be applicable to the more general case of incomplete tables.
III. PROCESSING COMPLETE TABLES
In this section, we detail our contribution to the GPU computation of messages in BP, exploiting the fact that, in such a problem, tables are complete. In particular, we first discuss how we preprocess potential tables in order to index their rows efficiently and achieve coalesced memory accesses. Then, this table layout is exploited by the CUDA kernel, which executes the actual message passing phase of BP through highly efficient routines.
A. Table Preprocessing
Suppose we have to propagate new evidence from the potential table T_1 to the potential table T_2, respectively associated to two tuples of variables Q_1 = ⟨x_3, x_2, x_1⟩ and Q_2 = ⟨x_5, x_4, x_1⟩, with the shared variables Q_12 = Q_1 ∩ Q_2 = ⟨x_1⟩. We assume that x_1, x_3, and x_5 are binary variables, while x_2 and x_4 can assume three values. In the approach by Zheng and Mengshoel [18], each row of the separator table Sep_12 is assigned to a different block of threads, which is responsible for the reduction of the rows of T_1 with a matching variable assignment and for the subsequent scattering on matching rows in T_2. In Fig. 5, rows associated to different blocks of threads have been marked in different colors, i.e., white and gray for x_1 = 0 and x_1 = 1, respectively. The organization of input data provided by these tables is undesirable for GPU architectures. In fact, threads responsible for the computation of white rows cannot access consecutive memory addresses, as their data is interleaved with gray rows, thus breaking memory coalescence. Moreover, even if the computation of white rows requires half of the input data, its sparsity forces us to transfer the entire tables to the global memory before starting the algorithm. We propose to solve these issues by means of a preprocessing phase, in which rows associated to the same row in Sep_12 (i.e., rows of the same color, in the above example) are stored at consecutive addresses in the corresponding potential tables, as shown in Fig. 6. Threads responsible for white rows execute coalesced memory accesses, and start the computation while gray rows are still being transferred to the GPU. Each block of threads easily retrieves its input data without any costly mapping table, in contrast with the approach by Zheng and Mengshoel [18].
Consider Fig. 6, resulting from a permutation σ of Q_1 in which the variables in Q_12 are brought to the most significant (MS) positions in Q_p = σ(Q_1). In this way, we can ensure that rows with the same assignment of the variables in Q_12 form a contiguous chunk of memory.
1) Table Indexing: Our approach exploits the fact that, in complete tables, rows are arranged in lexicographic order, so it is not necessary to explicitly store R: the variable assignment corresponding to a row can be computed on-the-fly from its index k by interpreting k as a mixed-radix number whose digits are the values of the variables in Q and whose radices are the corresponding domain sizes in d (and, conversely, the index can be recomputed from any assignment).
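As a concrete illustration, the following functions (a sketch under the assumption that the variables in Q are stored MS-first and rows are enumerated in lexicographic order) convert between a row index and the assignment it encodes; names and signatures are ours and are not part of the original implementation.

```cpp
// Decode the variable assignment encoded by row index k of a complete table.
// d[0..n-1] are the domain sizes of the variables in Q, most significant first:
// the digit of the last (least significant) variable is k % d[n-1], and so on.
__host__ __device__
void indexToAssignment(long long k, const int* d, int n, int* assignment) {
    for (int i = n - 1; i >= 0; --i) {
        assignment[i] = static_cast<int>(k % d[i]);
        k /= d[i];
    }
}

// Inverse mapping: given an assignment, compute its row index.
__host__ __device__
long long assignmentToIndex(const int* assignment, const int* d, int n) {
    long long k = 0;
    for (int i = 0; i < n; ++i)
        k = k * d[i] + assignment[i];
    return k;
}
```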
As mentioned before, to maintain a coherent representation of the preprocessed table T_p = ⟨Q_p, d_p, R_p, φ_p⟩, the values in φ must be correctly permuted into φ_p.
2) Table Reordering: This section covers our approach to achieve the column reordering detailed in Section III-A. As mentioned before, we do not store R, since each row r ∈ R can be retrieved from its index with the above detailed technique, hence the computation of R_p will not be covered. On the other hand, for any value φ[k] at index k in φ, it is necessary to compute its index k_p in the preprocessed table. A naive way to do so is to decode the assignment encoded by k, permute it, and re-encode it according to Q_p, which requires a number of operations proportional to |Q| for each row. In what follows, we show a more efficient approach to calculate k_p. For simplicity, we first explain how to compute the index resulting from swapping the variables at positions i and j. Then, we provide an algorithm to compute k_p by means of a sequence of swaps.
Proposition 1 provides a closed-form expression for the index of a row after swapping the variables at positions i and j, computed directly from the original index k and the domain sizes in d; its proof is in the supplementary material. Proposition 1 is then used to reorder any potential table T according to the layout detailed in Section III-A. More formally, let S = ⟨S_1, . . . , S_n⟩ be a sequence of n swaps, each represented by an ordered pair ⟨a_i, b_i⟩ of column positions, assuming that a_i > b_i for every pair. Algorithm 2 applies this sequence of swaps to the index of every value in φ, storing each value at its permuted position in φ_p. We now analyze the impact of this preprocessing phase on the overall performance of the algorithm, by showing how it is more efficient than the naive approach mentioned above.
3) Computational Complexity: Proposition 2: Algorithm 2 has a time complexity of O(|φ| · |S|) ≤ O(|φ| · |Q_12|/2) < O(|φ| · |Q|).
Proof: The proof is in the supplementary material.
In our experimental evaluation, we performed the variable ordering with an average of |S| = 3 swaps, resulting in an improvement of an order of magnitude with respect to the naive approach, which, in contrast, requires tens of operations for each row. It is important to note that this preprocessing phase is done once and for all, while compiling the BN into the corresponding JT. In fact, the acquisition of new evidence does not change the structure of the network itself, hence we can avoid reordering each potential table at every BP run by storing and updating the pair of corresponding reordered tables for each separator. Even if Algorithm 2 does not reorder φ in place, the additional space required to store φ_p is amortized by discarding the original table, since it is not needed in any subsequent phase of the algorithm. Furthermore, each iteration of the external loop (lines 1-7) of Algorithm 2 is independent and can be computed in parallel. As a consequence, the worst-case time complexity of the parallel version of Algorithm 2 is O(|φ||S|/t), where t is the number of threads. Given a JT = (V, E), our algorithm needs to store a pair of potential tables for each separator. Since threads can index input rows on-the-fly, mapping tables can be avoided. Thus, the memory requirements are O(2 · |E|) tables. In contrast, the approach proposed by Zheng and Mengshoel [18] maintains one potential table for each clique, but it needs two mapping tables for each separator. Hence, it requires O(|V| + 2 · |E|) tables.
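To illustrate how the parallel version of Algorithm 2 can be organized, the following CUDA kernel sketch assigns one thread per value of φ and obtains the permuted index by decoding the assignment, applying the swaps, and re-encoding it with the permuted domain sizes. This decode/re-encode shortcut replaces the closed-form expression of Proposition 1 and, together with the variable names and the 32-variable bound, is an assumption of this sketch rather than the actual implementation.

```cpp
// One thread per value of phi: compute the permuted index and scatter the value.
// d:  domain sizes in the original column order (MS first)
// dp: domain sizes after applying the swaps in S
// S:  sequence of column swaps (a_i, b_i), applied in order
__global__ void reorderValues(const float* phi, float* phi_p,
                              const int* d, const int* dp,
                              const int2* S, int numSwaps,
                              int numVars, long long numRows) {
    long long k = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (k >= numRows) return;

    int a[32];                       // assumes at most 32 variables per table
    // Decode the assignment encoded by index k (mixed-radix decomposition).
    long long tmp = k;
    for (int i = numVars - 1; i >= 0; --i) { a[i] = (int)(tmp % d[i]); tmp /= d[i]; }

    // Apply the sequence of column swaps to the assignment.
    for (int s = 0; s < numSwaps; ++s) {
        int t = a[S[s].x]; a[S[s].x] = a[S[s].y]; a[S[s].y] = t;
    }

    // Re-encode with the permuted domain sizes to obtain the new index k_p.
    long long kp = 0;
    for (int i = 0; i < numVars; ++i) kp = kp * dp[i] + a[i];

    phi_p[kp] = phi[k];              // scatter the value to its new position
}
```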
B. GPU Kernel Implementation
In our approach to BP on GPUs, each block of threads is responsible for one element of the separator table, which is associated to a corresponding group of rows in the potential tables. Such high-level organization of the computation allows us to carry out the entire reduction and scattering stages within a single thread block, hence avoiding any costly interblock synchronization structure. On the one hand, the performance of our algorithm clearly benefits from the lack of interdependence among different blocks, which would reduce the overall computation parallelism. On the other hand, since the size of thread blocks has an intrinsic limit imposed by the hardware architecture (e.g., 2048 threads in Kepler GPUs), the proposed organization may serialize part of the workload if the number of rows to manage exceeds such limit. Nevertheless, such an issue is not problematic in our test cases, since the above mentioned case rarely occurs. In fact, in our experimental evaluation, each block has to reduce an average of 14 elements, hence allowing a full parallelization.
If the serialization is small (i.e., each thread has to reduce and scatter few rows), the effect on the overall performance is negligible. This is due to the fact that the task is computed extremely efficiently in thread-private memory space using registers. In what follows, we explain the actual implementation of the above mentioned concepts in detail.
1) Reduction:
Once the input data is in the shared memory, the kernel starts the reduction phase that, in our approach, is implemented with the NVIDIA CUB library by means of a block reduce raking algorithm. The algorithm consists of three steps: 1) an initial sequential reduction in registers (if each thread contributes more than one input), in which warps other than the first one place their partial reductions into shared memory; 2) a second sequential reduction in shared memory, in which threads within the first warp accumulate data by raking across segments of shared partial reductions; and 3) a final reduction within the raking warp, based on the Kogge-Stone algorithm [27], produces the final output. This scheme is particularly efficient, since it involves a single synchronization barrier after the first phase and it incurs zero bank conflicts for primitive data types. On newer CUDA architectures (e.g., NVIDIA Kepler), such implementation exploits shuffle instructions, which are a new set of primitives provided by the CUDA programming language. Shuffle instructions enable threads within the same warp to exchange data through direct register accesses, hence avoiding shared memory accesses and improving the computational throughput of the algorithm.
In particular, this scheme is collectively performed by the block of threads associated to a particular element of the separator table, in order to compute its updated value as the sum of the corresponding rows of the first potential table, i.e., the ones with a matching variable assignment. Once the reduction of the entire chunk has been completed, the first thread computes the value of R_ij assigned to the considered block, which serves as input for the subsequent scattering phase of BP.
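A minimal sketch of how such a block-wide sum can be expressed with CUB is reported below; the kernel signature, the fixed block size, and the chunk-descriptor arrays are simplifying assumptions, while the contiguous chunk layout is the one produced by the preprocessing phase.

```cpp
#include <cub/cub.cuh>

// Each block sums one contiguous chunk of the reordered potential table, i.e.,
// the rows matching one separator entry, and stores the per-chunk result.
template <int BLOCK_THREADS>
__global__ void reduceChunks(const float* T1p_values,
                             const int* chunkStart,   // first row of each chunk
                             const int* chunkLen,     // rows per chunk
                             float* sepOut) {
    typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp;

    int start = chunkStart[blockIdx.x];
    int len   = chunkLen[blockIdx.x];

    // Each thread accumulates a strided portion of the chunk in registers;
    // reads are coalesced because the chunk occupies consecutive addresses.
    float local = 0.0f;
    for (int i = threadIdx.x; i < len; i += BLOCK_THREADS)
        local += T1p_values[start + i];

    float sum = BlockReduce(temp).Sum(local);  // raking block-wide reduction
    if (threadIdx.x == 0) sepOut[blockIdx.x] = sum;
}
```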
2) Scattering: The final stage of BP consists of the scattering operation, which performs the actual update of T_2^p by means of the R_ij computed in the above mentioned phase. The implementation of such operation benefits from the proposed memory layout, since it is realized with maximum parallelism and computational throughput. Each row of T_2^p is assigned to one thread, which multiplies its current value by R_ij, computed in the reduction phase. Once the kernel has been executed by all blocks, the propagation of beliefs has completed the inclusion of the new evidence in T_2^p, which can be finally transferred back to the CPU memory.
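The scattering step can be sketched analogously: each thread owns one row of the reordered table and multiplies its value by the ratio of its chunk. The per-row chunk-id array and the precomputed ratio array are assumptions of this sketch (in the actual organization, each block already knows which separator element it serves).

```cpp
// Multiply every row of the second reordered potential table by the ratio
// R_ij (new separator value / old separator value) of its chunk.
__global__ void scatterChunks(float* T2p_values,
                              const int* rowToChunk,  // chunk id of each row
                              const float* ratio,     // R_ij per separator entry
                              long long numRows) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i >= numRows) return;
    T2p_values[i] *= ratio[rowToChunk[i]];  // coalesced read-modify-write
}
```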
Notice that our method of computing the message passing phase of BP does not affect the semantics of the algorithm, hence the results computed by our technique are correct.
It is important to note that all the techniques described in the current section are based on the fact that, in BP on JTs, tables are complete, i.e., they contain all the possible variable assignments. Therefore, this approach cannot be directly applied to solve COPs with BE, where, as explained in Section II-A, constraints make some variable assignments unfeasible and, hence, tables are incomplete. To overcome this issue, we propose a generalized approach able to process incomplete tables, as explained in the following section.
IV. PROCESSING INCOMPLETE TABLES
In this section, we detail our contribution to the parallelization of the BE algorithm to solve COPs. We propose a novel method to implement the join sum and the maximization operations (line 13 of Algorithm 1) on GPUs, to tackle their computational complexity and speed up the execution of BE.
A. Table Preprocessing
In order to explain our approach, we contextualize it on the example introduced in Section II-A1 (Fig. 1). The goal is to rearrange the rows of these tables so as to obtain the same final placement discussed in Section III-A, since the current data organization suffers from very poor data locality.
In particular, the preprocessing of these tables aims at achieving two fundamental requirements: 1) rows of the same color should be at consecutive memory addresses, to have full coalescence in memory accesses and to reduce the sparsity of data, and 2) colored groups should be in the same order (considering the set of shared variables) in both tables, to locate them efficiently when computing the join sum result. This is required since tables can be incomplete. We achieve these objectives by means of Algorithm 3. The first step of the preprocessing phase requires moving the |Q_12| columns corresponding to the shared variables to the |Q_12| least significant (LS) places. Notice that this step is an embarrassingly parallel [10] task, and it can be trivially divided among |R| threads, each independently processing a single row. Subsequently, the algorithm reorders R and φ by means of an LSD radix sort algorithm [15] (implemented with the NVIDIA CUB library). It is not necessary to adopt a radix sort algorithm in this phase (as every sorting algorithm that operates on the basis of the LS |Q_12| places would work). However, we decide to use such an algorithm since it can be parallelized very efficiently [15]. As a final step, the algorithm removes the nonmatching groups of rows (white rows in Fig. 1), obtaining the tables shown in Fig. 7. Notice that none of these three steps requires an entire table to be stored in the global memory, thus it is possible to easily split the input tables into manageable chunks meeting the memory capabilities of the GPU and preprocess them (if a table larger than the GPU global memory must be sorted, it can be split into chunks, each sorted with the above mentioned radix sort algorithm, and the sorted chunks can then be merged on the CPU with a merge sort). We now discuss how it is possible to exploit this row layout to index these tables and have multiple threads efficiently locate their input to compute the join sum in parallel on the GPU.
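The key-building and sorting steps of this preprocessing can be sketched as follows with CUB's device-wide radix sort; packing the shared-variable values into a 32-bit sort key (which limits this sketch to four byte-sized shared variables) is an illustrative assumption, as any key encoding that orders rows by their shared-variable assignment would work.

```cpp
#include <cstdint>
#include <cub/cub.cuh>

// Build, for every row, a sort key formed by the values of the shared
// variables (occupying the least significant positions of the key).
__global__ void buildKeys(const uint8_t* rows, int rowWidth,
                          const int* sharedCols, int numShared,
                          unsigned int* keys, int numRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRows) return;
    unsigned int key = 0;
    for (int c = 0; c < numShared; ++c)        // pack shared-variable values
        key = (key << 8) | rows[r * rowWidth + sharedCols[c]];
    keys[r] = key;
}

// Sort (key, row-index) pairs with a parallel LSD radix sort; the resulting
// permutation is then used to gather rows and values into contiguous groups.
void sortByShared(unsigned int* d_keys, unsigned int* d_keysOut,
                  int* d_rowIdx, int* d_rowIdxOut, int numRows) {
    void* d_temp = nullptr; size_t bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, bytes, d_keys, d_keysOut,
                                    d_rowIdx, d_rowIdxOut, numRows);
    cudaMalloc(&d_temp, bytes);
    cub::DeviceRadixSort::SortPairs(d_temp, bytes, d_keys, d_keysOut,
                                    d_rowIdx, d_rowIdxOut, numRows);
    cudaFree(d_temp);
}
```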
B. Join Sum GPU Computation
Join sum implements the composition operation of BE. Our method adopts a gather paradigm [10] , in which each thread is responsible for the computation of exactly one element of the output. Such a paradigm offers many advantages with respect to the counterpart approach, i.e., the scatter paradigm, in which each thread is associated to one element of input and contributes to the computation of many output elements. Scatter-based approaches have a reduced independence of the operations and they require atomic primitives (which serialize parts of the computation) to avoid race conditions, when multiple input elements concurrently modify the same output element.
In our particular case, one thread computes one particular output row at index i (i.e., both R[i], the variable assignment part, and φ[i], the value part), on the basis of the two associated input rows, which can be identified as explained below. First, we compute the number of rows in each colored group for T_1 and T_2. As a result, we obtain a tuple H such that H[i] is the number of rows of the ith colored group. This operation can be seen as the computation of the histogram of the rows of the tables, which is a well-known primitive that can be parallelized very efficiently. In the above example, H_1 = ⟨1, 1, 2⟩ and H_2 = ⟨2, 2, 2⟩. These histograms are also useful to compute the number of rows of the result table, a crucial piece of information when we have to allocate the exact amount of memory to store the result. Each group of output rows has a number of elements equal to the product of the numbers of rows of the corresponding input groups. Hence, the histogram of the result table, namely H_⊕, is computed as the element-wise product (denoted as *, an embarrassingly parallel operation) of the input histograms. It is easy to verify that ⟨1, 1, 2⟩ * ⟨2, 2, 2⟩ = ⟨2, 2, 4⟩ is the histogram of the result table in Fig. 2.
The sum of the values of such histogram is the total number of rows of the result table.
These histograms also allow each thread to efficiently locate its input rows, as well as the index of the output row it is responsible for, by indexing the colored groups in T_1 and T_2. As a first step, we compute the exclusive prefix sum of the input and output histograms, which can be done very efficiently on the GPU [16] and, in our case, is implemented with the NVIDIA CUB library. Given a histogram H, its exclusive prefix sum P is a tuple in which each element P[i] represents the index of the first row of the ith colored group. With these data structures, each thread can compute its row in the join sum result. Algorithm 4 represents the actual kernel function executed by the GPU, which receives as inputs the histograms of T_1 and of the output (i.e., H_1 and H_⊕), as well as the corresponding prefix sum tuples (i.e., P_1, P_2, P_⊕). The variable assignment and the value parts of the output table are, respectively, denoted as R_⊕ and φ_⊕.
It is important to note the absence of divergence in Algorithm 4, thanks to the fact that the only branch instruction (line 3) is used to limit the number of running threads to the amount needed, i.e., H_⊕[bx]. For the sake of clarity, we made a number of simplifications in Algorithm 4. First, here we do not explicitly mention the use of shared memory, which is used to exploit the data reuse inherent to the join sum operation, so as to avoid unnecessary accesses to the global memory. The properties of these memory transfers between shared and global memory are discussed in Section V-A. Furthermore, we assume that each colored group of rows is computed by exactly one block of threads. In contrast, our actual implementation realizes a dynamic load balancing by assigning the appropriate number of groups to each block, in order to achieve a higher GPU occupancy and computational throughput. This number is determined by the amount of shared memory of the GPU and the maximum number of threads per block. It is possible that a single group of rows is larger than the available shared memory: this case is managed by splitting such a group into a number of subgroups, once again with the objective of maximizing the GPU occupancy.
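The following kernel gives a simplified, self-contained version of this gather-based join sum (in the spirit of Algorithm 4, without shared memory and load balancing); the explicit H_2 argument, the row layout with the shared columns stored first, and the way the within-group index is split into a pair of input rows are assumptions of this sketch.

```cpp
#include <cstdint>

// One block per colored group, one thread per output row of that group.
// H1, H2, Hout: histograms (rows per group); P1, P2, Pout: their exclusive
// prefix sums, i.e., the index of the first row of each group.
__global__ void joinSumKernel(const uint8_t* R1, const float* phi1,
                              const uint8_t* R2, const float* phi2,
                              uint8_t* Rout, float* phiOut,
                              const int* H1, const int* H2, const int* Hout,
                              const int* P1, const int* P2, const int* Pout,
                              int w1, int w2, int wShared, int wOut) {
    int g = blockIdx.x;                  // colored group handled by this block
    int t = threadIdx.x;                 // output element within the group
    if (t >= Hout[g]) return;            // only H1[g]*H2[g] threads do work

    int i1 = P1[g] + t / H2[g];          // matching row in T1
    int i2 = P2[g] + t % H2[g];          // matching row in T2
    int io = Pout[g] + t;                // output row index

    phiOut[io] = phi1[i1] + phi2[i2];    // join sum of the values

    // Copy the assignment: all of T1's row, then T2's non-shared columns
    // (assumed here to follow its wShared shared columns).
    for (int c = 0; c < w1; ++c)
        Rout[io * wOut + c] = R1[i1 * w1 + c];
    for (int c = wShared; c < w2; ++c)
        Rout[io * wOut + w1 + (c - wShared)] = R2[i2 * w2 + c];
}
```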
C. Maximization GPU Computation
In this section, we describe how we implement the maximization (which corresponds to the marginalization operation of BE) on GPUs, exploiting the data layout discussed in Section IV-A. In particular, we adopt the same preprocessing phase detailed by Algorithm 3, with the sole difference that the set of shared variables is represented by all the variables in the scope of the table, excluding the one we want to marginalize. Intuitively, this corresponds to moving such a variable to the MS place. In this case, if we compute the histogram H and the exclusive prefix sum P as previously described, we are able to index the groups of rows that must be considered when computing the maximum for the output, denoted by R_max and φ_max for the variable assignment and the value part, respectively. Algorithm 5 implements the maximization kernel. Algorithm 5 shows a simplified version of our actual implementation, in which we generate the appropriate number of threads and blocks on the basis of the size of the input. In contrast with the join sum operation, we do not have any data reuse (i.e., each input row is accessed by exactly one thread), hence the use of shared memory is not necessary. As a final performance remark, notice that the maximization of the H[tx] elements at line 5 is sequentially executed by the thread. Nevertheless, this aspect has a negligible impact on the computational throughput of our approach, since H[tx] depends on the size of the domain of the marginalized variable and, in our experiments, it is usually a small value. When the considered variable has a binary domain, line 5 collapses to one single max operation.
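A corresponding simplified sketch of the maximization kernel (in the spirit of Algorithm 5) is shown below; one thread maximizes over the rows of one group, whose extent is given by the histogram and its prefix sum, and the position of the marginalized column in the row layout is an assumption of this sketch.

```cpp
#include <cstdint>

// One thread per output row: take the maximum value over the H[t] input rows
// that share the same assignment of the remaining (non-marginalized) variables.
__global__ void maximizeKernel(const uint8_t* R, const float* phi,
                               uint8_t* Rmax, float* phiMax,
                               const int* H, const int* P,
                               int rowWidth, int numGroups) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numGroups) return;

    int start = P[t];
    float best = phi[start];
    for (int i = 1; i < H[t]; ++i)   // at most d_i iterations, usually very few
        best = fmaxf(best, phi[start + i]);
    phiMax[t] = best;

    // Copy the surviving assignment: the marginalized variable is assumed to
    // occupy the last column after preprocessing, so it is simply dropped.
    for (int c = 0; c < rowWidth - 1; ++c)
        Rmax[t * (rowWidth - 1) + c] = R[start * rowWidth + c];
}
```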
Notice that Algorithms 4 and 5 correctly implement the ⊕ and ⇓ operations of BE, hence preserving its optimality.
V. DATA TRANSFERS
In the following sections we detail how our technique allows an optimized management of data transfers, thanks to full memory coalescence and pipelining.
A. Global-Shared Memory Transfers
The memory hierarchy in GPUs follows a widely adopted design in modern hardware architectures, in which very fast but small-size memories (i.e., registers, cache, and shared memory), which are designed to assist high-performance computations, are stacked above a slower but larger memory (i.e., global memory), suitable to hold large amounts of rarely accessed data. In particular, shared memory resides on each SM and can deliver 32 bits per two clock cycles. To increase performance, it is mandatory to exploit such a low-latency memory to store information that needs to be used very often. On the other hand, accessing global memory is particularly costly (400-800 clock cycles), and should be minimized to achieve a good compute-to-memory ratio.
A common programming pattern suggests splitting input data into tiles that fit into shared memory (i.e., 48 KB of information) and completing all the computational tasks using only such data. This minimizes global memory accesses for each kernel execution. Coalesced accesses are the optimal way to carry out such data transfers, and are related to the principle of spatial locality of information.
Memory coalescing refers to combining multiple transfers between global and shared memory into a single transaction, so that every successive 128 bytes (i.e., 32 single precision words) can be accessed by a warp (i.e., 32 consecutive threads) in a single transaction. In general, sparse or misaligned data organization may result in uncoalesced loads, serializing memory accesses and reducing the performance, while consecutive and properly aligned data chunks enable full memory coalescing (Fig. 8). Thanks to the previously explained preprocessing phase, the portion of input data needed by each thread block is read from global memory with fully coalesced memory accesses, since such data is already organized at consecutive addresses. The transfers are further optimized using vectorized memory accesses (which compile to single LD.E.128 and ST.E.128 instructions, transferring chunks of 128 bits at a time) in order to increase bandwidth, reduce instruction count, and improve latency.
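As an example, loading the value column of one contiguous chunk from global to shared memory with 128-bit vectorized accesses can be sketched as follows; the float4 reinterpretation assumes the chunk is 16-byte aligned and its length is a multiple of four, which a real implementation would guarantee with padding.

```cpp
// Copy a contiguous chunk of values from global to shared memory using
// float4 (128-bit) loads: each fully coalesced warp transaction moves
// 32 * 16 bytes. Launch with dynamic shared memory of chunkLen * sizeof(float).
__global__ void loadChunk(const float* __restrict__ gValues,
                          int chunkStart, int chunkLen /* multiple of 4 */) {
    extern __shared__ float sValues[];
    const float4* src = reinterpret_cast<const float4*>(gValues + chunkStart);
    float4* dst = reinterpret_cast<float4*>(sValues);

    for (int i = threadIdx.x; i < chunkLen / 4; i += blockDim.x)
        dst[i] = src[i];               // coalesced, vectorized copy
    __syncthreads();
    // ... the block then computes on sValues ...
}
```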
B. Host-Device Data Transfers
The table layout presented in Sections III and IV allows tables to be split into several data segments and threads to independently operate on each segment. This leads to a twofold improvement.
1) We devise a pipelined flow of smaller copy-and-compute operations, by amortizing the cost of CPU-GPU data transfers on the overall algorithm performance. 2) We can process tables that do not fit into global memory, by breaking them into more manageable chunks. This allows our approach to perform BE even on problems that were intractable for previous approaches [17] , [18] .
1) Pipelining:
The standard pattern of GPU computation requires the whole input to be transferred to the global memory before starting the kernel execution. The results are then copied back to the host memory. Such a synchronous approach can be improved if the kernel can start on a partial set of input data, while the copy process is still running. Fig. 9 shows the proposed pipelined model of computation, in which a single GPU message passing operation has been split into four stages (marked by different colors). Each computation kernel K_i executes as soon as the corresponding input data subset has been transferred by means of H → D_i. This solution applies to GPU architectures that feature only one copy engine (i.e., data between host and device can be transferred through a single channel only). Data segments for table processing are necessarily serialized, thus allowing overlapping between one kernel execution and one data transfer only. In our experiments focusing on BP, we found that, on average, this approach achieves a performance improvement of 50% with respect to synchronous data transfers. The most recent and advanced GPUs (e.g., NVIDIA Kepler) feature an additional copy engine, which enables a further degree of parallelism between data transfers and computation. On these architectures, our approach exploits the supplementary channel to overlap input and output data transfers (see Fig. 10). Such a fully pipelined model of computation achieves, on average, a performance improvement of 75% with respect to synchronous data transfers, measured in our experiments on WCSPs.
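The copy-and-compute pipeline can be sketched with CUDA streams as follows; the round-robin stream assignment, the uniform chunk size, and the placeholder kernel are assumptions of this sketch, and the host buffers are assumed to be page-locked so that the asynchronous copies can actually overlap with computation.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Trivial placeholder body standing in for the reduce/scatter work on a chunk.
__global__ void processChunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Overlap host->device copies, kernel execution, and device->host copies by
// issuing each chunk into its own CUDA stream (round robin). h_in and h_out
// must be page-locked (cudaMallocHost) for the async copies to overlap.
void pipelinedMessagePass(const float* h_in, float* h_out,
                          float* d_in, float* d_out,
                          int numChunks, int chunkElems, int numStreams) {
    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t s = streams[c % numStreams];
        size_t off   = static_cast<size_t>(c) * chunkElems;
        size_t bytes = static_cast<size_t>(chunkElems) * sizeof(float);

        cudaMemcpyAsync(d_in + off, h_in + off, bytes,
                        cudaMemcpyHostToDevice, s);            // H -> D_c
        processChunk<<<(chunkElems + 255) / 256, 256, 0, s>>>(d_in + off,
                                                              d_out + off,
                                                              chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, bytes,
                        cudaMemcpyDeviceToHost, s);            // D -> H_c
    }
    for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```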
2) Large Tables Processing: Our technique can be applied to execute BE-based algorithms even when tables do not fit into the GPU global memory, by splitting large tables into smaller data structures. In particular, this division is achieved by computing the maximum number of kernels, namely max_s, which can execute at the same time without exceeding the memory capabilities of the device. In our implementation, max_s is dynamically determined at runtime as the maximum number of kernels whose total amount of input and output data can be stored into global memory. In addition, we also take into account the space constraints deriving from the use of shared memory (see Section IV-B), by enforcing that single colored chunks of data can fit in such memory. Fig. 11 shows an example, in which max_s = 2. Each kernel K_i is enqueued in stream i mod max_s. Transaction H → D_3 cannot be scheduled in parallel with D → H_2 (unlike the example of Fig. 10), as it would violate the above mentioned memory constraint. Thus, one time slot is skipped in order to complete the copy D → H_1 and to free an adequate amount of memory before starting H → D_3. The serialization of these two operations is a direct consequence of their execution in the same stream (i.e., stream 1). Even though the hardware constraints limit the size of data that can be processed at once, the proposed approach allows oversized tables to be processed in multiple steps, improving scalability. As an example, 5 out of 7 instances in the considered WCSP dataset (see Section VI-B) cannot be solved without this capability of handling large tables.
VI. EXPERIMENTAL RESULTS
In order to appropriately evaluate our approach presented in Sections III and IV, we conducted two different sets of experiments, discussed, respectively, in Sections VI-A and VI-B. First, we discuss the results obtained executing BP on JTs (using complete tables); then, we test our approach that handles incomplete tables to solve WCSPs. Our approach is implemented in CUDA and all our experiments are run on a machine with an AMD A8-7600 processor, 16 GB of memory, and an NVIDIA Tesla K40.
A. BP on JTs
In this section we benchmark the approach described in Section III, i.e., CUDA-BP, which exploits the completeness of tables to achieve a better indexing of potential tables when executing BP on JTs. We compare our approach with the best approach (i.e., the SVR regression model) published by Zheng and Mengshoel [18], using the authors' implementation. We use the same BN dataset, which comprises various BNs with heterogeneous structures and variable domains. We compile each BN into a JT, which is then used as input for both approaches in order to guarantee a fair comparison. Table I details some features of our JTs, i.e., the number of JT nodes resulting from their compilation and the minimum, maximum, and average size of the potential and separator tables. Following Zheng and Mengshoel [18], the compilation of these networks into the corresponding JTs has been done offline, before the execution of the BP algorithm. For this reason, it has been excluded from the runtime measurements. Table II reports the runtime in milliseconds corresponding to the following phases of the BP on JTs algorithm: 1) the total time required to complete all the reduce and scatter phases in the sequential version; 2) the total time required to preprocess all potential tables using our technique; 3) the total time required to complete the data transfers between the host and the device; 4) the total time required to complete all the reduce and scatter phases in our GPU approach; 5) our GPU speedup; 6) the total time required to complete all the reduce and scatter phases in the GPU approach by Zheng and Mengshoel [18] based on the SVR regression model; and 7) Zheng and Mengshoel's speedup [18]. Since the preprocessing phase must be done only once and can be avoided when any new evidence is received and propagated, it has not been considered in the calculation of the speedup. Moreover, we do not consider the runtimes relative to data transfers in such calculation, as such time is amortized thanks to our pipelining technique (Fig. 10). For a fair comparison, transfers are also excluded when calculating the speedup for Zheng and Mengshoel's approach [18]. In our tests, our algorithm outperforms the counterpart in the majority of the instances, i.e., all except the Water network, where runtimes are comparable. In more detail, our approach achieves speedups at least 56% higher than the counterpart on the Barley network (i.e., 33.03× versus 21.14×). Our best improvement with respect to the counterpart happens on the Mildew network, where our approach runs 39× faster than the CPU version, and it produces a GPU speedup that is 466% higher than the counterpart. In general, our approach produces speedups that increase when the average potential table size increases (see Table I). In fact, we achieve speedups of less than 10× only with small instances (i.e., Munin 2 and Munin 3).
B. WCSPs
WCSPs involve incomplete tables, as they contain some unfeasible variable assignments. Thus, we apply the approach in Section IV. We consider the SPOT5 dataset [14] , a standard dataset that models the problem of managing an Earth observing satellite as a WCSP, to maximize the importance of the captured images, while satisfying some feasibility constraints.
The main objective of these experiments is to evaluate the speedup that can be achieved adopting our parallel approach, which is compared to a sequential BE version that uses a simple implementation of the join sum and the maximization operations. Our speedup is compared with the one achieved by the approach by Fioretto et al. [24] (i.e., the most recent GPU implementation of the BE algorithm), considering the same sequential BE implementation. The counterpart approach is run using the source code provided by the authors. Table III shows the runtime in seconds (including preprocessing and data transfers) needed to solve the instances of our reference domain by our parallel approach, i.e., CUDA-BE, compared to its sequential version, i.e., BE. The table also reports the number of variables and the induced width of these instances. The results show that CUDA-BE provides a speedup of at least two orders of magnitude with respect to the sequential algorithm, reaching a maximum of 696.02×. Such speedup increases consistently with the size of the instances (i.e., the induced width and the number of variables), showing that the proposed approach correctly exploits the increased amount of parallelism in bigger tables. In fact, the speedup provided by CUDA-BE monotonically increases in the first three WCSP instances (i.e., 54, 29, and 404), in which both the number of variables and the induced width increase. On the other hand, such speedup decreases on instance 503, which, despite having a larger number of variables, is characterized by a lower induced width. Notice that the induced width has a stronger influence on the complexity of the problem [2]. The ability of our method to handle large tables is crucial in this scenario. In fact, 5 out of 7 instances (i.e., 404, 503, 42b, 505b, and 408b) cannot be solved without this feature, as their tables exceed the amount of GPU memory. Finally, notice that our approach outperforms the approach by Fioretto et al. [24] both in terms of runtime and scalability. On the one hand, our approach is, on average, 3.21× faster than the counterpart in the solution of the first four instances. On the other hand, the counterpart approach cannot handle the three biggest instances of the SPOT5 dataset, probably due to the different representation of the tables in memory.
VII. CONCLUSION
In this paper, we considered the BE framework and proposed an efficient and scalable highly parallel approach that is able to harness the computational power of modern GPUs by means of an appropriate data organization in memory. The proposed approach also applies to problems involving incomplete tables, i.e., tables that do not contain all variable assignments, as well as to problems that do not fit into the global memory of the GPU. Furthermore, it enables pipelined data transfers between host and device, thus further improving performance. Our experimental results show that our approach outperforms the state-of-the-art approach for BP on JTs proposed by Zheng and Mengshoel [18], obtaining speedup improvements ranging from +56% to +466% over that technique. The tests on WCSPs confirmed the ability of our technique to achieve a high computational throughput (reaching a speedup of 696.02× with respect to the CPU version), and proved the importance of its ability to process large tables, a necessary feature to solve these instances.
Future work will look at integrating the proposed GPU techniques in other algorithmic frameworks, such as AND/OR search-based approaches [21] , in which mini-BE heuristics [2] are used to guide the search.
