In this paper, we present parallel multilevel algorithms for the hypergraph partitioning problem. In particular, we describe schemes for parallel coarsening, parallel greedy k-way refinement and parallel multi-phase refinement. Using an asymptotic theoretical performance model, we derive the isoefficiency function for our algorithms and hence show that they are technically scalable when the maximum vertex and hyperedge degrees are small. We conduct experiments on hypergraphs from six different application domains to investigate the empirical scalability of our algorithms both in terms of runtime and partition quality. Our findings confirm that the quality of partition produced by our algorithms is stable as the number of processors is increased while being competitive with those produced by a state-of-the-art serial multilevel partitioning tool. We also validate our theoretical performance model through an isoefficiency study. Finally, we evaluate the impact of introducing parallel multi-phase refinement into our parallel multilevel algorithm in terms of the trade off between improved partition quality and higher runtime cost.
Introduction
Intelligent a priori data partitioning enables the efficient parallelisation of many sparse irregular problems by reducing interprocessor communication while maintaining computational load balance. Graph and hypergraph partitioning decomposition models have been widely used in this context, in particular in the fields of VLSI circuit design [3, 36, 46] and matrix decomposition for parallel computation [4, [12] [13] [14] [15] 20, 21, 50, 51] . Hypergraph models are in general preferred to graph models due to their greater expressiveness that overcomes the well documented limits of the graph partitioning approach [12, 13, 20, 31, 32] .
Serial graph and hypergraph partitioning algorithms have been studied extensively [2, 3, 5, 11, 17, 23, 24, 27, 33, 35, 36, 38, 40, 42, 44, 45] . Many of these are based on the multilevel approach [2, 5, 33, 35, 36, 38, 40] , which has three main phases. Firstly, during the coarsening phase, the original (hyper)graph is coarsened to successively smaller (hyper)graphs. Next, during the initial partitioning phase, the smallest (hyper)graph in this sequence is partitioned. Finally, during the uncoarsening phase, this partition is projected back through the sequence of successively larger (hyper)graphs onto the original (hyper)graph, with heuristic refinement applied at each step.
Serial hypergraph partitioning algorithms are limited by the computing power and memory capacity of a single workstation. In this paper, we address this limitation by describing the first parallel algorithms for the hypergraph partitioning problem. Our parallel algorithms are based on the multilevel paradigm, specifically a parallel coarsening algorithm as well as two parallel refinement algorithms -a parallel direct k-way scheme and its parallel multi-phase adaptation. We also demonstrate scalability of the parallel multilevel algorithms under a theoretical performance model and the assumption of low maximum vertex and hyperedge degrees. The parallel coarsening algorithm, the parallel direct k-way refinement algorithm and their theoretical scalability analysis were first described in our earlier conference papers [48, 49] . These are presented here in distilled and refined form to make the paper self-contained. We further evaluate the empirical performance of our parallel algorithms on a Beowulf cluster, partitioning a number of hypergraphs from application domains ranging from biology to VLSI circuit design. We test the scalability of our parallel algorithms' runtime and partition quality, and estimate the expected improvement in partition quality when applying parallel multi-phase refinement.
The remainder of this paper is organised as follows. Section 2 outlines the preliminaries and background material on serial hypergraph partitioning algorithms. Section 3 presents our parallel multilevel hypergraph partitioning algorithms and theoretical scalability analysis. Section 4 presents the experimental evaluation and Section 5 concludes.
Related Work

Problem Definition
Formally, a hypergraph is a set system (V, E) on a set V , here denoted H(V, E), such that E ⊂ P(V ) \ {∅}, where P(V ) is the power set of V [7] . We call V the set of vertices and E the set of hyperedges. Hypergraphs arising from partitioning problems are weighted, in that a scalar weight is associated with each vertex and hyperedge. When every hyperedge in a hypergraph has cardinality two, the resulting set system is better known as a graph. When partitioning problems lead to identical hyperedges in the hypergraph model, we replace these with a single hyperedge whose weight we set to be the sum of the weights of the identical hyperedges.
We say that a hyperedge e (vertex v) is incident on a vertex v (hyperedge e) if and only if v ∈ e. Vertices u, v ∈ V are adjacent if and only if there exists a hyperedge e ∈ E such that u ∈ e and v ∈ e. The degree of a vertex (hyperedge) is the number of hyperedges (vertices) incident on that vertex (hyperedge). The vertex-edge incidence matrix [6] The vertices in a hyperedge are also called its pins and the total number of pins in the hypergraph is given by the number of non-zeros in the incidence matrix A. Hypergraph partitioning seeks a partition of the hypergraph that optimises an objective function subject to balance constraints. A k-way (k > 1) partition Π ⊂ P(V ) of the hypergraph H(V, E) is a finite collection of subsets of V (or parts), such that Π = {P 1 , . . . , P k }, P i ∩ P j = ∅ for all 1 ≤ i < j ≤ k and
In the domain of VLSI Computer-Aided Design, a common problem involves dividing system components into clusters such that cluster interconnect is minimised. The vertices of the hypergraph can be used to represent the components of the circuit and the hyperedges can be used to represent the nets connecting these components [46, 3] . Similarly, the decomposition of a sparse matrix across processors for parallel sparse matrix-vector multiplication may be modelled by a number of hypergraph models that correctly quantify the total communication volume [12] [13] [14] [15] 50, 51] .
The partitioning objective function f o (Π) is usually defined to be a cut metric on the hyperedges. We say that a hyperedge e ∈ E is cut by a partition Π if there exist at least two vertices v, w ∈ e such that they have been allocated to distinct parts. The number of distinct parts that the vertices of the hyperedge have been allocated to gives the number of parts spanned by the hyperedge. The two objective functions most commonly occuring in hypergraph partitioning applications are the hyperedge cut (defined in equation 2) and the k − 1 (defined in equation 3) objectives.
In equations 2 and 3, λ i denotes the number of parts spanned by hyperedge e i under the partition Π, while w(e i ) represents the weight of hyperedge e i . Note that when a two-way partition is sought (often called (hyper)graph bisection in partitioning literature), the k − 1 objective reduces to the (hyper)edge cut objective. We note further that, from a partitioning point of view, it is safe to ignore hyperedges with cardinalities less than two.
The partitioning constraint in defined in terms of part weights. The weight W i of a part P i ∈ Π is given by the sum of the weights of its constituent vertices. Given a prescribed balance criterion 0 < < 1, the goal is to find a partition Π = {P 1 , . . . , P k } such that:
Background on Hypergraph Partitioning Algorithms
Finding an optimal hypergraph bisection is NP-Hard [28] . It follows that the problem of finding an optimal k-way partition is also at least NP-Hard. Thus, research effort has been focused on developing polynomial-time heuristic algorithms that give good sub-optimal solutions. Performance of the algorithms in terms of run time and solution quality is usually evaluated using suites of benchmark hypergraphs [1, 9] . A k-way partition of a hypergraph H(V, E) is either constructed directly or by the recursive bisection of H(V, E). A comprehensive survey of different heuristic approaches to hypergraph partitioning is presented in [3] .
Iterative Improvement Algorithms
In hypergraph partitioning literature, iterative improvement algorithms have been preferred to other well-known optimisation techniques such as simulated annealing and genetic algorithms because they have the potential to combine good sub-optimal solutions with fast run times [3] . These begin with a feasible solution and iteratively move to the best neighbouring feasible solution. The algorithms terminate when they reach a feasible solution for which all neighbouring feasible solutions do not improve the objective function. The initial feasible solution can be randomly selected, or greedily constructed around a randomly chosen vertex.
Successful iterative improvement algorithms for hypergraph bisection have been based primarily on the Kernighan-Lin (KL) [42] or Fiduccia-Mattheyses (FM) algorithms [27] . These algorithms rely on a priority queue of vertex moves to greedily select the best vertex move (in the case of the FM algorithm) or the best vertex swap (in the case of the KL algorithm) in terms of the objective function. They proceed in passes, during each of which each vertex is moved at most once. Vertex moves resulting in negative gain are also possible, provided that they represent the best feasible move at that point. A pass terminates when none of the remaining vertex moves are feasible. The gain of a pass in terms of the objective function is then computed as the best partial sum of the gains of the individual vertex moves that are made during that pass. The algorithm terminates when the last completed pass does not yield a gain in the objective function. The low computational complexity of the FM algorithm (O(z) per pass, where z is the number of pins in the hypergraph) follows from the way in which the priority queue storing the remaining vertex moves is maintained during a pass. In practice, the FM algorithm converges in a few passes and is thus quoted to run in O(z) time.
The main disadvantage of the above algorithms is that they make vertex moves based solely on local information (the immediate gain of the vertex move). Enhancements to the basic algorithms that attempt to capture global properties or that incorporate look-ahead have been proposed [23] [24] [25] 44] . The FM algorithm has also been extended so that it can directly compute a k-way partition [45] (a k-way extension to the KL algorithm was first proposed in the original paper by Kernighan and Lin [42] ). In [17] , it was observed that this k-way formulation of the FM algorithm is dominated by the FM algorithm implemented in a recursive bisection framework; hence an enhanced k-way algorithm based on a pairwise application of the FM algorithm was proposed.
Multilevel Paradigm
So-called flat partitioning algorithms (i.e. those that operate directly on a given hypergraph) suffer substantial degradation in run time and solution quality as the size of the problem increases [3] . Algorithms based on the multilevel paradigm are therefore preferred to flat approaches [2, 5, 10, 33, 35, 36, 38, 40] . In the multilevel paradigm the original hypergraph H(V, E) is approximated by successively smaller hypergraphs 
− → H) and for each intermediate hypergraph (including the original hypergraph H) the partition is further refined using an iterative improvement algorithm during the uncoarsening or refinement phase.
The coarsening phase consists of a number of steps, during each of which a coarser representation
This is performed by merging together the vertices of the hypergraph H i to form vertices of the coarse hypergraph H i+1 . We represent this by the map g i : V i → V i+1 , where
and r i is the prescribed reduction ratio. It is usual to set r i to be the same for all i. Given that a set of vertices A ⊂ V i maps to a single vertex v ∈ V i+1 , the weight of v is set to the sum of the weights of the vertices in A. The map g i is used to construct E i+1 from E i by applying it to every vertex of each hyperedge e ∈ E i . When the set A ⊂ V i that maps onto a single vertex v ∈ V i+1 represents all the vertices from a hyperedge in E i , the corresponding hyperedge in E i+1 will consist of a single vertex. These single vertex hyperedges in E i+1 are discarded, as they will span at most one part. It is possible that a set of (distinct) hyperedges B ⊂ E i yields a set of identical hyperedges B ∈ E i+1 , in which case B is replaced by a single hyperedge whose weight is set to be the sum of the weights of the hyperedges in B. The way in which the weights of the vertices and hyperedges in H i+1 are calculated ensures that when a partition Π i+1 of the hypergraph H i+1 is projected onto a partition Π i of H i , we have that f o (Π i+1 ) = f o (Π i ) under both the k − 1 and hyperedge cut metrics; furthermore, the respective part weights are also preserved.
The coarsening algorithm should ensure that a good partition of a coarser hypergraph H i , when projected to the original hypergraph H, is also a good partition relative to the optimal partition of H. Coarsening algorithms are discussed in detail in [3, 35] . These try to merge together strongly connected vertices, where connectivity between vertices u, v ∈ V i is quantified in terms of the set of hyperedges incident on both u and v. There are also coarsening algorithms that derive the coarse vertex set V i+1 by first identifying a set of mutually independent hyperedges (i.e. hyperedges that share no common vertices) from E i and then merging together vertices in each hyperedge from this set to form vertices in V i+1 [36] . Most of the commonly used coarsening algorithms are implemented within the PaToH serial multilevel hypergraph partitioning tool [16] .
The aim of the initial partitioning phase is to construct a good partition of the coarsest hypergraph H c (V c , E c ). Since H c is assumed to be significantly smaller than H, time spent in the initial partitioning phase should be dominated by time spent in the other phases of the multilevel framework. Multiple randomlyseeded runs of an iterative improvement algorithm are often used [13, 35, 36, 38] . In a graph partitioning context, more computationally expensive initial partitioning algorithms such as spectral partitioning have also been used [5, 33] . Multiple partitions constructed during this phase may be propagated to the successively finer hypergraphs because it does not follow that the best partition of the coarsest hypergraph H c will always result in the best partition of a finer hypergraph H i , i < c [35] .
After projecting the partition Π i+1 of H i+1 onto Π i of H i during the uncoarsening phase, a heuristic refinement algorithm refines Π i . In the case where a bisection of H c is projected, a variant of the FM algorithm is usually used [13, 35, 36, 38] . When a k-way partition of H c (for a general k > 2) is projected, a randomized greedy refinement algorithm has been shown to yield partitions of good quality with fast run times [38] . This algorithm also proceeds in passes. During each pass, the set of vertices V i is traversed in random order. For each vertex v ∈ V i , the set of neighbouring parts N (v) (such that a part P ∈ N (v) if and only if there exists a vertex u ∈ V i adjacent to v, and u ∈ P ) is constructed and gains for moving the vertex to each part P ∈ N (v) are computed if moving v to part P does not violate the balance constraint on the partition. If there is at least one legal move of v that yields a gain in the objective function, the move resulting in the largest gain is made. Otherwise the vertex v is not moved. The algorithm terminates when the most recently completed pass does not improve the value of the objective function for the partition.
It is possible to utilise the entire multilevel framework to perform further refinement of a partition [36] . This multi-phase approach (also known as Vcycling) recursively applies the multilevel algorithm to the current hypergraph and its refined partition. Formally, suppose we have computed a partition Π i for the hypergraph 
, such that after each projection, the partitions are further refined, as in the standard uncoarsening algorithm. Successive calls to multi-phase refinement are terminated when the most recently completed V-cycle has not yielded an improvement in the objective function of the partition.
Concurrent Work on Parallel Hypergraph Partitioning Algorithms
Concurrent to our work, [19] presented a parallel multilevel hypergraph partitioning algorithm that uses a two-dimensional decomposition in which rectangular blocks of the incidence matrix are assigned to processors. Furthermore, the algorithm in [19] partitions the hypergraph by recursive bisection, unlike our algorithm which uses direct k-way partitioning.
Early indications are that this two-dimensional parallel algorithm may be more efficient than our one-dimensional algorithm for hypergraphs that do not have low maximum vertex/hyperedge degrees. This is because most of the global communication in the two-dimensional algorithm involves O( √ p) processors, whereas the one-dimensional algorithm uses all p processors. On the other hand, the quality of partition produced by our one-dimensional algorithm is generally observed to be better [19] . This may be partly due to the choice of refinement paradigm (recursive bisection versus direct k-way); we note that in principle it should be possible to adapt the k-way parallel refinement algorithm to a two-dimensional distribution of the hypergraph to processors.
Parallel Multilevel Partitioning Algorithm
This section describes our parallel hypergraph partitioning algorithm, based on the multilevel paradigm. We assume that the k−1 objective is to be minimised, although our approach should also generalise to minimising the hyperedge cut objective function. The target architecture for the parallel algorithm is a distributed-memory, message-passing architecture.
Despite the natural parallelism inherent in the splitting steps of a recursive bisection approach, we chose to parallelise the direct k-way algorithm from [38] . This is because variants of FM bisection refinement require that priority queue data structures are maintained after each vertex move to ensure that the sub-sequent highest-gain vertex move is easily found. Hence, after each vertex move, all the neighbouring vertices need to be informed of this move and their gains recomputed, leading to high communication overheads in a distributedmemory setting. By contrast, the serial k-way algorithm in [38] has been shown to be competitive when compared to the recursive bisection approach and requires no priority queue data structures.
Data Distribution
Letting p denote the number of processors, the hypergraph H(V, E) is distributed across the processors as follows. We store |V |/p vertices and |E|/p hyperedges on each processor. The vertices are allocated to processors contiguously, so that the first |V |/p vertices (in terms of their index) are allocated to the first processor, the next |V |/p vertices to the second processor and so on. For each vertex v ∈ V , its weight and current part index in the partition are stored on the processor holding v and similarly, for each hyperedge e ∈ E, its weight is stored on the processor holding e.
A randomised allocation of vertices to processors can help to improve load balance. This can be achieved by computing a pseudorandom permutation of the indices of the elements of the set V and then modifying the hyperedge set E by assigning to every vertex in each hyperedge a new index, as given by the permutation. However, randomisation removes structure that may help to reduce interprocessor communication. Whether or not to use randomised allocation therefores depend on the specific problem instance at hand.
For the first multilevel step, hyperedges are allocated to processors contiguously. In subsequent steps, we also associate a b-bit hash key, with each hyperedge e ∈ E, computed using a variant of the load balancing hash-function h : N a → N, from [43] , where a is the maximum hyperedge cardinality. This function has the desirable property that for an arbitrary set of hyperedges E, h(e) mod p, e ∈ E, is near-uniformly distributed. Consequently, in order to ensure an even spread of hyperedges across the processors while preserving the ability to eliminate duplicate hyperedges, each hyperedge e resides on the processor given by h(e) mod p. To calculate the probability of collision, assume that h distributes the keys independently and uniformly across the key space (i.e., that all M = 2 b key values are equally likely) and let C(N ) be the number of hash-key collisions among N distinct hyperedges. We then have
M , as shown in [43] . Suppose that, for example, |E| = 10 8 and b = 64. Then P(C(N ) ≥ 1) ≤ 0.0003 -ensuring that the probability of collisions is remote. This facilitates rapid hyperedge comparison, since given hyperedges e and e , h(e) = h(e ) implies that e = e . The converse does not hold, but collisions do not affect the correctness of the algorithm. When a collision occurs between hyperedges e, e ∈ E, entire sets e and e can be compared to determine that they are indeed different.
At the beginning of every multilevel step, each processor assembles the set of hyperedges that are incident on each of its locally held vertices using an all-to-all personalized communication. We refer to hyperedges replicated on multiple processors as frontier hyperedges. A map from the local vertices to their adjacent hyperedges is also built (so that both vertex-to-hyperedge and hyperedge-to-vertex maps are available). At the end of the multilevel step, the non-local assembled hyperedges are deleted together with the entire vertexto-hyperedge map.
Experience suggests that for hypergraphs with small maximum vertex degree, the memory overhead incurred by duplicating frontier hyperedges during a single multilevel step is modest; we report the percentage of the total number of pins of the hypergraph replicated in frontier hyperedges in our experiments in Section 4. We also note that even though vertex degree increases for the coarser hypergraphs (when compared to the original hypergraph H(V, E)), the first few hypergraphs in the multilevel sequence are considerably larger than the coarser hypergraphs, and thus the replication of frontier hyperedges is most significant (in terms of the amount of memory used) during the first few levels of the multilevel sequence. Further, memory overhead may be reduced by omitting large hyperedges from the coarsening computation; this has also been proposed in the context of serial partitioning in order to accelerate the coarsening computation [13, 16] .
Parallel Coarsening Phase
In this section, we describe our parallel coarsening algorithm (cf. Algorithm 1). At the beginning of each coarsening step, the processors first perform a parallel matching computation in order to construct the map g i ; this is described in detail later. To construct E i+1 , each processor needs to transform its locally stored |E i |/p hyperedges from E i using the map g i . A processor may store a
perform all-to-all communication of required values of g i
5:
apply g i values across locally stored hyperedges 6: compute destination processors for all e ∈ E i+1 using hash-function h
7:
perform all-to-all communication of hyperedges in E i+1 8: perform load balancing communication of V i+1 9:
hyperedge e ∈ E i with a vertex v ∈ e for which the processor does not store g i (v). Thus, required values of g i are first communicated by a personalized allto-all communication. Each processor then applies g i across each of the |E i |/p hyperedges stored on that processor. The removal of duplicate hyperedges in E i+1 and load balancing are done as follows. Processors communicate each hyperedge e ∈ E i+1 and its weight to the destination processor given by h(e) mod p. Each processor retains distinct hyperedges, setting their weight to be the sum of the weights of their respective duplicates (if any), since all identical hyperedges will possess the same hash key value and hence will have been communicated to the same processor. The parallel coarsening step concludes with a load-balancing communication of V i+1 such that each processor stores |V i+1 |/p vertices at the start of the subsequent coarsening step.
We now describe our parallel vertex matching algorithm, which is based on the First-Choice (FC) or the Heavy Connectivity Clustering (HCC) serial coarsening algorithm [38, 13] . Given a hypergraph H i (V i , E i ), the serial algorithm proceeds as follows. The vertices of the hypergraph are visited in a random order. For each vertex v ∈ V i , all vertices (both those already matched and those unmatched) that are connected via hyperedges incident on v are considered for matching with v. A connectivity metric is computed between pairs of vertices and the most strongly connected vertex to v is chosen for matching, provided that the resulting cluster does not exceed a prescribed maximum weight. The matching computation ends when |V i |/|V i+1 | > r, with r the prescribed reduction ratio.
Our parallel matching algorithm is summarised in Algorithm 2. Each processor γ first traverses its local vertex set in random order, computing vertex matches as in the serial algorithm. We have implemented the algorithm with the absorption connectivity metric [2, 36] :
Here w(e) denotes the weight of hyperedge e ∈ E i and w(u) and w(v) the weights of the two vertices u and v respectively. Processor γ also maintains a request set for each of the p − 1 other processors. If the best match for a local vertex u is computed to be a vertex v stored on processor ρ = γ, then the vertex u is placed into the request set S γ,ρ . If another local vertex subsequently chooses u or v as its best match, then it is also added to the request set S γ,ρ . The local matching computation terminates when the ratio of the initial number of local vertices to the number of local coarse vertices exceeds a prescribed threshold (cf. equation 5), or when all the local vertices have been visited. Our definition of local vertices includes clusters formed from tentative matches of local vertices with vertices stored on remote processors, as well as singleton clusters from local vertices being simply copied over to the coarse hypergraph. Note that the processors complete the local matching computation asynchronously (lines 1 and 2 in Algorithm 2) but synchronize prior to the communication step in line 3.
Communication steps resolve the vertex matching requests that span multiple processors (lines 3 to 21 in Algorithm 2). In order to enable a match between two vertices on different processors that make mutual requests to each other, the communication proceeds in two stages. In the first stage, processor γ communicates request sets S γ,ρ to processor ρ and then receives replies to its requests from ρ if and only if γ < ρ, while in the second stage processor γ communicates request sets S γ,ρ to processor ρ and receives replies to its requests from ρ if and only if γ > ρ.
The processors concurrently decide to accept or reject matching requests from other processors. Denote by M . This is necessary since granting this match may result in a vertex that exceeds the maximum allowed vertex weight if the remote match of v with a vertex on processor µ is granted. 
When informed of a match rejection by processor ρ, processor γ will cluster all the vertices in the set M v γ,ρ into a single coarse vertex (lines 11, 12, 21 and 22 in Algorithm 2). Our algorithm admits the possibility that unconnected vertices sharing a common neighbour are matched together during a single coarsening step. To see how this could happen, consider vertices u, v on processor ρ and w on processor γ, with u and w connected to v but not connected to each other. Processor ρ may match u and v together and then put a match request from γ to match v with w, yielding a coarse vertex that encompasses u, v and w if v and w are sufficiently highly connected for the remote match to be granted. This behaviour is allowed because even though u and w are unconnected, in most cases the three vertices u, v and w nonetheless form a natural cluster.
Serial Initial Partitioning Phase
The parallel coarsening phase terminates when the number of vertices in the coarsest hypergraph is below a pre-specified threshold α H (k) (in our implementation k times a user-specified constant) or the most recent reduction in the number of vertices |V i |/|V i+1 | does not equal or exceed a minimum required rate r min . We assume that the coarsest hypergraph H c (V c , E c ) is small enough for a partition to be rapidly computed on a single processor (when compared to the runtimes of the other phases of the algorithm). This is motivated by the observation that the parallel coarsening algorithm usually reduces the original sparse hypergraph by a few orders of magnitude (e.g. from 10 pins). Currently, our algorithm is not configured to partition the hypergraph in the scenario where the coarsest hypergraph cannot fit into the memory of a single machine. However, we foresee two possible approaches to solving this problem. The first is a parallel initial partition generator, while the second would enforce further clustering of the vertices with the sole purpose of reducing the hypergraph so that it can be partitioned serially on a single machine. In our implementation, H c (V c , E c ) is gathered on all processors and p runs of the serial algorithm are computed concurrently across the processors in parallel. The partition with the lowest cutsize is then projected through the parallel uncoarsening phase.
Parallel Uncoarsening Phase
At the beginning of each step of the parallel uncoarsening phase, we have
is not available, it is requested from the processor that stores g i (v) ∈ V i+1 . We set Π i (v) = Π i+1 (g i (v)). The frontier hyperedges are then assembled on each processor, as at the beginning of each parallel coarsening step. We then apply our parallel formulation of the greedy k-way serial refinement algorithm [38] .
The parallel refinement algorithm proceeds in passes, during each of which a vertex can be moved at most once; however, instead of moving individual vertices across a partition boundary, as in the serial algorithm, the parallel algorithm moves sets of vertices (since vertices will be moved concurrently across the processors). Each processor γ traverses its local vertex set in a random order and for each v ∈ V γ i , the legal move (if any) leading to the largest positive gain in the objective function is computed. When such moves exist, they are maintained in sets U γ i,j , i = j, i, j = 1, . . . , k, where i and j denote current and destination parts respectively. In order to reduce possible conflicts (e.g. movement of vertices in opposing directions such that their individual moves might result in a positive gain in objective function but when both are made, they in fact yield a non-positive gain), the refinement pass proceeds in two stages. During the first stage, only moves from parts of higher index to parts of lower index are permitted and vice versa during the second stage. Vertices moved during the first stage are locked with respect to their new part in order to prevent them moving back to their original part in the second stage of the current pass. We note that this does not guarantee that a sequence of (concurrently made) vertex moves will result in a positive gain. However, in our experiments, this parallel refinement algorithm produced partitions competitive with those produced by state-of-the-art serial tools.
The partition balance constraint (cf. equation 4) is maintained via global communcation. The alternative method, which precludes global communication, is to enforce a balance constraint locally on each processor such that the partitioning constaint in equation 4 is maintained overall. However, as the number of processors increases, the local constraints on each processor are likely to become too tight to allow the algorithm to sufficiently explore the feasible solution space; we instead choose to pay the price of possibly higher communication overhead for better refinement algorithm performance.
At the beginning of each of the two stages, the processors know the exact part weights and each processor γ maintains the balance constraint during its local computation of the set U γ i,j . The associated weights and gains of all the nonempty sets U γ i,j are communicated to the root processor (chosen arbitrarily) which then determines the actual partition balance that results from the moves of the vertices in the sets U γ i,j . If the balance constraint is violated, the root processor determines which of the moves should be taken back and informs the processors containing the vertices to be moved back. This is implemented as a greedy scheme favouring taking back moves of sets with large weight and small gain. Finally, the root processor broadcasts the updated part weights before the processors proceed with the subsequent stage. As in the serial algorithm, the refinement procedure terminates when the overall gain of a pass is not positive. Note that vertices need not be explicitly moved between processors; rather, their part index value can be changed by the processor that stores the vertex.
Parallel Multi-phase Refinement
This section describes our parallel multi-phase refinement algorithm, which consists of three multilevel phases, namely the parallel restricted coarsening phase, the serial initial partitioning phase and the parallel uncoarsening phase. The serial initial partitioning and the parallel uncoarsening phases are identi-cal to those described in Sections 3.3 and 3.4.
The parallel restricted coarsening phase takes a partition Π i of the hypergraph H i as input; vertices are only allowed to match with a neighbouring vertex that belongs to the same part within the partition, i.e., v ∈ V i can match with u ∈ V i if and only if Π i (u) = Π i (v). This can be incorporated within the existing parallel coarsening algorithm, since transitivity of the above condition (i.e. Π i (u) = Π i (v) and Π i (v) = Π i (w) imply Π i (u) = Π i (w)) ensures that the resolution of remote matching requests will result in the matching together of vertices allocated to the same part. Rather than follow this approach, we exploit additional concurrency afforded by the restriction. Vertices belonging to the same part are collected onto a single processor; this precludes further communication during the restricted coarsening phase since only vertices allocated the same part may match together. Processors then concurrently execute a serial coarsening algorithm (FC/HCC). The partition balance criterion in equation 4 should ensure that computational load balance across the processors is maintained during this phase. The drawback of this scheme is that load balance is only maintained if the number of parts in the partition k is a multiple of the number of processors p and k ≥ p. We note that this is a not a restriction on parallel multi-phase refinement per se. The construction of the coarse hypergraph H i+1 (V i+1 , E i+1 ) is done as in the parallel coarsening algorithm from Section 3.2.
Analytical Performance Model
In this section we present an analytical performance model of our parallel algorithms and derive the average-case asymptotic runtime T p , assuming that the underlying parallel architecture is a p-processor hypercube. We show that the algorithm is asymptotically scalable when the maximum vertex and hyperedge degrees in the hypergraph are small, and also derive its isoefficiency function [30] .
Let |V | = n and suppose that the numbers of vertices and hyperedges in the original hypergraph are of the same order of magnitude, so that we may write |E| = Θ(n). Further, let l and d denote the maximum hyperedge degree and the maximum vertex degree of the original hypergraph, and l i and d i denote the respective maximum degrees within hypergraph H i in the multilevel process. We assume that l and d are small constants, so that l n and d n, and also assume that the numbers of vertices and hyperedges are respectively reduced by constant factors 1 + α and 1 + ω (α, ω > 0) at each coarsening step. We discuss the appropriateness of these assumptions in Section 4. We let n i = max{|V i |, |E i |} and assume in our analysis that with increasing i these values become small when compared to the numbers of vertices/hyperedges in the original hypergraph. We have that l i ≤ l for all 0 ≤ i ≤ c. We know that in practice d i is increasing because the number of vertices usually decreases more quickly than the number of hyperedges. However, under our assumptions it remains very small compared to n. This is because d i ≤ n i for all i ≤ c and n i is decreasing so that n i n for i close to c.
Performance Model of the Parallel Multilevel Algorithm
We consider the computation and the communication requirements of each phase in turn, given O(log n) coarsening steps. n i . However, here d i ≤ n i n (as noted earlier), so that the computational requirement of the latter coarsening steps is dominated by the requirement of the first few coarsening steps.
During the serial initial partitioning phase, the hypergraph has size O(k) and can be heuristically partitioned to yield a "good" sub-optimal partition in O(k 2 ) computation steps [27] .
A single uncoarsening step consists of projecting a partition Π i+1 of H i+1 onto H i and refining the resulting partition of H i to obtain Π i . Projecting a partition involves at most O(n i /p) computation steps on each processor. We consider a single pass of the parallel greedy refinement algorithm, and assume that the refinement algorithm terminates in a small number of passes. Vertex gains are computed concurrently and then rebalancing moves are computed on the root processor, if required. In order to compute the gains for a vertex move, the algorithm needs to visit all the hyperedges incident on that vertex and determine their connectedness to the source and destination parts of the move. This requires O(d i n i l i /p) computation steps per pass. The rebalancing computation has complexity O(pk 2 ). Arguing as for the coarsening phase, the overall computation requirement during each uncoarsening step (including assembling incident hyperedges) is O(n i /p) + O(pk 2 ).
The overall asymptotic computational complexity of our parallel partitioning algorithm is thus given by
We now shift our attention to an average-case communication cost analysis, assuming that the underlying parallel architecture is a p-processor hypercube with bidirectional links and store-and-forward routing.
Again, we first consider the procedure of assembling the hyperedges incident to locally held vertices on each processor. This is done using an all-to-all personalized communication. ). An all-to-all personalised communication with this message size can be performed in O(n i /p) time on a hypercube [30] . In the multilevel steps involving the coarser hypergraphs (i.e. where b ≤ i ≤ c, for some b > 1), we do not necessarily have d i n i ; however, here d i ≤ n i n and the communication requirements of these steps are dominated by the communication requirements of the multilevel steps that involve the larger hypergraphs.
We now consider the cost of communicating the required matching vector entries in computing the local subset of E i+1 from the local subset of E i on each processor and the subsequent load balancing communication of hyperedges of
hyperedges from E i on each processor, to construct the corresponding subset of E i+1 , each processor requires O(n i l i /p) values from the map g i (this requirement was discussed in Section 3.2). Assuming that each of the required entries in g i are on average equally likely to be stored on any of the p processors, the message size between any two processors in the all-to-all communication is on average O(n i l i /p 2 ). In the load balancing communication, the hyperedges in E i+1 are scattered across the p processors with equal probability, thus also giving an average message size in the all-to-all
). Hence, given that l i n i , these all-to-all personalized communications can be done in O(n i /p) time.
By a similar argument, assuming that a vertex is on average equally likely to match with any other vertex in the hypergraph, during each coarsening step the cost of communicating matching requests and their outcomes is done in O(n i /p) time. During each coarsening step, we also require the computation of prefix sums to determine the numbering of the vertices in the coarser hypergraph, which has complexity O(log p). During refinement, we require an additional broadcast of rebalancing moves and a reduction operation to compute the cutsize, which have complexities O(k 2 log p) and O(log p) respectively (since each processor may be required to take moves back in O(k 2 ) directions).
Arguing as for the computational complexity, we deduce that the overall average-case asymptotic communication cost of our parallel partitioning algorithm is
Eliminating dominated terms from equations 13 and 14, the asymptotic total average-case parallel run time is
As the complexity of the serial algorithm is O(n), it follows that the algorithm is cost-optimal (scalable) for large n as the O(n/p) term dominates the parallel runtime.
To derive the isoefficiency function (for a discussion of isoefficiency see e.g. [30] ), we note that the complexity of the serial multilevel algorithm (and thus problem size W ) is O(n) when l n. From equation 15, the total overhead T o of the parallel algorithm becomes:
The isoefficiency function is then given by
and thus, omitting the lower order terms,
We note that this isoefficiency function is of the same order as that given in [39] for the parallel graph partitioning algorithm implemented in the ParMeTiS tool [41] .
Model of Algorithm with Parallel Multi-phase Refinement
Since we assume that parallel multi-phase refinement converges in a small number of iterations, the performance model of this algorithm differs from that of the multilevel algorithm analysed in Section 3.6.1 only in the parallel restricted coarsening phase described in Section 3.5.
A single parallel restricted coarsening step consists of two stages. In the first stage, the vertices belonging to the same part in Π i are assigned to the same processor; in the second stage the coarsening algorithm computes the map g i concurrently on all processors in parallel, without communication. The vertices are then allocated to processors using an all-to-all personalised communication. We assume that the vertices have an equal probability of being assigned to each of the k parts of the partition and thus, assuming a random distribution of vertices to processors prior to partitioning, the average message size in the all-to-all communication will be O(n i /p 2 ) so that the all-to-all communication can be done in O(n i /p) time on a hypercube. Since the k-way partition is subject to the balancing constraint of equation 4, each processor will on average hold O(n i /p) vertices during a parallel restricted coarsening step (assuming a low variance in vertex weights). The computational requirement of a single step of the parallel restricted coarsening algorithm is O(n i d i l i /p), followed by the construction of the coarse hyperedge set E i+1 , as in the standard multilevel algorithm. Thus, since d i n i and l i n i for the dominating multilevel steps (i.e. those involving the largest hypergraphs), the asymptotic runtime and the isoefficiency function of the parallel multilevel algorithm with parallel multi-phase refinement are the same as those of the parallel multilevel algorithm analysed in Section 3.6.1.
Experimental Evaluation
Implementation and Test Environment
The three phases of our parallel multilevel k-way partitioning algorithm have been implemented in C++ using the Message Passing Interface (MPI) standard [47] , thus forming the Parkway 2.1 parallel hypergraph partitioning tool. For the initial partitioning phase, our implementation provides an interface to the HMETIS PartKway() routine from the hMeTiS [37] library; this is used for serial partitioning when the coarsest hypergraph from the parallel coarsening phase has less than 200 × k vertices.
The architecture used in all experiments reported below consists of a Beowulf Linux Cluster with 64 dual-processor nodes, although we were restricted to a 32-processor (16-node) partition due to configuration limitations and high machine utilisation. Each node in the cluster has two Intel Pentium 4 processors running at 2GHz with 2GB of RAM. The nodes are connected by a Myrinet network with a peak throughput of 250 MB/s. The serial base cases reported below (generated by the PaToH [16] tool) were run as single processor jobs on this cluster.
Test Hypergraphs
The hypergraphs used in our experimental evaluation have been derived either from large sparse matrices that arise in scientific numerical computations, or models of a physical system (such as a VLSI circuit). For the sparse matrices, the corresponding hypergraphs represent the 1D row-wise decomposition of the sparse matrix for parallel matrix-vector multiplication, as described in [13] . Thus, a given sparse matrix defines the incidence matrix of its corresponding hypergraph, except that non-zeros are added to the entries of the main diagonal of the incidence matrix wherever the sparse matrix has a zero on the main diagonal. The weights of the vertices in the hypergraph are set to be the number of non-zeros in the corresponding row of the sparse matrix, while weights of the hyperedges are set to unity. Finally, single-vertex hyperedges are removed, since they will not contribute to the value of the k-1 metric of the computed partition. Table 1 shows the main characteristics of the test hypergraphs. The first three matrices were obtained from the University of Florida Sparse Matrix Collection [18] . The entries in the Stanford matrix represent the link structure between URLs within the Stanford University domain. This matrix has been used in PageRank calculations in [29] . ATTpre2 represents a set of linear equa- 
Empirical Scalability Evaluation
In this set of experiments, we set out to empirically validate the asymptotic scalability analysis from Section 3.6. Two experiments are performed. In the first, we compute speedups relative to the PaToH serial partitioning tool and also investigate partition quality as the number of processors is increased. In the second, we investigate the empirical validity of our scalability analysis by observing whether processor efficiency is maintained with an increasing number of processors, given that the problem size is increased according to our derived isoefficiency function.
For these experiments, our parallel algorithm was configured as follows. The reduction ratio (cf. equation 5) was set to 1.75. We set the partitioning objective to SOED (sum of external degrees) in HMETIS PartKway() (although we report the resulting k-1 objective values) and ran it with the First-Choice coarsening option and the V-Cycle best intermediate refinement option. The SOED metric is computed using equation 3, but using λ i instead of λ i − 1 to scale the hyperedge weight so that every hyperedge contributes to the cutsize. We used it because HMETIS PartKway() cannot directly minimize the k − 1 objective and the two metrics are very closely related. During the parallel uncoarsening phase, only the best partition from the serial partitioning phase was projected. Partitions with a balance constraint of 5% were sought ( = 0.05 in equation 4). The tool PaToH [16] provided the serial base-case comparison and was run with default parameters and the PATOH CONPART (k-1 metric) partitioning objective. To ensure that the final partition satisfies the balance constraint of equation 4 with = 0.05 when computed by recursive bisection using PaToH 1 , the balance constraint on each individual bisection was set to ((1 + )/k) 1/ log 2 k − 0.5. Table 3 shows the overhead (measured as the proportion of the total number of pins replicated across the processors and shown in bold, as well as our simulated overhead figures for contiguous and random allocation of hyperedges to processors for the initial hypergraph only) associated with assembling frontier hyperedges and the average imbalances in the hypergraph distribution across the processors prior to the load balancing communications. We note that the number of frontier pins is significantly reduced when the contiguous rather than random hyperedge allocation is used for the initial hypergraph; this may be because the incidence matrices have a substantial amount of non-zeros around the main diagonal. The average frontier hyperedge overhead is low for the ATTpre2, voting175 and bcsstk32 hypergraphs, but is more significant for the cage13, ibm18 and Stanford hypergraphs, the difference being accounted for by factors such as maximum vertex/hyperedge degrees and the degree of locality in the hypergraph structure. The observed hyperedge and vertex ratios justify the load balancing communication at each multilevel step; for example, an average ratio of 1.05 at each of 8 multilevel steps potentially leads to a final ratio of 1.05 8 = 1.48 without explicit load balancing. Tables 6 and 7 in the Appendix summarise the observed performance of Parkway2.1 when executed on the benchmark hypergraphs with a varying number of processors (p ≥ 2). The PaToH tool provides the serial base case and the runtimes and cutsizes shown are the averages of ten runs. Figure 1 shows the speedups over the PaToH serial base case for each hypergraph. For voting175 and ATTpre2, speedup increases with the number of processors; in the case of ATTpre2, it then drops off when more than 24 processors are used. A possible explanation may be the presence of a small number of hyperedges that are significantly larger than the average hyperedge length (cf. Table 2 ). These may cause imbalanced all-to-all communications, with a small number of messages that are signficantly larger than the rest. The speedups for Stanford, bcsstk32 and ibm18 are shallow, while there is no absolute speedup for cage13. These observations are consistent with the frontier hyperedge overheads shown in Table 3 . It is not surprising that we observe best speedups for the voting175 hypergraph; in terms of sparsity and low maximum vertex and hyperedge degree, it most closely fits the assumptions of our performance model in Section 3.6. Figure 2 plots the partition quality delivered by Parkway2.1 in terms of the ratio of its cutsize (k − 1 metric) to the cutsize of the PaToH serial base case. We observe that the partitioning performance of our parallel algorithm closely matches and often exceeds that of PaToH. We conjecture two reasons for this. Firstly, within our parallel multilevel framework, many partitioning runs are performed on the coarsest hypergraph (concurrently on each processor). Thus, it is likely that our uncoarsening phase begins with a superior initial partition. Secondly, our parallel algorithm uses direct k-way partitioning whereas PaToH uses recursive bisection. For larger values of k, the direct k-way partitioning approach may be better than recursive bisection because it maintains a global view when making vertex moves, rather than a divide-and-conquer approach. In addition, the recursive bisection approach may be further constrained by the need to enforce tight balance constraints on each individual bisection run. These results also suggest that the partition quality provided by our parallel algorithm is maintained as the number of processors increases.
The existence of the isoefficiency function should enable our parallel algorithm to maintain a constant level of efficiency as the number of processors is increased, by appropriately increasing the problem size [30] . That it to say, with our isoefficiency function O(p 2 log p) (for fixed partition size), if the number Table 3 Vertex and hyperedge distribution statistics for varying numbers of processors. The frontier pins column shows the total number of pins duplicated in frontier hyperedges across all processors for all multilevel steps in bold, expressed as a multiple of total number of pins in all the hypergraphs in the multilevel sequence. Also shown in parentheses are simulated overhead figures for a contiguous (c) and random (r) allocation of hyperedges to processors for the initial hypergraph only; note that vertices are always allocated contiguously to processors. The hyperedge ratio and vertex ratio columns are the average ratios of the maximum and minimum numbers of hyperedges/vertices of a hypergraph on individual processors before load balancing. These latter two averages are computed over all multilevel steps.
ATTpre2 frontier pins hyperedge ratio vertex ratio 2 0.07 (0.05c,0.53r) of processors is increased from p to p , then the problem size must increase by a factor ψ p = (p 2 log p )/(p 2 log p) in order to maintain the efficiency achieved using p processors. To validate this empirically, we make use of the voting family of hypergraphs that cover a wide range of problem sizes and share a similar structure that satisfies our assumptions of low maximum hyperedge and vertex degrees. We take as our base case the voting100 hypergraph partitioned using 2 processors, and represent the problem size as the number of pins in the hypergraph. For higher values of p, we compute the required ideal problem size and choose the voting hypergraph with the number of pins that most closely matches this number. We stop at p = 10, since above this value the per-processor memory requirement becomes a limiting factor.
In order to compute parallel efficiencies on hypergraphs that could not be partitioned on a single processor, we approximate serial runtimes by fitting a linear regression model to a log-log plot of the observed serial PaToH runtimes, as shown in equation 18 . Here η is the residual term representing the variation between log n and log T s not explained by the linear model.
The R 2 value of this regression is well over 0.99, meaning that this model explains more than 99% of the variation between (the log of) PaToH runtimes and the (log of the) number of vertices in the hypergraph n. The full table of observed and extrapolated serial runtimes for k = 8, k = 16 and k = 32 are presented in Table 8 in the Appendix. Table 4 presents the efficiencies observed for k = 8, k = 16 and k = 32 as the number of processors increases, given that the problem size is also increased according to our isoefficiency function. Here, the "actual ψ p " row refers to the actual increase in the problem size, given that we chose the voting hypergraph which best approximates the required increase. We note that efficiency remains relatively stable, as predicted by our scalability analysis.
Empirical Partition Quality Improvement Through Parallel Multi-phase Refinement
We conducted a further set of experiments to compare empirically the quality of partitions produced by the parallel multilevel algorithm implemented as in Section 4.3 with those produced by the parallel multilevel algorithm utilising parallel multi-phase refinement, as described in Section 3.5.
During the parallel uncoarsening phase, our implementation employed parallel multi-phase refinement at each multilevel step. Before each call to the parallel k-way refinement algorithm, a random permutation of vertices to processors was performed, as it was found to improve the performance of the parallel multi-phase refinement algorithm. Each processor generates a pseudorandom vector (whose length equals the number of locally stored vertices) of integers, each from 1 to p (inclusive). Each processor then sends the vertices to the processor given by the corresponding value in the random vector. The remainder of the coarsening and refinement parameters were implemented as in Section 4.3, and again partitions with balance criterion of 5% were sought. Since vertices from the same part are allocated to the same processor when using the parallel multi-phase refinement algorithm, our implementation was restricted to configurations of p and k where k is an integer multiple of p to ensure a good load balance. Table 5 shows the runtimes and partition cutsizes obtained using parallel multi-phase refinement for the voting175, ATTpre2 and cage13 hypergraphs. Across the hypergraphs, the average percentage improvements in cutsize are 3.1%, 4.9% and 8.0% for k = 8, k = 16 and k = 32 respectively. We expect that the increase in the number of feasible partitions as partition size increases explains the greater observed average improvements. However, we observe that in a few cases, cutsize actually degrades (due to the randomized nature of the algorithm); in addition the runtimes of the parallel algorithm with parallel multi-phase refinement are an order of magnitude slower than when multiphase refinement is not used. Thus, except for situations where the quality of partition is the main objective and runtime is not a consideration, it is unlikely that parallel multi-phase refinement would be preferred to the "vanilla" k-way parallel refinement algorithm.
Conclusion and Future Work
In this paper, we have presented the first parallel algorithms for the hypergraph partitioning problem and have shown the scalability for certain classes of hypergraph under a theoretical performance model. Specifically, we have enabled large hypergraphs with small maximum vertex and hyperedge degrees to be partitioned in a scalable manner, something that had hitherto not been achieved. In the context of the multilevel paradigm, we presented a parallel coarsening algorithm based on the serial FC/HCC algorithms [38, 13] and two parallel refinement algorithms, based on serial greedy k-way [38] and serial multi-phase [36] refinement respectively.
The algorithms were empirically evaluated on hypergraphs representing sparse matrix problems from the domains of DNA electrophoresis, analog circuits, structural engineering, VLSI circuit design, PageRank computations and performance analysis. We observed reasonable speedups on hypergraphs with small maximum hyperedge degrees and/or a high degree of structural locality, and partitions that were competitive with those produced by the state-of-theart serial multilevel partitioner PaToH [16] . The isoefficiency study conducted on the voting family of hypergraphs validated our theoretical performance model.
Using the parallel multi-phase refinement algorithm instead of the parallel greedy k-way algorithm, we were able to achieve even better partition quality over the basic parallel k-way refinement algorithm (with average improvements of 3.1%, 4.9% and 8.0% for k = 8, k = 16 and k = 32 respectively), albeit at a significantly larger runtime cost.
There are a number of directions that we can see for future work. In particular, the algorithms in this paper are designed to be applied to general hypergraphs representing sparse irregular problems and as such have not been optimised for particular application domains. We anticipate that an improvement in runtime and also partition quality would be possible if the algorithms were tailored to take advantage of any inherent domain-specific structure in the hypergraphs to achieve a better data distribution among the processors and guide the coarsening and refinement algorithms. Secondly, it is clear that frontier hyperedge overhead is the main limiting factor in terms of the scalability of our algorithms. A different approach to data distribution could be helpful in this respect. While here we have considered a data distribution scheme analogous to one-dimensional row-wise decomposition for parallel sparse matrix-vector multiplication, concurrent work presented in [19] uses a two-dimensional distribution. Experimental results suggest that this distribution results in lower overheads when partitioning hypergraphs with larger maximum vertex and hyperedge degrees. Table 8 PaToH runtimes on the voting hypergraphs. For voting100 to voting200 the runtimes are averages over ten runs. Remaining runtimes are approximated by a log-log least squares regression over the observed runtimes. The regression α is the zero intercept and the regression β is the slope of the line. The quantities in brackets denote the standard errors of the parameters. 
