We study the problem of one-dimensional partitioning of nonuniform workload arrays with optimal load balancing for heterogeneous systems. We look at two cases: chain-on-chain partitioning, where the order of the processors is specified, and chain partitioning, where processor permutation is allowed. We present polynomial time algorithms to solve the chain-on-chain partitioning problem optimally, while we prove that the chain partitioning problem is NP-complete. Our empirical studies show that our proposed exact algorithms produce substantially better results than heuristics while the solution times remain comparable.
Introduction
In many applications of parallel computing, load balancing is achieved by mapping a possibly multi-dimensional computational domain down to a onedimensional (1D) array, and then partitioning this array into parts with equal weights. Space filling curves are commonly used to map the higher dimensional domain to a 1D workload array to preserve locality and minimize communication overhead after partitioning [6, 7, 10, 15] . Similarly, processors can be mapped to a 1D array so that communication is relatively faster between close processors in this processor chain [11] . This eases mapping for computational domains and improves efficiency of applications. The load balancing problem for these applications can be modeled as the chain-on-chain partitioning (CCP) problem, where we map a chain of tasks onto a chain of processors.
Formally, the objective of the CCP problem is to find a sequence of P − 1 separators to divide a chain of N tasks with associated computational weights into P consecutive parts to minimize maximum load among processors.
In our earlier work [17] , we studied the CCP problem for homogenous systems, where all processors have identical computational powers. We have surveyed the rich literature on this problem, proposed novel methods as well as im-provements on existing methods, and studied how these algorithms can be implemented efficiently to be effective in practice. In this work, we investigate how these techniques can be generalized for heterogeneous systems, where processors have varying computational powers. Two distinct problems arise in partitioning chains for heterogeneous systems. The first problem is the CCP problem, where a chain of tasks is to be mapped onto a chain of processors, i.e., the pth task subchain in a partition is assigned to the pth processor. The second problem is the chain partitioning (CP) problem, where a chain of tasks is to be mapped onto a set, as opposed to a chain, of processors, i.e., processors can be permuted for subchain assignments. For brevity, the CCP problem for homogenous systems and heterogeneous systems will be referred to as the homogenous CCP problem and heterogeneous CCP problem, respectively. The CP problem refers to the chain partitioning problem for heterogeneous systems, since it has no counterpart for homogenous systems.
In this article, we show that the heterogeneous CCP problem can be solved in polynomial time by enhancing the exact algorithms proposed for the solution of the homogenous CCP problem [17] . We present how these exact algorithms for homogenous systems can be enhanced for heterogeneous systems and implemented efficiently for runtime performance. We also present how the heuristics widely used for the solution of homogenous CCP problem can be adapted for heterogeneous systems. We present the implementation details and pseudocodes for the exact algorithms and heuristics for clarity and reproducibility. Our experiments with workload arrays coming from image-spaceparallel volume rendering and row-parallel sparse matrix vector multiplication applications show that our proposed exact algorithms produce substantially better results than the heuristics while the solution times remain compara-ble. On average, optimal solutions provide ??times better load imbalance than heuristics for 128-way partitionings of volume rendering and sparse matrix datasets, respectively. On average, the time it takes to compute an optimal solution is less than ::::
2.20 ?? times the time it takes to compute an approximation using heuristics for 128 processors, and thus the preprocessing times can be easily compensated by the improved efficiency of the subsequent computation even for a few iterations.
The CP problem on the other hand, is NP-complete as we prove in this paper.
Our proof uses a pseudo-polynomial reduction from the 3-Partition problem, which is known to be NP-complete in the strong sense [8] . Our empirical studies showed that processor ordering has a very limited effect on the solution quality, and an optimal CCP solution on a random processing ordering serves as an effective CP heuristic.
The remainder of this paper is organized as follows. Table 1 summarizes important symbols used throughout the paper. Section 2 introduces the heterogeneous CCP problem. In Section 3, we summarize the solution methods for homogenous CCP. In Section 4, we discuss how solution methods for homogenous systems can be enhanced to solve the heterogeneous CCP problem.
In Section 5, we discuss the CP problem, prove that it is NP-Complete. We present the results of our empirical studies with the proposed methods in Section 6, and finally, we conclude with Section 7.
Chain-on-chain (CCP) Problem for Heterogeneous Systems
In the heterogeneous CCP problem, a computational problem, which is decomposed into a chain T = t 1 , t 2 , . . . , t N of N tasks with associated positive computational weights W = w 1 , w 2 , . . . , w N is to be mapped onto a pro- Table 1 The summary of important abbreviations and symbols Notation Explanation N number of tasks T task chain, i.e., T = t 1 , t 2 , . . . , t N t i ith task in the task chain T i,j task subchain of tasks from t i upto t j , i.e., T i,j = t i , t i+1 , . . . , t j w i computational load of task t i wmax maximum computational load among all tasks wavg average computational load of all tasks w min minimum computational load of all tasks W i,j total computational load of task subchain T i,j Wtot total computational load, i.e., Wtot = W 1,N P number of processors P processor chain, i.e., P = P 1 , P 2 , . . . , P P in the CCP problem processor set, i.e., P = {P 1 , P 2 , . . . , P P } in the CP problem Pp pth processor in the processor chain Pq,r processor subchain from Pq upto Pr, i.e., Pq,r = Pq, P q+1 , . . . , Pr ep execution speed of processor Pp Eq,r total execution speed of processor subchain Pq,r
Etot total execution speed of all processors, i.e., Etot = E 1,P B * ideal bottleneck value, achieved when all processors have load in proportion to their speed UB upper bound on the value of an optimal solution LB lower bound on the value of an optimal solution sp index of the last task assigned to the pth processor. lg x base-2 logarithm of x, i.e., lg x = log 2 x.
cessor chain P = P 1 , P 2 , . . . , P P of P processors with associated execution speeds E = e 1 , e 2 , . . . , e P . The execution time of task t i on processor P p is w i /e p . For clarity, we note that there are no precedence constraints among the tasks in the chain.
A task subchain T i,j = t i , t i+1 , . . . , t j is defined as a subset of contiguous tasks.
Note that T i,j defines an empty task subchain when i > j. The computational weight of T i,j is W i,j = i≤h≤j w h . A partition Π should map contiguous task subchains to contiguous processors. Hence, a P -way partition of a task chain with N tasks onto a processor chain with P processors is described by a sequence Π = s 0 , s 1 , . . . , s P of P +1 separator indices, where
Here, s p denotes the index of the last task of the pth part so that processor P p receives the task subchain T s p−1 +1,sp with load
The cost C(Π) of a partition Π is determined by the maximum processor load among all processors, i.e.,
This C(Π) value of a partition is called its bottleneck value, and the processor defining it is called the bottleneck processor. The CCP problem is to find a partition Π opt that minimizes the bottleneck value C(Π opt ).
Similar to the task subchain, a processor subchain P q,r = P q , P q+1 , . . . , P r is defined as a subset of contiguous processors. Note that P q,r defines an empty processor subchain when q > r. The computational speed of P q,r is E q,r = q≤p≤r e p .
The ideal bottleneck value B * is defined as
where E tot is the sum of all processor speeds and W tot is the total task weight;
i.e., E tot = E 1,P and W tot = W 1,N . Note that B * can only be achieved when all processors are equally loaded, so it constitutes a lower bound on the achievable bottleneck values, i.e., B * ≤ C(Π opt ).
CCP Algorithms for Homogenous Systems
The homogenous CCP problem can be considered as a special case of the heterogeneous CCP problem, where the processors are assumed to have equal speed, i.e., e p = 1 for all p. Here, we review the CCP algorithms for homogenous systems. A comprehensive review and presentation of homogenous CCP algorithms are available in [17] .
Heuristics
Possibly the most commonly used CCP heuristic is recursive bisection (RB), a greedy algorithm. RB achieves P -way partitioning through lg P levels of bisection steps. At each level, the workload array is divided evenly into two.
RB finds the optimal bisection at each level, but the sequence of optimal bisections at each level may lead to a multi-way partition which is far away from an optimal. Pınar and Aykanat [17] proved that RB produces partitions with bottleneck values no greater than B * +w max (P − 1)/P .
Miguet and Pierson [13] proposed another heuristic that determines s p by bipartitioning the task chain in proportion to the length of the respective processor subchains. That is, s p is selected in such a way that W 1,sp /W 1,N is as close to the ratio p/P as possible. Miguet and Pierson [13] prove that the bottleneck value found by this heuristic has an upper bound of B * +w max .
These heuristics can be implemented in O(N +P lg N) time. The O(N) time is due to prefix-sum operation on the tasks array, after which each separator index can be found by a binary search on the prefix-summed array.
Dynamic Programming
The overlapping subproblems and the optimal substructure properties of the CCP problem enable dynamic programming solutions. The overlapping subproblems are partitioning the first i tasks onto the first p processors, for all possible i and p values. For the optimal substructure property, observe that if the last processor is not the bottleneck processor in an optimal partition, then the partitioning of the remaining tasks onto the first P − 1 processors must be optimal. Hence, the recursive definition for the bottleneck value of an optimal partition is
Here, B p i denotes the optimal solution value for partitioning the first i tasks onto the first p processors. In Eq. (3), searching for index j corresponds to searching for separator s p−1 so that the remaining subchain T j+1,i is assigned to the last processor in an optimal partition. This definition defines a dynamic programming table of size P N, and computing each entry takes O(N) time, resulting in an O(N 2 P )-time algorithm. Choi and Narahari [3] , and Manne and Olstad [12] reduced the complexity of this scheme to O(NP ) and O((N − P )P ), respectively. Pınar and Aykanat [17] presented enhancements to limit the search space of each separator by exploiting upper and lower bounds on the optimal solution value for better practical performance.
Parametric Search
Parametric search algorithms rely on two components: a probing operation to determine if a solution exists whose bottleneck value is no greater than a specified value, and a method to search the space of candidate values. The probe algorithm can be computed only in O(P lg N) time by using binary search on the prefix-summed workload array. Below, we summarize algorithms to search the space of bottleneck values.
Nicol's Algorithm
Nicol's algorithm [14] exploits the fact that any candidate B value is equal to the weight of a task subchain. A naive solution is to generate all subchain weights, sort them, and then use binary search to find the minimum value for which a probe succeeds. Nicol's algorithm efficiently searches for this subchain by considering each processor in order as a candidate bottleneck processor. For each processor P p , the algorithm does a binary search for the smallest index that will make P p the bottleneck processor. 
Bidding Algorithm
The bidding algorithm [17, 16] starts with a lower bound and proceeds by gradually increasing this bound until a feasible solution value is reached. The increments are chosen to be minimal so that the first feasible bottleneck value is optimal. Consider the partition generated by a failed probe call that loads the first P −1 processors maximally not to exceed the specified probe value. To find the next bottleneck value, processors bid with the bottleneck value that would add one more task to their domain, and the minimum bid among the processors is chosen to be the next bottleneck value. The bidding algorithm moves each one of the P separators for O(N) positions in the worst case, where choosing the new bottleneck value takes O(lg P ) time using a priority queue.
This makes the complexity of the algorithm O(NP lg P ).
Bisection Algorithms
The bisection algorithm starts with a lower and an upper bound on the solution value and uses binary search in this interval. If the solution value is known to be an integer, then the bisection algorithm finds an optimal solution. Otherwise, it is an ǫ-approximation algorithm, where ǫ is the user defined accuracy for the solution. The bisection algorithm requires O(lg(w max /ǫ)) probe calls, with O(N +P lg N lg(w max /ǫ)) overall complexity.
Pınar and Aykanat [17] enhanced the bisection algorithm by updating the lower and upper bounds to realizable bottleneck values (subchain weights).
After a successful probe, the upper bound can be set to be the bottleneck value of the partition generated by the probe function, and after a failed probe, the lower bound can be set to be the smallest value that might succeed, as in the bidding algorithm. These enhancements transform the bisection algorithm to an exact algorithm, as opposed to an ǫ-approximation algorithm.
Proposed CCP Algorithms for Heterogeneous Systems
The algorithms we propose in this section extend the techniques for homogenous CCP to heterogeneous CCP. All algorithms discussed in this section require an initial prefix-sum operation on the task-weight array W for the effi- focus only on finding the optimal solution value, since an optimal solution can be easily constructed, once the optimal solution value is known.
Unless otherwise stated, BINSEARCH represents a binary search that finds the index to the element that is closest to the target value. There are variants of BINSEARCH to find the index of the greatest element not greater than the target value, and we will state whenever such variants are needed. BINSEARCH takes four parameters: the array to search, the start and end indices of the sub-array, and the target value. The range parameters are optional, and their absence means that the search will be performed on the whole array.
Heuristics
We propose a heuristic, RB, based on the recursive bisection idea. During each bisection, RB performs a two step process. First, it divides the current processor chain P p,r into two subchains P p,q and P q+1,r . Then, it divides the current task chain T h,j into two subchains T h,i and T i+1,j in proportion to the computational powers of the respective processor subchains. That is, the task separator index i is chosen such that the ratio W h,i /W i+1,j is close to the ratio E p,q /E q+1,r , as much as possible. RB achieves optimal bisections at each level,
however, the quality of the overall partition may be far away from that of the optimal solution.
We have investigated two metrics for bisecting the processor chain: chain length and chain processing power. The chain length metric divides the current processor chain P p,r into two equal-length processor subchains, whereas the chain processing power metric divides P p,r into two equal-power subchains.
Since the first metric performed slightly better than the second one in our experiments, we will only discuss the chain length metric here. The pseudocode of the RB algorithm is given in Fig. 1 , where the initial invocation takes its parameters as (W, E, 1, P ) with s 0 = 0 and s P = N. Note that s p−1 and s r are already determined at higher levels of recursion. Wtot is the total weight of current task subchain, and Wfirst is the weight for the first processor subchain in proportion to its processing speed. We need to add W 1,s p−1 to Wfirst to seek it in the prefix-summed W array.
We also propose a generalization of Miguet and Pierson's heuristic, MP [13] .
MP computes the separator index of each processor by considering that processor as a division point for the whole processor chain. In our version, the load assigned to the processor chain P 1,p is set to be proportional to the computational power E 1,p of this subchain, as shown in Fig. 1 .
time is due to the initial prefix-sum operation on the task-weight array.
Below, we investigate the theoretical bounds on the quality of these two heuristics. We assume P is a power of 2 for simplicity.
Lemma 4.1 B RB is upper bounded by B * +w max /e min − w max /(P e min ).
Proof: We use induction, and the basis is easy to show for P = 2. For the inductive step, assume the hypothesis holds for any number of processors less than P . Consider the first bisection, where the processors are split into two subchains, each containing P/2 processors. Let the total processing power in the left subchain be E left . RB will distribute the workload array between the left and right processor subchains as evenly as possible. There will be a task t i such that the left processor subchain will weigh more than the right subchain if t i is assigned to the left subchain, and vice versa. Without loss of generality, assume that t i is assigned to the left subchain. In the worst case, t i is the maximum weighted task, and the total task weight assigned to the left subchain, W left , can be upper bounded by
Using the inductive hypothesis, the bottleneck value among the processors of the left processor subchain can be upper bounded as follows.
The same bound applies to the right processor subchain directly by the inductive hypothesis, since right processor subchain is already underloaded.
Lemma 4.2 B M P is upper bounded by B * + w max /e min .
Proof: Let the sequence s 0 , s 1 , . . . , s P be the partition constructed by MP. For a processor P p , s p is chosen to be the separator that best divides P 1,p and P p+1,P . Based on our discussion of bipartitioning quality in the proof of Lemma 4.1, W 1,sp is bounded by
So, the load of processor p is upper bounded by
Dynamic Programming
The overlapping subproblems and the optimal substructure properties of the homogenous CCP can be extended to the heterogeneous CCP, and thus enabling dynamic programming solutions. The recursive definition for the bottleneck value of an optimal partition can be derived as
for the heterogeneous case. As in the homogenous case, B , W j+1,i /e p } and max{B
That is, the recursive definition becomes:
e p , where j p i = argmin
It is clear that the search ranges of separators overlap at only one position, and thus we can compute all B Manne and Olstad [12] reduced the complexity further to O((N − P )P ) by observing that there is no merit in leaving a processor empty, and thus the search for j p i can start at p instead of 1. However, this does not apply to the heterogeneous CCP, since it might be beneficial to leave a processor empty.
We propose another DP algorithm by extending the DP+ algorithm (DP algorithm with static separator-index bounding) of Pınar and Aykanat [17] We extend the probing operation to the heterogeneous case as shown in Lemma 4.3 For a given heterogeneous CCP instance (W, N, E, P ), a feasible bottleneck value UB and a lower bound on the bottleneck value LB ; let the 
The lower bound LB can be initialized to the optimal lower bound when all processors are equally loaded as
An upper bound UB can be computed in practice with a fast and effective heuristic, and Lemma 4.1 provides a theoretically robust bound as
Parametric Search
Parametric search algorithms can be constructed with a PROBE function (either LR-PROBE or RL-PROBE given in Fig. 3) , and a method to search the space of candidate values. Below, we describe several algorithms to search the space of bottleneck values for the heterogeneous case.
Nicol's Algorithm
We revise Nicol's algorithms for heterogeneous systems as follows. The candidate B values become task subchain weights divided by processor subchain speeds. The algorithm starts with searching for the smallest j so that probing with W 1,j /e 1 succeeds, and probing with W 1,j−1 /e 1 fails. This means W 1,j−1 /e 1 < B opt ≤ W 1,j /e 1 , and thus in an optimal solution the probe function will assign the first j tasks to the first processor if it is the bottleneck processor, and the first j − 1 tasks to the first processor if not. Then the optimal solution value is the minimum of W 1,j /e 1 and the optimal solution value for partitioning the remaining subchain T j,N to P − 1 processors, since any solution with a bottleneck value less than W 1,j /e 1 will assign only the first j − 1 tasks to the first processor. Finding the j value requires lg N probes, and we repeat this search operation for all processors in order. This version of Nicol's algorithm runs in O(N +(P lg N) 2 ) time. Fig. 4 (a) displays this algorithm.
Nicol's Algorithm with Dynamic Bottleneck-Value Bounding
By keeping the largest B that succeeded and the smallest B that failed, we can improve Nicol's algorithm by eliminating unnecessary probing. Let LB and UB represent the lower bound and upper bound for B opt , respectively.
If a processor cannot update LB or UB , that processor does not make any PROBE calls. This algorithm, presented in Fig. 4(b) , is referred to as NICOL+.
In the worst case, a processor makes O(lg N) PROBE calls. But, as we will prove below, the number of probes performed by NICOL+ cannot exceed P lg (1+w max /(P e min w min )). This analysis also improves known complexities of homogeneous version of the algorithm. Lemma 4.4 describes an upper bound on the number of probes performed by NICOL+ algorithm. Lemma 4.4 The number of probes required by NICOL+ is upper bounded by P lg (1+(UB − LB ) / (P w min )).
Proof: Consider the first step of the algorithm, where we search for the smallest separator index that makes the first processor the bottleneck processor. We can restrict this search in a range that covers only those indices for which the weight of the first chain will be in the [LB , UB ] interval. If there are n 1 tasks in this range, NICOL+ will require lg n 1 probes. This means that the [LB , UB ] interval is narrowed by at least (n 1 − 1)w min after the first step.
Let k p be the number of probes by the pth processor. Since k p probes narrows the
The corresponding total number of probes is P −1 p=1 k p , which reaches its maximum when
In that case,
and thus
So, the total number of probes performed by NICOL+ is upper bounded by:
Corollary 4.5 NICOL+ requires at most P lg(1 + w max /(P e min w min )) probes for heterogeneous, and P lg(1 + w max /(P w min )) probes for homogeneous systems.
NICOL+ runs in O(N+P 2 lg N lg(1+w max /(P e min w min ))) time, with the O(P lg N)
cost of a PROBE call. In most configurations, w max /(e min w min P ) is very small, and is O(1) if P e min = Ω(w max /w min ). In that case, the runtime complexity of
Bidding Algorithm
For heterogeneous systems, the bidding algorithm uses the lower bound given in Eq. 5 for optimal bottleneck value, and gradually increases this lower bound. The bid of each processor P p , for p = 1, 2, . . . , P − 1, is calculated as W s p−1 +1,sp+1 / e p , which is equal to the load of P p if it also executes the first task of P p+1 in addition to its current load. Then, the algorithm selects the processor with the minimum bid value so that this bid value becomes the next bottleneck value to be considered for feasibility. The processors following the bottleneck processor in the processor chain are processed in order, except the last processor. The separator indices of these processors are adjusted accordingly so that the processors are maximally loaded not to exceed that new bottleneck value. The load of the last processor determines the feasibility of the current bottleneck value. If current bottleneck value is not feasible, the process repeats. functions refer to the respective priority queue operations [4] .
In the worst case, the bidding algorithm moves P separators for O(N) positions. Choosing a new bottleneck value takes O(lg P ) time using a binary heap implementation of the priority queue. Totally the complexity of the algorithm is O(NP lg P ) in the worst case. Despite this high worst-case complexity, the bidding algorithm is quite fast in practice.
Bisection Algorithm
For heterogeneous systems, the bisection algorithm can use the LB and UB values given in Eqs. 5 and 6. A binary search on this [LB , UB ] interval requires O(lg(w max /(ǫE tot ))) probes, thus leading to an O(lg(w max /(ǫE tot ))P lg N)-time algorithm, where ǫ is the specified accuracy of the algorithm. Fig. 6(a) presents this ǫ-approximation bisection algorithm. We should note that, although the homogenous version of this algorithm becomes an exact algorithm for integer-valued workload arrays by setting ǫ = 1, this is not the case for heterogeneous systems.
We enhance this bisection algorithm to be an exact algorithm for heterogeneous systems by extending the scheme proposed by Pınar and Aykanat [17] for homogenous systems. After each probe, we move lower and upper bounds to realizable bottleneck values, as opposed to the probed value. In heterogeneous systems, realizable bottleneck values are subchain weights divided by appropriate processor speeds. After a successful probe, we decrease UB to the bottleneck value of the partition constructed by the probe, and after a failed probe we increase LB to the bid value as described for the bidding algorithm in Section 4.3.3. Each probe eliminates at least one candidate bottleneck value, and thus the bisection algorithm terminates in a finite number of steps with an optimal solution. Fig. 6(b) displays the exact bisection algorithm. 
Chain Partitioning (CP) Problem for Heterogeneous Systems
In this section, we study the problem of partitioning a chain of tasks onto a set of processors, as opposed to a chain of processors. The solution to this problem is not only separators on the task chain, but also processor-tosubchain assignments. Thus, we define a mapping M as a partition Π = s 0 = 0, s 1 , . . . , s P = N of the given task chain T = t 1 , t 2 , . . . t N with s p ≤ s p+1 for 0 ≤ p < P , and a permutation π 1 , π 2 , . . . , π P of the given set of P processors P = {P 1 , P 2 , . . . , P P }. According to this mapping, the pth task subchain t s p−1 +1 , . . . , t sp is executed on processor P πp . The cost C(M) of a mapping M is the maximum subchain computation time, determined by the subchain weight and the execution speed of the assigned processor, i.e.,
We will prove that the CP problem is NP-complete. The decision problem for the CP problem for heterogeneous systems is as follows.
Given a chain of tasks T = t 1 , t 2 , . . . , t N , a weight w i ∈ Z + for each t i ∈ T , a set of processors P = {P 1 , P 2 , . . . , P P } with P < N, an execution speed e p ∈ Z + for each P p ∈ P, and a bound B, decide if there exists a mapping M of T onto P such that C(M) ≤ B.
Theorem 5.1 The CP problem for heterogeneous systems is NP-complete.
Proof: We use reduction from the 3-Partition (3P) problem. A pseudopolynomial transformation suffices, because 3P problem is NP-complete in the strong sense (i.e., there is no pseudo-polynomial time algorithm for the problem unless P=NP). The 3P problem is stated in [8] as follows.
Given a finite set A of 3m elements, a bound B ∈ Z + , and a cost c i ∈ Z + for each a i ∈ A, where a i ∈A c i = mB and each c i satisfies B/4 < c i < B/2, decide if A can be partitioned into m disjoint sets S 1 , S 2 , . . . , S m such that
For a given instance of the 3P problem, the corresponding CP problem is constructed as follows.
• The number of tasks N is m(B + 1) − 1. The weight of every (B + 1)st task is B, (i.e., w i = B for i mod (B + 1) = 0), and the weights of all other tasks are 1.
• The number of processors P is 4m − 1. The first m − 1 processors have execution speeds of B, (i.e., e p = B for p = 1, 2, . . . , m − 1), and the remaining processors have execution speeds equal to the costs of items in the 3P problem (i.e., e p = c p−m+1 for p = m, . . . , 4m − 1).
We claim that there is a solution to the 3P problem if and only if there is a mapping M with cost C(M) = 1 for the CP problem. The following observations constitute the basis for our proof.
• The processors with execution speeds of B must be mapped to tasks with weight B to have a solution with cost C(M) = 1, because the execution speeds of all other processors are ≤ B/2. These processors (tasks) serve as divider processors (tasks).
• This forces each processor to be assigned a load with value equal to its execution speed to achieve a mapping with cost C(M) = 1.
As noted above, the divider processors should be assigned to the divider tasks.
Between two successive divider tasks there is a subchain of B unit-weight tasks with total weight B, which must be assigned to a subset of processors with total execution speed B. Since there are m such subchains, the same grouping of the processors is also valid for grouping c i values in the 3P problem. Thus the 3P problem can be reduced to the CP problem, proving the CP problem is NP-hard.
The cost of a given mapping can be computed in polynomial time, thus the problem is in NP. Thus we can conclude that the chain partitioning problem for heterogeneous systems is NP-Complete.
This complexity shows that we need to resort to heuristics for practical solutions to the CP problem. With the nearly perfect balance results and extremely fast runtimes as we will present in Section 6, CCP algorithms can serve as good heuristics for the CP problem. We tried this approach by finding optimal CCP solutions for randomly ordered processor chains of a CP instance. We observed that the sensitivity to processor ordering is quite low.
You can find a description of these studies in Section 6.3. We also tried improvement techniques, where we swapped processors in the chain to decrease the bottleneck value, but the improvements were modest and could hardly compensate for the increase in runtimes.
Experimental Results

Experimental Setup
The 1D task arrays used in both CCP and CP experiments were derived from two different applications: image-space-parallel direct volume rendering and row-parallel sparse matrix vector multiplication.
Direct volume rendering experiments are performed on three curvilinear datasets from NASA Ames Research Center [1] , namely Blunt Fin (blunt), Combustion
Chamber (comb), and Oxygen Post (post). These datasets are processed using the tetrahedralization techniques described in [9] and [18] to produce threedimensional (3D) unstructured volumetric datasets. The two-dimensional (2D) workload arrays are constructed by projecting 3D volumetric datasets onto 2D screens of resolution 256 × 256 using the workload criteria of image-spaceparallel direct volume rendering algorithm described in [2] . Here, the rendering operations associated with the individual pixels of the screen constitute the computational tasks of the application. The resulting 2D task array is then mapped to a 1D task array using Hilbert space-filling-curve traversal [15] . The workload distributions of the 2D task arrays are visualized in Fig. 7 , where darker areas represent more weighted tasks. The histograms at the bottom of the 2D pictures show the weight distributions of the resulting 1D task arrays.
In the sparse matrix experiments, we consider rowwise block partitioning of the matrices obtained from University of Florida Sparse Matrix Collection [5] . In row-parallel matrix vector multiplies, the rows correspond to the tasks to be partitioned, and the number of nonzeros in each row is the weight of the corresponding task. The nonzero distributions of the sparse matrices are shown in Fig. 8 . The histograms on the right sides of the visualizations represent the number of nonzeros in each row. Table 2 displays the properties of the 1D task chains used in our experiments.
In the volume rendering dataset, the number of tasks is considerably less than the screen resolution, because zero-weight tasks are omitted. In the sparse matrix dataset, the number of tasks is equal to the number of rows.
In both CCP and CP experiments, P = 32, 64, 128, 256, 512, 1024, and 2048-way partitioning of the 1D task arrays were performed. We experimented with different variances of processor speeds, where the processors speeds were chosen uniformly distributed in the 1-4, 1-8, and 1-16 ranges.
In the experiments, the P -way partitioning of a given task chain for a given 
CCP Experiments
The proposed CCP algorithms were implemented in the Java language. Tables 3-6 compare the solution qualities of heuristics with respect to those of the optimal partitions obtained by the exact algorithms. In these tables, OPT values refer to the optimal load imbalance values. Tables 3 and 4 respectively display the percent load imbalance values obtained in mapping the volume rendering and sparse matrix task chains onto processor chains with 1-8 execution speed range. As seen in these two tables, RB performs much better than MP. Out of 63 partitioning instances, RB found better solutions than MP in all but one instance.
As seen in Tables 3 and 4 , in general, the quality gap between exact algorithms and heuristics increases with increasing number of processors. For instance, in 2048-way partitioning of the torso1 matrix, best heuristic finds a solution with 252.44% load imbalance, which means a processor is loaded more than 3.5 times the average load, causing a slowdown as the number of processors
increase. An optimal solution however, will have a load imbalance value of 27.61%, providing scalability to thousands of processors. Tables 5 and 6 As seen in Tables 5 and 6 , in general, the performance gap between heuristics and exact algorithms decrease with decreasing processor speed range. However, there exists considerable quality difference between the heuristics and exact algorithms even for the smallest 1-4 speed range.
In constructing the processor chains for the experiments, in addition to the random processor ordering, we also investigated different orderings of the pro- cessors having the same speed. In this context, we experimented with the cases where processors having the same speed ordered consecutively, assuming that such processors belong to the same homogenous cluster and hence they are naturally adjacent to each other in the processor chain. We did not observe a considerable sensitivity of the relative load balancing performance between heuristics and exact algorithms to the ordering of processors having the same speed. Tables 3 and 4 , these results reveal the superiority of RB to MP. In Tables 7 and 8 , relative performances of exact CCP algorithms show that both NICOL+ and EBS are an order of magnitude faster than DP+ and BID for both volume rendering and sparse matrix datasets. As also seen in these two tables, EBS is slightly faster than NICOL+.
It is worth highlighting that for small to medium concurrency, the time it takes EBS and NICOL+ algorithms to find optimal solutions is less than three times the time of the fastest heuristic. More precisely, on overall average, EBS takes only 147% more time than the fastest heuristic for 256-way partitioning.
On the other hand, at higher number of processors, the solution qualities of heuristics degrade significantly: on overall average, optimal solutions provide 5.35, 5.47 and 6.00 times better load imbalance values than the best heuristic for 512, 1024 and 2048-way partitionings, respectively. According the these experimental results, we recommend the use of exact CCP algorithms instead of heuristics for heterogeneous systems. Table 9 displays the variation of running time performances of the CCP al- gorithms with varying processor speed ranges for the volume rendering and sparse matrix task chains. For a better performance comparison, execution times of the algorithms were normalized with respect to those of the RB heuristic and averages of these normalized values over P are presented in the table. We should mention here that the running time of the RB heuristic does Table 9 Partitioning time averages (over P ) of the exact CCP algorithms normalized with respect to those of the RB heuristic for different processor speed ranges not change with varying processor speed range, as expected. As seen in Table 9, notable performance variation occurs only for the BIDDING algorithm whose running time generally increases with increasing processor speed range.
CP Experiments
Tables 10 and 11 display the results of our experiments to show the sensitivity of the solution quality of CP problem instances to the processor orderings for the processor speed range of 1-8. In these experiments, we find the optimal CCP solutions for R randomly ordered processor chains of a CP instance, and display geometric averages of the best and average load imbalance values over number of processors. As seen in the tables, for a fixed P , the average imbalance values almost remain the same for different values of R. Although the best imbalance values decrease with increasing R, the decreases are quite small, especially for large P . Moreover, for a fixed R, the relative difference between the best and average imbalance values decreases with increasing P . Table 10 Geometric averages (over P ) of percent load imbalance ratios for R randomly ordered processor chains for the volume rendering dataset with the processor speed range of 1-8 R = 10 R = 100 R = 1000 R = 10000 Table 11 Geometric averages (over P ) of percent load imbalance ratios for R randomly ordered processor chains for the sparse matrix dataset with the processor speed range of 1-8 R = 10 R = 100 R = 1000 R = 10000 These experimental findings show that processor ordering has only a minor effect on solution quality. This is expected since the variance among processor speeds is low, unlike the variance among task weights. Therefore, using an exact CCP algorithm on a number of randomly permuted processor chains can serve as an effective heuristic for the CP problem. Table 12 displays the results of our experiments to show the sensitivity of the solution quality of CP problem instances to the processor speed range. In these experiments, for each CP instance, we find the optimal CCP solutions for R = 10000 randomly ordered processor chains, and display the best load imbalance value. As seen in Table 12 , we do not observe a considerable sensitivity of the solution quality of the CP problem instances to the procesor speed range. Notable sensitivity is observed only for the language, Stanford, Table 12 Best percent load imbalance ratios for R = 10000 randomly ordered processor chains with different processor speed ranges and Stanford Berkeley sparse matrix datasets, which have high task weight variation (i.e., large w max /w avg value). In these datasets, load imbalance values decrease with increasing processor speed range, which possibly because the adverse effect of tasks with large weight on load imbalance can be more easily resolved by mapping them to the processors with larger execution speed.
Conclusions
We studied the problem of one-dimensional partitioning of nonuniform workload arrays with optimal load balancing for heterogeneous systems. We investigated two cases: chain-on-chain partitioning, where a chain of tasks is partitioned onto a chain of processors; and chain partitioning, where the task chain is partitioned onto a set of processors (i.e., permutation of the processors is allowed). We showed that chain-on-chain partitioning algorithms for homogenous systems can be revised to solve this partitioning problem for heterogeneous systems, without altering computational complexities of these algorithms. We proved that the chain partitioning problem is NP-complete, and empirically showed that exact CCP algorithms can serve as an effective heuristic, for the CP problem. Our experiments proved the effectiveness of our techniques, as the exact algorithms work much better than heuristics, and balanced work decompositions can be achieved even for high numbers of processors.
Availability
The algorithms proposed in this work are implemented in Java language and made publicly available at http://www.cs.bilkent.edu.tr/˜tabak/hetccp/.
