A Many-core Machine Model for Designing Algorithms with Minimum
  Parallelism Overheads by Haque, Sardar Anisul et al.
A Many-core Machine Model for Designing Algorithms with Minimum
Parallelism Overheads
Sardar Anisul Haque1, Marc Moreno Maza2, and Ning Xie3
1shaque4@csd.uwo.ca , Department of Computer Science, University of Western Ontario
2moreno@csd.uwo.ca , Department of Computer Science, University of Western Ontario
3nxie6@csd.uwo.ca , Department of Computer Science, University of Western Ontario
Abstract
We present a model of multithreaded computation,
combining fork-join and single-instruction-multiple-
data parallelisms, with an emphasis on estimating par-
allelism overheads of programs written for modern
many-core architectures. We establish a Graham-Brent
theorem for this model so as to estimate execution time
of programs running on a given number of streaming
multiprocessors. We evaluate the benefits of our model
with four fundamental algorithms from scientific com-
puting. In each case, our model is used to minimize par-
allelism overheads by determining an appropriate value
range for a given program parameter; moreover experi-
mentation confirms the model’s prediction.
1 Introduction
Designing efficient algorithms targeting implementa-
tion on hardware acceleration technologies (multi-core
processors, graphics processing units (GPUs), field-
programmable gate arrays) creates major challenges for
computer scientists. A first difficulty is to define models
of computations retaining the computer hardware char-
acteristics that have a dominant impact on program
performance. Therefore, in addition to specify the ap-
propriate complexity measures for the algorithms to be
analyzed, those models must consider the relevant pa-
rameters characterizing the abstract machine executing
those algorithms. A second difficulty is, for a given
model of computations, to combine its complexity mea-
sures so as to determine the “best” algorithm among
different algorithmic solutions to a given problem.
In the fork-join concurrency model [1] two complex-
ity measures, the work T1 and the span T∞, and one
machine parameter, the number P of processors, can
be combined in results like the Graham-Brent theo-
rem [1, 7] or the Blumofe-Leiserson theorem (Theo-
rems 13 & 14 in [2]) in order to compare algorithm
running time estimates. We recall that the Graham-
Brent theorem states that the running time TP on P
processors satisfies TP ≤ T1/P + T∞. A refinement of
this theorem supports the implementation (on multi-
core architectures) of the parallel performance analyzer
Cilkview [10]. In this context, the running time TP
is bounded in expectation by T1/P + 2δT̂∞, where δ is
a constant (called the span coefficient) and T̂∞ is the
burdened span.
With the pervasive ubiquity of many-core processors,
in particular GPUs, it is desirable for models of com-
putations to combine explicitly both task-based paral-
lelism and data-based parallelism. In fact, popular con-
currency platforms (CilkPlus [11, 18], CUDA [16, 17]
and OpenCL [22]) offer both forms of parallelism, with
language constructs specific to each case. Meanwhile,
classical models of parallel computations, like the fork-
join concurrency model or the PRAM model [21, 5],
do not distinguish between task-based and data-based
parallelism, which is too simplistic for analyzing algo-
rithms targeting the above concurrency platforms. In
addition, the PRAM model fails to retain important
features of actual computers related to memory traffic,
such as cache complexity [4].
An attempt to integrate memory contention into the
PRAM model has been made with the QRQW (Queue
Read Queue Write) PRAM, defined in [6] by Gibbons,
Matias and Ramachandran. The authors also enhance
the Graham-Brent theorem. However, they conflate in
a single quantity time spent in arithmetic operations
and time spent in read/write accesses. We believe that
1
ar
X
iv
:1
40
2.
02
64
v1
  [
cs
.D
C]
  3
 Fe
b 2
01
4
this unification is not appropriate for recent many-core
processors, such as GPUs, for which the ratio between
one global memory read/write access and one floating
point operation can be in the 100’s.
In a recent paper, Ma, Agrawal and Chamberlain [13]
introduce the TMM (Threaded Many-core Memory)
model which retains many important characteristics of
GPU-type architectures, including several machine pa-
rameters such as throughput and coalesced granularity.
Moreover, TMM analysis can rank algorithms from slow
to fast, given those machine parameters, while their
running time estimate on P cores is not based on the
Graham-Brent theorem.
Many works, such as [14, 12], targeting code opti-
mization and performance prediction of GPU programs
are related to our work. However, these papers do not
define an abstract model in support of the analysis of
algorithms.
In this paper, we propose a many-core machine model
(MMM) which aims at minimizing parallelism over-
heads of algorithms targeting implementation on GPUs.
In the design of this model, we insist on the following
features:
- Two-level DAG programs. Defined in Section 2,
this aspect captures the two levels of parallelism
(fork-join and SIMD) of heterogeneous programs
(like a CilkPlus program using #pragma simd [18]
or a CUDA program with the so-called dynamic
parallelism [17]).
- Parallelism overhead. We introduce this complex-
ity measure in Section 2.3 with the objective of an-
alyzing communication and synchronization costs.
- A Graham-Brent theorem. We combine three com-
plexity measures (work, span and parallelism over-
head) and two machine parameters (size of local
memory and data transfer throughput) in order to
estimate the running time of an MMM program on
P streaming multiprocessors. This result is Theo-
rem 1 in Section 2.4.
Our proposed model extends both the fork-join concur-
rency model and PRAM-based models, with an empha-
sis on parallelism overheads resulting from communica-
tion and synchronization costs.
We sketch below how, in practice, we use this model
in order to minimize parallelism overheads of programs
targeting GPUs. Consider an MMM program P, that
is, an algorithm expressed in the MMM model. As-
sume that a program parameter s (like the number of
threads per thread-block or the amount of data trans-
fer per thread-block) can be arbitrarily chosen within
some range S while preserving the specifications of P.
Let s0 be a particular value of s which corresponds to
an instance P0 of P that, in practice, can be seen as a
naive (or simply initial) version of the algorithm.
Assume that, when s varies within S, the work, say
WP(s), does not vary much, that is, WP(s0)/WP(s)
∈ Θ(1) hold. Assume also that the parallelism overhead
OP(s) varies more substantially, say OP(s0)/OP(s) ∈
Θ(s− s0). Then, we determine a value smin ∈ S maxi-
mizing the ratio OP(s0)/OP(s). Next, we use our ver-
sion of Graham-Brent theorem (more precisely, we use
Corollary 1) to check that the upper bound for the run-
ning time of P(smin) is less than that of P(so). If this
holds, we view P(smin) as a solution of our problem
of algorithm optimization (in terms of parallelism over-
heads).
To demonstrate and evaluate the benefits of our
model, and the above optimization strategy, we applied
them successfully to four fundamental algorithms in sci-
entific computing, see Sections 3 to 6. These four ap-
plications are the univariate polynomial division and
multiplication, radix sort and the Euclidean algorithm.
Each of them satisfies the hypotheses of the above opti-
mization strategy. Our theoretical analysis, for three of
these applications, has led to an optimized implemen-
tation reported in [9] and publicly available1. while,
for the other application, this analysis has explained a
posteriori the experimental observations of [19].
2 A many-core machine model
The model of parallel computations presented in this
paper aims at capturing communication and synchro-
nization overheads of programs written for modern
many-core architectures. One of our objectives is to
optimize algorithms by techniques like reducing redun-
dant memory accesses. The reason for this optimization
is the fact that, on actual GPUs, global memory latency
is approximately 400 to 800 clock cycles, while local
memory latency is only a few clock cycles. This mem-
ory latency difference, when not properly taken into ac-
count, may have a dramatically negative impact on pro-
gram performance. As mentioned in the introduction,
this hardware feature of GPUs cannot be captured by
1 Our algorithms are implemented in CUDA and publicly
available with benchmarking scripts from http://www.cumodp.
org/.
2
the well-studied PRAM model. Indeed, any memory
access, as well as any integer arithmetic operation, is
performed in unit time on a PRAM machine.
This latter, as well as other limitations, have mo-
tivated variants of the PRAM model, including our
work. Another motivation, mentioned above, is the fact
that popular concurrency platforms offer task-based
and data-based parallelisms, with language constructs
specific to each case.
As specified in Sections 2.1 and 2.2, our many-core
machine model (MMM) retains many of the character-
istics of modern GPU architectures programming mod-
els like CUDA or OpenCL. However, in order to sup-
port algorithm analysis with an emphasis on parallelism
overheads, as defined in Section 2.3 and 2.4, MMM ab-
stract machines admit a few simplifications and limita-
tions with respect to actual many-core devices.
2.1 Characteristics of the abstract
many-core machines
Figure 1: Overview of a many-core machine program
Architecture. An MMM abstract machine possesses an
unbounded number of streaming multiprocessors (SMs)
which are all identical. Each SM has a finite number
of processing cores and a fixed-size local memory. An
MMM machine has a two-level memory hierarchy, com-
prising an unbounded global memory with high latency
and low throughput while SMs local memories have low
latency and high throughput.
Programs. An MMM program is a directed acyclic
graph (DAG) whose vertices are kernels (defined here-
after) and edges indicate serial dependencies, similarly
to the instruction stream DAGs of the fork-join mul-
tithreaded concurrency model. A kernel is an SIMD
(single instruction, multiple data) program capable of
branches and decomposed into a number of thread-
blocks. Each thread-block is executed by a single SM
and each SM executes a single thread-block at a time.
Similarly to a CUDA program, an MMM program spec-
ifies for each kernel the number of thread-blocks and the
number of threads per thread-block, following the same
extended function-call syntax. Figure 1 depicts the dif-
ferent types of components of an MMM program.
Scheduling and synchronization. At run time, an
MMM machine schedules thread-blocks (from the same
or different kernels) onto SMs, based on the dependen-
cies specified by the edges of the DAG and the hard-
ware resources required by each thread-block. Threads
within a thread-block can cooperate with each other
via the local memory of the SM running the thread-
block. Meanwhile, thread-blocks interact with each
other via the global memory. In addition, threads
within a thread-block are executed physically in paral-
lel by an SM. Moreover, the programmer cannot make
any assumptions on the order in which thread-blocks of
a given kernel are mapped to the SMs. Hence, MMM
programs run correctly on any fixed number of SMs.
Memory access policy. All threads of a given thread-
block can access simultaneously any memory cell of the
local memory or the global memory: read/write con-
flicts are handled by the CREW (concurrent read, ex-
clusive write) policy. However, read/write requests to
the global memory by two different thread-blocks can-
not be executed simultaneously. In case of simultane-
ous requests, one thread-block is chosen randomly and
served first, then the other is served.
For the purpose of analyzing program performance,
we define two machine parameters:
U : Time (expressed in clock cycles) to transfer one
machine word between the global memory and the
local memory of any SM.
Z: Size (expressed in machine words) of the local
memory of any SM.
Thus, the quantity 1/U is a throughput measure and
has the following property. If α and β are the numbers
of words respectively read and written to the global
memory by one thread of a thread-block B, then the
total time TD spent in data transfer between the global
memory and the local memory of an SM executing B
satisfies
TD ≤ (α+ β)U. (1)
3
We observe that, on actual machines, some hardware
characteristics may reduce data transfer time (like co-
alesced accesses to global memory) or the contribution
of the latter on the overall running time (like concur-
rent execution of thread-blocks on the same SM with
fast context switching in order to hide data transfer la-
tency). Other hardware characteristics (like partition
camping) may increase data transfer time. However,
this reduced or increased transfer time will remain in
O(α + β). Therefore, the fact that MMM machines do
not have these hardware characteristics will not lead us
to incorrect asymptotic upper bounds when estimating
the running time of MMM programs.
Similarly, the local memory size Z unifies different
characteristics of an SM and, thus, of a thread-block.
Indeed, each of the following quantities is at most equal
to Z: the number of cores of an SM, the number of
threads of a thread-block, the number of words in a
data transfer between the global memory and the local
memory of an SM.
Relation (1) calls for another comment. One could
expect the introduction of a third machine parameter,
say V , which would be the time to execute one local
operation (arithmetic operation, read/write in the local
memory), such that, if σ is the total number of local
operations performed by one thread of a thread-block
B, then the total time TA spent in local operations by
an SM executing B would satisfy TA ≤ σV. Therefore,
for the total running time T of the thread-block B, we
would have
T = TA + TD ≤ σV + (α+ β)U. (2)
Instead of introducing this third machine parameter V ,
we let V = 1. In other words, U is the ratio between
the time to transfer a machine word and the time to
execute a local operation. We assume U > 1.
Finally, Relation (2) requires another justification.
Indeed, this estimate suggests that, at every clock cycle,
every thread is either performing an arithmetic opera-
tion or accessing memory. For this to be true, condi-
tional branches responsible for code divergence [8] need
to be eliminated by techniques like code replication [20],
such that all threads execute the same code path. How-
ever, in a sake of clarity, we will not perform this trans-
formation in the MMM algorithms stated in the subse-
quent sections. In fact, we verified experimentally that
the impact of code divergence on the performance of
our algorithms is negligible.
2.2 Many-core machine programs
Recall that each MMM program P is a DAG (K, E),
called the kernel DAG of P, where each node K ∈ K
represents a kernel, and each edge E ∈ E records the
fact that a kernel call must precede another kernel call.
In other words, a kernel call can be executed provided
that all its predecessors in the DAG (K, E) have com-
pleted their execution.
Synchronization costs. Recall that each kernel decom-
poses into thread-blocks and that all threads within
a given kernel execute the same serial program, but
with possibly different input data. In addition, all
threads within a thread-block are executed physically
in parallel by an SM. It follows that MMM kernel
code needs no synchronization statement, like CUDA’s
syncthreads(). Consequently, the only form of syn-
chronization taking place among the threads executing
a given thread-block is that implied by code divergence.
This latter phenomenon can be seen as parallelism over-
head. As mentioned, an MMM machine handles code
divergence by eliminating the corresponding conditional
branches via code replication2 and the corresponding
cost will be captured by the complexity measures (work,
span and parallelism overhead) of the MMM model.
Since no synchronization occurs between thread blocks,
we turn our attention to scheduling.
Scheduling costs. Since an MMM abstract machine
has infinitely many SMs and since the kernel DAG defin-
ing an MMM program P is assumed to be known when
P starts to execute, scheduling P’s kernels onto the SMs
can be done in time O(Γ) where Γ is the total length
of P’s kernel code. Thus, we shall neglect those costs
in comparison to the costs of transferring data between
SMs’ local memories and the global memory. We also
note that assuming that, the kernel DAG is known when
P starts to execute, allows us to focus on parallelism
overheads resulting from this data transfer. Extending
MMM machines to support programs whose instruction
stream DAGs unfold dynamically at run time and in-
tegrating the resulting scheduling costs as in [10] is left
for future work.
Thread block DAG. Since each kernel of the program
P decomposes into a finite number of thread-blocks, we
map P to a second graph, called the thread block DAG
of P, whose vertex set B(P) consists of all thread-blocks
2 Conditional branch elimination in SIMD code is dis-
cussed in https://www.kernel.org/pub/linux/kernel/people/
geoff/cell/ps3-linux-docs/CellProgrammingTutorial/
BasicsOfSIMDProgramming.html.
4
of the kernels of P and such that (B1, B2) is an edge
if B1 is a thread-block of a kernel preceding the kernel
of the thread-block B2 in P. This second graph defines
two important quantities:
N(P): number of vertices in the thread-block DAG of
P,
L(P): critical path length (where length of a path is the
number of edges in that path) in the thread-block
DAG of P.
2.3 Complexity measures for the many-
core machine model
Consider an MMM program P given by its kernel DAG
(K, E). Let K ∈ K be any kernel of P and B be any
thread-block of K. We define the work of B, denoted
by W (B), as the total number of local operations per-
formed by all threads of B. We define the span of B,
denoted by S(B), as the maximum number of local op-
erations performed by a thread of B. Let α and β be the
maximum numbers of words read and written (from the
global memory) by a thread of B. Then, we define the
overhead of B, denoted by O(B), as (α + β)U . Next,
the work (resp. overhead) W (K) (resp. O(K)) of the
kernel K is the sum of the works (resp. overheads) of
its thread-blocks, while the span S(K) of the kernel K
is the maximum of the spans of its thread-blocks.
We consider now the entire program P. The work
W (P) of P is defined as the total work of all its kernels
W (P) =
∑
K∈K
W (K).
Regarding the graph (K,E) as a weighted-vertex graph
where the weight of a vertex K ∈ K is its span S(K),
we define the weight S(γ) of any path γ from the
first executing kernel to the last executing kernel as
S(γ) =
∑
K∈γ S(K). Then, we define the span S(P)
of the program P as
S(P) = max
γ
S(γ).
Finally, we define the overhead O(P) of the program P
as the total overhead of all its kernels
O(P) =
∑
K∈K
O(K).
Observe that, according to Mirsky’s theorem [15], the
number pi of parallel steps in P (i.e. anti-chains in
(K, E)) is equal to the maximum length of a path in
(K, E) from the first executing kernel to the last execut-
ing kernel.
2.4 A Graham-Brent theorem with
overhead
Theorem 1 We have the following estimate for the
running time TP of the program P when executed on
P SMs:
TP ≤ (N(P)/P + L(P))C(P) (3)
where C(P) = maxB∈B(P) (S(B) +O(B)),
The proof is similar to that of the original result. One
observes that the total number of complete steps (for
which P thread-blocks can be scheduled by a greedy
scheduler) is at most N(P)/P while the number of in-
complete steps is at most L(P). Finally, C(P) is an ob-
vious upper bound for the running time of every step,
complete or incomplete.
The proof of the following corollary follows from
Theorem 1 and from the fact that costs of scheduling
thread-blocks onto SMs are neglected.
Corollary 1 Let K be the maximum number of thread
blocks along an anti-chain of the thread-block DAG of P.
Then the running time TP of the program P satisfies:
TP ≤ (N(P)/K + L(P))C(P) (4)
Corollary 1 allows us to estimate the running time of
an MMM program as a function of the machine param-
eters Z, U and the thread-block DAG of P. Thus this
estimate does not depend on the number of SMs in use
to execute P.
3 Polynomial division
Our first application of the MMM deals with univariate
polynomial division. One division step, like a Gaus-
sian elimination step, performs a linear combination of
two vectors. Hence, the ideas developed in this section
would apply to linear algebra and are not specific to
polynomial arithmetic.
Let K be a field and a, b ∈ K[X] be univariate poly-
nomials with coefficients in K, with b 6= 0. Assume
that each arithmetic operation (addition, subtraction,
multiplication and division) in K can be done with a sin-
gle machine word operation on an MMM machine. Let
5
n and m be non-negative integers such that we have
deg(a) = n− 1 and deg(b) = m− 1. Thus, n and m are
the number of terms (null or not) of a and b, respec-
tively. Let q and r be the quotient and the remainder
in the Euclidean division of a by b. Thus, (q, r) is a
unique couple of univariate polynomials over K such
that a = q · b+ r and deg(r) < deg(b) both hold.
In Section 3.1, we present a multithreaded algorithm
computing (q, r) on an MMM machine. We call this
algorithm naive since it implements the natural idea
that, at each division step, each thread computes one
coefficient of the next intermediate remainder. In Sec-
tion 3.2, we propose a second MMM algorithm with the
goal of minimizing data transfer. We analyze both al-
gorithms with the complexity measures 3 of Section 2.3.
In Section 3.3, we compare their running time estimates
given by Corollary 1.
3.1 Naive algorithm
In each kernel call, each thread computes one coefficient
of an intermediate remainder polynomial by means of
one multiplication and one subtraction in the coefficient
field K. Algorithm 1 repeatedly calls the kernel stated
in Algorithm 2. The latter performs one division step
in parallel. Let ` be the number of threads in a thread-
block, we note that each kernel uses dm` e thread-blocks.
We observe that each thread of a kernel reads/writes 3
to 5 words4 in the global memory without storing them
in the local memory.
Figure 2: Naive division: illustration of a thread-block
reading coefficients from a,b and writing to r.
We also notice that Algorithm 1 performs exactly
n − m + 1 consecutive calls to Algorithm 2. Never-
theless, Algorithm 2 works correctly even if, after one
division step, the degree of an intermediate remainder
drops by more than one. This implementation choice is
3See the detailed analysis in the form of executable
Maple worksheet: http://www.csd.uwo.ca/~nxie6/projects/
mmm/division_overall.mw
4 Indeed, in each thread-block, all active threads read 2 coef-
ficients of a and writes one back; moreover, the thread with ID 0
computes and writes back one coefficient of q.
relevant to dense polynomials, which are our primary
interest. In the sparse case, the degree of an intermedi-
ate remainder needs to be computed after each division
step. Figure 2 shows a division step within a thread-
block after running Algorithm 2 once.
Algorithm 1: NaivePlainDivisionGPU(a, b)
Input: a, b ∈ K[X] with deg(a) ≥ deg(b) that is,
n− 1 ≥ m− 1.
Output: q, r ∈ K[X] s. t. a = q · b+ r and
deg(r) < m− 1.
1 Let ` be the number of threads in a thread block;
2 Let q be array of size n−m+ 1 with coefficients in
K;
3 for (i = (n− 1) . . . (m− 1)) do
4 NaiveDivKernel≪ dm/`e, `≫ (a, b, q, i,m− 1);
5 if a[0] == · · · == a[m− 1] == 0 then
6 return [q, 0];
7 Compute k, the maximum i such that a[i] 6= 0
holds;
8 Let r be array of size k + 1 s.t. r[i] = a[i] for
0 ≤ i ≤ k;
9 return [q, r];
Algorithm 2: NaiveDivKernel(a, b, q, i, db)
Input: a, b, q ∈ K[X], db = deg(b), i ∈ N, db ≤ i.
1 Let blockID, blockDim, threadID be the block id,
number of threads per block, thread id respectively;
2 j =blockID·blockDim + threadID;
3 if j ≤ db then
4 if j == 0 then
/* writing to global memory */
5 q[i− db] = (b[db])−1 · a[i];
/* updating a in global memory */
6 a[j + i− db] -= b[j] · (b[db])−1 · a[i];
We denote by W1, S1 and O1, the work, span and
overhead of Algorithm 1, respectively. Since each
thread-block performs 2 `+1 arithmetic operations and
each thread makes at most 5 accesses to the global
memory, we obtain W1 =
(n−m+1)m (2 `+1)
` , S1 =
3 (n−m+1) and O1 = 5 (n−m+1)mU` . To apply Corol-
lary 1, we shall compute the quantities N(P), L(P)
and C(P) defined in Section 2. We denote them here
by N1, L1 and C1, respectively. One can easily check
that we have N1 =
(n−m+1)m
` , L1 = n − m + 1 and
C1 = 3 + 5U .
6
3.2 Optimized algorithm
In each kernel, each thread updates a number of co-
efficients (instead of just one) of an intermediate re-
mainder polynomial repeatedly during a number of di-
vision steps, thus without synchronizing data with other
thread-blocks. The motivation of this new scheme is to
minimize the amount of data transferred between global
and local memories. Similarly to the scheme in Sec-
tion 3.1, Algorithm 3 repeatedly calls Algorithm 4. Fig-
ure 3 shows s division steps within a thread-block after
running Algorithm 4 once. More specifically, given an
integer s ≥ 1, Algorithm 4 performs sufficiently many
division steps (at most s) such that the output polyno-
mial r is either zero or its degree is less than that of a
at least by s. To this end, each thread-block
• uses 3 s threads,
• loads the coefficients of Xd, Xd−1, . . ., Xd−s+1
from a (resp. b), that we call the s-head of a (resp.
b), where d is the degree of a (resp. b), see Lines
3-4,
• loads 2 s (resp. 3 s) consecutive coefficients of a
(resp. b), say Xd1 , Xd1−1, . . ., Xd1−2s+1 (Xd2 ,
Xd2−1, . . ., Xd2−3s+1) for some integer d1 > 0
(resp. d2 > 0) which depends on the thread and
thread-block IDs.
Figure 3: Optimized division: illustration of a thread-
block reading coefficients from a,b and writing to r.
The s-heads of a and b are used to keep track of the
leading coefficient of an intermediate remainder during
the entire execution of a kernel, see Lines 14-15 of Algo-
rithm 4. This task is achieved by the first s threads of
a thread-block and requires s+(s−1)+ · · ·+1 = s(s+1)2
arithmetic operations. Meanwhile, the other 2 s threads
update 2s coefficients of a, see Lines 16-17, which takes
2 s · 2 s arithmetic operations. Finally, at Lines 12-13
(resp. 18-19) the quotient q (resp. the intermediate
remainder a) is updated in the global memory.
We denote the work, span and overhead of the op-
timized algorithm by Ws, Ss and Os, respectively.
Since each thread makes at most 9 accesses to the
global memory, we obtain Ws =
(n−m+1)m (9 s+1)
4 s ,
Ss = 3 (n −m + 1) and Os = 9 (n−m+1)mU2 s2 . To ap-
ply Corollary 1, we shall compute the quantities N(P),
L(P) and C(P) defined in Section 2. We denote them
here by Ns, Ls and Cs, respectively. One can easily
check that we have Ns =
(n−m+1)m
2 s2 , Ls =
n−m+1
s and
Cs = 3 s+ 9U .
Algorithm 3: OptimizePlainDivisionGPU(a, b, s)
Input: a, b ∈ K[X] with deg(a) ≥ deg(b) that is,
n− 1 ≥ m− 1 and s ∈ N.
Output: q, r ∈ K[X] s. t. a = q · b+ r and
deg(r) < m− 1.
1 Let ` = 3s be the number of threads in a thread
block;
2 Let q be array of size n−m+ 1 with coefficients in
K;
3 for (i = n− 1; i ≥ m− 1; i = i− s) do
4 OptDivKer≪ dm/(2s)e, `≫ (a, b, q, i,m− 1, s);
5 if a[0] == · · · == a[m− 1] == 0 then
6 return [q, 0];
7 Compute k, the maximum i such that a[i] 6= 0
holds;
8 Let r be array of size k + 1 s.t. r[i] = a[i] for
0 ≤ i ≤ k;
9 return [q, r];
3.3 Comparison of running time esti-
mates
Before following the algorithm optimization strategy
stated in the introduction, we replace ` and s by Z/2
and Z/7, respectively, since 2` or 7s coefficients must
fit into the local memory, that is, 2` ≤ Z and 7s ≤ Z.
Now, we observe that the work ratio W1/Ws is asymp-
totically constant:
W1
Ws
=
8 (Z + 1)
9Z + 7
, (5)
and S1/Ss = 1. Then, we compute the overhead ratio:
O1
Os
=
20
441
Z. (6)
We observe that this substantial improvement in data
transfer overhead is done at a fairly low expense in
terms of work overhead. Next, applying Corollary 1, the
7
Algorithm 4: OptDivKer(a, b, q, i, db, s)
Input: a, b, q ∈ K[X], i ∈ N, db = deg(b) and s ∈ N.
1 Let sAc, sBc, sA, sB be local arrays of size
s, s, 2s, 3s respectively each with coefficients in K;
2 j =blockID·blockDim + threadID; t = threadID;
/* Reading from global memory */
3 if t < s then
4 sAc[t] = a[i− t]; sBc[t] = b[db − t];
sB[t] = b[db − 2s blockID−t];
5 if t ≥ s then
6 sA[t− s] = a[i− s− 2s blockID−t];
sB[t] = b[db − 2s blockID−t];
7 for (k = 0; (k < s) ∧ (i+ k ≥ d); k = k + 1) do
8 while (k < s) ∧ ( sAc[k] == 0) do
9 k = k + 1;
10 if k ≥ s then
11 break;
12 if j == 0 then
/* Writing q to global memory */
13 q[i− db − k] = sAc[k] · b[db]−1;
14 if k ≤ t < s then
15 sAc[t] -= sBc[t− k] · sAc[k] · b[db]−1;
16 if t ≥ s then
17 sA[t− s] -= sB[t− k] · sAc[k] · b[db]−1;
18 if t ≥ s then
/* Writing back a to global memory */
19 a[i− s− 2s blockID−t] = sA[t− s];
running time estimate of the naive and optimized algo-
rithms are bounded over, respectively by T1 = (N1/K1+
L1) ·C1 with K1 = m` and Ts = (Ns/Ks +Ls) ·Cs with
Ks =
m
2 s . We compute the ratio R = T1/Ts, that is,
R =
(3 + 5U)Z
3 (Z + 21U)
. (7)
We observe that R is larger than 1 if and only if Z >
12.6 holds. The above condition clearly holds on actual
GPU architectures. Thus the optimized algorithm (that
is for s > 1) is overall better than the naive one (that
is for s = 1). This is verified in practice in [9], where s
is set to 512.
4 Radix sort
In [19], the authors present a CUDA implementation of
the radix sort algorithm. Assuming that all entries are
non-negative integers of bit-size c, this CUDA imple-
mentation sorts n entries by performing cs passes where
s is a program parameter. In each pass, each thread-
block first loads and sorts its tile using s iterations of
1-bit split, and write back its 2s-entry digit histogram
and sorted data. Then, it performs a prefix sum over
the histogram stored in a column-major order. Finally,
each thread-block copies its elements to the correct out-
put position.
Let ` be the number of threads per thread-block. Fol-
lowing [19], we assume that each thread deals with 4
elements. Then, for each thread-block, 4 ` original el-
ements, 4 ` sorted elements and a 2s-entry digit his-
togram must fit into the local memory, hence 8 `+ 2s ≤
Z. Thus, the maximum overhead per thread-block is
9U (loading 4 elements, writing back 4 sorted elements
and 1 value of the histogram 5).
We compute the work, span and overhead, re-
spectively as Ws = c
(
22 s `+s+12
4 s ` +
2s+20 2s `
16 s `2 + 1
)
n +
c (16+192 `)
16 s , Ss = c
(
8 log2 `+
16
s log2 `+ 41 +
54
s
)
,
Os = cU
(
9
2 s ` n+
17 2s
16 s `2 n− 1s
)
.
We view the case s = 1 as a naive radix sort algo-
rithm, with work, span and overhead given by W1, S1
and O1, respectively. Letting n escaping to infinity, the
work ratio W1/Ws is asymptotically equivalent to:
W1
Ws
∼ 104 s `
2 + 92 s `+ 2 s
88 s `2 + 16 `2 + 20 2s `+ 4 s `+ 48 `+ 2s
. (8)
Similarly, the span ratio S1/Ss =
s (24 log2 `+95)
(8 s+16) log2`+41 s+54
is
asymptotically constant, meanwhile the overhead ratio
is:
O1
Os
∼ s (72 `+ 34)
72 `+ 17 2s
. (9)
We notice that if 2s = Θ(`) holds, we can reduce the
overhead by a factor of s while increasing the work by
a constant factor only. Meanwhile, with 2s = Θ(`2),
we increase both the work and the overhead by a non-
constant factor. In this latter scenario, we could not
optimize the naive algorithm in any sense.
To apply Corollary 1, we compute the three quanti-
ties Ns =
c
s
(
1
2 ` +
2s
8 `2
)
n, Ls =
5 c
s and Cs = s (41+
8 log2 `) + 12 + 9U . Then, the ratio R of the running
time estimate between the naive algorithm and that for
an arbitrary s is:
R =
(14 `+ 2) (53 + 8 log2 `+ 9U) s
(14 `+ 2s) (41 s+ 8 s log2 `+ 12 + 9U)
.
5See the detailed analysis in the form of executable
Maple worksheet: http://www.csd.uwo.ca/~nxie6/projects/
mmm/sorting_overall.mw
8
We then replace s by log2 `, since we would like to de-
termine whether the overall running time is better or
worse in this case. The quotient of the leading terms in
R becomes
7 ` log2 ` (8 log2 `+ 9U)
60 ` log22 `
. (10)
This ratio is larger than 1 for ` < 215.75U , which is re-
alistic. Therefore, letting 2s = Θ(`), the data transfer
overhead is reduced by a factor of s and leads to an op-
timized algorithm. This is consistent with the empirical
results of [19].
5 Polynomial multiplication
Let a and b be polynomials as in Section 3, and f = a×b
be their product. Our multiplication algorithm is based
on the well-known long multiplication6 and consists of
two phases. During the multiplication phase, every co-
efficient of a is multiplied with every coefficient of b; the
resulting products are accumulated in an intermediate
array, denoted by M . Then, during the addition phase,
these accumulated products are added together to form
the polynomial f .
For this application, the program parameter (as de-
fined in the introduction) is an integer s > 0, represent-
ing for each thread-block, the number of coefficients of
b to be multiplied by a number of coefficients of a in the
coefficients multiplication phase, as well as the number
of sums per thread in the addition phase. 7 Algorithm 5
is the top level algorithm: it performs the multiplica-
tion phase via Algorithm 6, and the addition phase by
repeated calls to Algorithm 7.
We denote by ` the number of threads per thread-
block. In multiplication phase, each thread-block reads
s `+ s− 1 coefficients of a, s coefficients of b, computes
` s2 products, followed by ` s (s−1) of additions. Thus,
each thread-block contributes s ` partial sums to the
two-dimensional array M , whose format is x · y, where
x = ms and y = n + s − 1. This multiplication phase,
illustrated by Figure 4, loads s ` + s − 1 coefficients of
a to guarantee the correctness of the results in M .
In the addition phase, the x rows of the auxiliary ar-
ray M are added pairwise in log2 x parallel steps. After
6http://en.wikipedia.org/wiki/Multiplication_
algorithm
7See the detailed analysis in the form of executable
Maple worksheet: http://www.csd.uwo.ca/~nxie6/projects/
mmm/multiplication_overall.mw
each step, the number of rows in M is reduced by half,
until we obtain only one row that is, f = a × b. To be
more specific, when adding rows i and j (for i < j) at
a given parallel step, shown in Figure 5, each thread-
block loads s ` elements of M [i] and M [j], respectively,
and then adds M [j] to M [i].
Figure 4: Multiplication phase: illustration of a thread-
block reading coefficients from a,b and writing to the
auxiliary array M .
Figure 5: Addition phase: illustration of a thread-block
reading and writing to the auxiliary array M .
Algorithm 5: PlainMultiplicationGPU(a, b, f, s)
Input: a, b ∈ K[X] with n− 1 := deg(a) and
m− 1 := deg(b) and an integer s ≥ 1.
Output: f ∈ K[X] and f = a× b.
1 y = n+ s− 1; x = m/s;
2 Let M be an array of size x · y with coefficients in
K;
3 ` is the number of threads per block;
4 MulKer≪ x · y/(s · `), `≫ (a, b,M, n,m, s);
5 for (i = 0; i ≤ log2 s; i = i+ 1) do
6 AddKer≪ x · y/(2i+1 s · `), `≫ (M,f, y, s, x, i);
7 return f ;
Considering any thread-block of the multiplication
phase, we notice that s coefficients of b and s `+s−1 co-
efficients of a are loaded and s ` results are written back
to global memory. Hence 2 s `+2 s−1 coefficients must
9
Algorithm 6: MulKer(a, b,M, n,m, s)
Input: a, b,M ∈ K[X] and an integer s ≥ 1.
1 ` = blockDim; t = threadID; j =blockID·`+ t;
2 Let B′ and A′ be two local arrays in K of size s
and ` · s+ s− 1 respectively;
/* copying from global */
3 i = s · bs · j/(n+ s− 1)c+ t;
4 if i < m ∧ t < s then
5 B′[t] = b[i];
6 i = s · (j mod n+s−1s );
7 for (k = 0; k < s; k = k + 1) do
8 if i+ k · `+ t < n then
9 A′[k · `+ t] = a[i+ k · `+ t];
10 if i− s+ t > 0 ∧ t < s− 1 then
11 A′[` · s+ t] = a[i− s+ t];
12 else if t < s− 1 then
13 A′[` · s+ t] = 0;
14 for (e = 0; e < s; e = e+ 1) do
15 h = 0;
16 for (k = 0; k < s; k = k + 1) do
17 h += A′[e · `+ k] ·B′[k];
/* writing to global memory */
18 M [s · j + e] = h;
fit into local memory, that is, we have 2 s `+2 s−1 ≤ Z.
Next, we compute the work, span and parallelism over-
head, respectively as Ws = (2m− 12 ) (n+ s− 1), Ss =
2 s2 + s log2
m
s − s and Os = (n+s−1) (5ms+2m−3 s
2)U
s2 ` .
We also obtain the quantities characterizing the thread
block DAG that are required in order to apply Corol-
lary 1: Ns =
(n+s−1) (2m−s)
s2 ` , Ls = log2
m
s + 1 and
Cs = s (2 s− 1) + 2U (s+ 1).
We set s = 1 in Algorithm 5, and view it as a “naive
algorithm”. The work ratio W1/Ws =
n
n+s−1 , is asymp-
totically constant as n escapes to infinity. The span
ratio S1/Ss =
log2m+1
s (log2 (m/s)+2 s−1) shows that Ss grows
asymptotically with s. The parallelism overhead ratio,
letting m = n:
O1
Os
=
n s2 (7n− 3)
(n+ s− 1) (5n s+ 2n− 3 s2) . (11)
We observe that, as n escape to infinity, this latter ratio
is asymptotically equivalent to s. Applying Corollary 1,
let R be the ratio of the running time estimate between
the naive algorithm and that for an arbitrary s. We
Algorithm 7: AddKer(M,f, y, s, x, i)
Input: M,f,∈ K[X] and y, s, x, i are positive
integers.
1 j =blockID·blockDim + threadID;
2 h = s · j mod y;
3 k = 2i − 1 + 2i+1 bs · j/yc;
4 if h < 2i s then
5 for (e = 0; e < s; e = e+ 1) do
6 f [k · s+ h+ e] += M [k · x+ h+ e];
7 else
8 for (e = 0; e < s; e = e+ 1) do
9 M [(k + 2i) · x+ h− 2i s+ e] +=
M [k · x+ h+ e];
obtain
R =
(n log2 n+ 3n− 1) (1 + 4U)
(n log2
n
s + 3n− s) (2U s+ 2U + 2 s2 − s)
,
(12)
which is essentially 2 log2 ns log2 (n/s)
. This latter ratio is
smaller than 1, such that the “initial” algorithm (that
is for s = 1) performs better. This also indicates that
increasing s makes the algorithm performance worse. In
practice shown in Table 1, setting s = 4 performs best,
while with larger s, the running time becomes slower,
which is coherent with our theoretical analysis.
6 The Euclidean algorithm
for polynomials
Let a and b be polynomials as defined in Section 3. In
Section 6.1, we present a simple multithreaded algo-
rithm that computes GCD(a, b) on an MMM machine.
We call it naive (like Algorithm 1) as this algorithm
also performs one division step within one kernel. In
Section 6.2, we describe another algorithm on an MMM
machine that reduces the overhead of data transfer by
performing several division steps within one kernel. Fi-
nally, in Section 6.3 we compare those two algorithms
by means of Corollary 1. A detailed analysis in the form
of executable Maple worksheet is available at8.
8http://www.csd.uwo.ca/~nxie6/projects/mmm/euclidean_
overall.mw
10
6.1 Naive algorithm
Algorithm 8 computes GCD(a, b). Similarly to the naive
division of Algorithm 1, it calls a kernel (stated as Algo-
rithm 9) that completes one division step. The former
algorithm calls the latter one at most n+m− 2 times.
Algorithm 9 is the same as Algorithm 2 except that it
checks the current degrees of both a and b so as to de-
cide which one takes the role of the divisor, and then it
completes a division step. To do so, we use an array st
of length 2 to store the degrees of a and b during each
division step. The fact that the degree of an intermedi-
ate remainder may be dropped by more than 1 is taken
into account, so that Algorithm 9 works correctly even
if the GCD is computed before n + m − 2 kernel calls.
After n + m − 2 division steps, either a or b becomes
zero or constant. Algorithm 8 returns the other poly-
nomial as GCD. Figure 6 shows one Euclidean division
step within a thread-block after running Algorithm 9
once.
Figure 6: Naive Euclidean: illustration of a thread-
block reading coefficients from a,b and writing to g.
We denote by ` the number of threads per thread-
block. We compute the work, span and overhead, re-
spectively as W1 =
m (2n `+n+`−1)
` , S1 = 3 (m + n − 2)
and O1 =
5mU (n+`+1)
` . To apply Corollary 1, one
can easily check that those three quantities are N1 =
m (n+`+1)
` , L1 = m+ n− 2 and C1 = 3 + 5U .
6.2 Optimized algorithm
In each kernel, thread-blocks collectively perform at
most s division steps, instead of one, and update s
coefficients from both a and b. After a division step,
the degree of the dividend polynomial is decreased by
at least one, and then in the next division step, co-
efficients from the divisor polynomial are adjusted by
Algorithm 8: NaivePlainGcdGPU(a, b)
Input: a, b ∈ K[X] with deg(a) ≥ deg(b) that is,
n− 1 ≥ m− 1.
Output: g ∈ K[X], s.t. g =GCD(a, b).
1 int st[] = {deg(a),deg(b)};
2 Let ` be the number of threads in a thread block;
3 for (i = 0; i < n+m− 2; i = i+ 1) do
4 NaivePlainGcdKernel≪ dm/`e, `≫ (a, b, st);
5 if a is a zero or constant polynomial then
6 Compute kb the maximum i s.t. b[i] 6= 0 holds;
7 Let g be array of size kb + 1 with coefficients in
K s.t. g[i] = b[i] for 0 ≤ i ≤ kb;
8 else
9 Compute ka the maximum i s.t. a[i] 6= 0 holds;
10 Let g be array of size ka + 1 with coefficients in
K s.t. g[i] = a[i] for 0 ≤ i ≤ ka;
11 return g;
one or more shift operations. Thus, we need 2 s coeffi-
cients from both a and b to be sure that after s division
steps we correctly have s coefficients of both a and b.
Since the consecutive thread-blocks have s common co-
efficients of both a and b, the number of thread-blocks
is min(deg(a)+1s ,
deg(b)+1
s ). A thread-block also needs
s-head coefficients of both a and b to run s division
steps. Thus, each thread-block uses 3 s threads (like
Algorithm 4). Algorithm 10 is our top level optimized
algorithm for computing GCD(a, b). Figure 7 shows
s Euclidean division steps within a thread-block after
running the kernel Algorithm 11 once.
Figure 7: Optimized Euclidean: illustration of a thread-
block reading coefficients from a,b and writing to ga,gb.
We compute the work, span and overhead, respec-
tively as Ws = (
9
4 +
6
s )m
2+
(
9
2 n+
n
2 s +
87
8 s+
23
2
)
m−
345
16 s
2 − 774 s, Ss = 3n + 3m and Os = 8mU (n+s)s2 .
To apply Corollary 1, one can easily check that those
11
Algorithm 9: NaivePlainGcdKernel(a, b, st )
Input: a, b ∈ K[X] and st[] stores the current degree
of a and b.
1 j =blockID·blockDim + threadID;
2 if st[0] ≥ st[1] > 0 ∧ j < st[1] then
3 k = j+st[0]−st[1];
4 a[k] = a[k]− b[j] · a[st[0]] · b[st[1]]−1;
5 if j == st[1]− 1 then
6 while (st[0] ≥ 0) ∧ (a[st[0]] = 0) do
7 st[0] = st[0]− 1;
8 else if 0 < st[0] < st[1] ∧ j < st[0] then
9 k = j+st[1]−st[0];
10 b[k] = b[k]− a[j] · b[st[1]] · a[st[0]]−1;
11 if j == st[0]− 1 then
12 while (st[1] ≥ 0) ∧ (b[st[1]] = 0) do
13 st[1] = st[1]− 1;
three quantities are Ns =
mn
s2 +
m
s , Ls =
n
s +
m
s and
Cs = 3 s+ 8U .
Algorithm 10: OptimizedPlainGcdGPU(a, b, s)
Input: a, b ∈ K[X] with deg(a) ≥ deg(b) that is,
n− 1 ≥ m− 1 and an integer s > 1.
Output: g ∈ K[X], s.t. g =GCD(a, b).
1 int st[2] = {deg(a),deg(b)};
2 Let ` = 3s be the number of threads in a thread
block;
3 for (i = 0; i < n+m− 2; i = i+ s) do
4 OptGcdKer≪ dm/se, `≫ (a, b, s, st);
5 if a is a zero or constant polynomial then
6 Compute kb the maximum i s.t. b[i] 6= 0 holds;
7 Let g be array of size kb + 1 with coefficients in
K s.t. g[i] = b[i] for 0 ≤ i ≤ kb;
8 else
9 Compute ka the maximum i s.t. a[i] 6= 0 holds;
10 Let g be array of size ka + 1 with coefficients in
K s.t. g[i] = a[i] for 0 ≤ i ≤ ka;
11 return g;
6.3 Comparison of running time esti-
mates
Since we have 2` ≤ Z and 6s ≤ Z, we replace ` and
s by Z/2 and Z/6, respectively, and assume that m
equals to n. We first observe that the ratio W1/Ws is
asymptotically constant, essentially 8 (Z+1)3 (9Z+52) , and the
span ratio S1/Ss =
n+m−2
n+m is asymptotically 1. Next,
we compute the overhead ratio O1/Os, that is,
O1
Os
=
5
48
Z (2n+ 2 + Z)
6n+ Z
. (13)
We see that the parallelism overhead improvement is
done at a fairly low expense in terms of work overhead.
Applying Corollary 1, we denote the running time esti-
mate ratio R of the naive algorithm over the optimized
one, that is,
R =
(6n− 2 + Z) (3 + 5U)Z
(18n+ Z) (Z + 16U)
. (14)
When n escapes to infinity, the ratio R is equivalent to
(3 + 5U)Z
3 (Z + 16U)
. (15)
We observe that this ratio is larger than 1 if and only
if Z > 9.6 holds. This condition clearly holds and the
optimized algorithm is overall better. This is verified
in practice [9], where s is set to 256, also shown in
Table 2.
7 Conclusion
We have presented a model of multithreaded compu-
tation combining the fork-join and SIMD parallelisms,
with an emphasis on estimating parallelism overheads,
so as to reduce communication and synchronization
costs in GPU programs.
Four applications illustrated the effectiveness of our
model. In each case, we determined a range of values
for a program parameter in order to optimize the cor-
responding algorithm in terms of parallelism overheads.
Experimentation validated the model prediction.
Our order of magnitude estimates for the program
parameter of radix sort [19] agrees with the empirical
results of that paper.
For the Euclidean algorithm, our running time es-
timates match those obtained with the Systolic VLSI
Array Model [3]. Moreover, our CUDA code [9] im-
plementing this optimized Euclidean algorithm runs in
linear time w.r.t to the input polynomials degree, up to
degree 10,000.
For polynomial multiplication, our theoretical anal-
ysis implies that the program parameter s must be as
small as possible. In practice, we could vary this pa-
rameter between 2 and 32 and we found that the op-
timal value was 4. Since certain hardware features are
12
not integrated into the model, we found that the model
prediction was also useful in that case.
References
[1] R. D. Blumofe and C. E. Leiserson. Space-efficient
scheduling of multithreaded computations. SIAM
J. Comput., 27(1):202–229, 1998.
[2] R. D. Blumofe and C. E. Leiserson. Scheduling
multithreaded computations by work stealing. J.
ACM, 46(5):720–748, 1999.
[3] R. P. Brent and H. T. Kung. Systolic VLSI arrays
for polynomial GCD computation. IEEE Trans.
Computers, 33(8):731–736, 1984.
[4] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ra-
machandran. Cache-oblivious algorithms. In
FOCS, pages 285–298, 1999.
[5] P. B. Gibbons. A more practical PRAM model. In
Proc. of SPAA, pages 158–168, 1989.
[6] P. B. Gibbons, Y. Matias, and V. Ramachandran.
The Queue-Read Queue-Write PRAM model: Ac-
counting for contention in parallel algorithms.
SIAM J. on Comput., 28(2):733–769, 1998.
[7] R. L. Graham. Bounds on multiprocessing tim-
ing anomalies. SIAM J. on Applied Mathematics,
17(2):416–429, 1969.
[8] T. D. Han and T. S. Abdelrahman. Reducing
branch divergence in GPU programs. In Proc. of
GPGPU-4, pages 3:1–3:8, 2011.
[9] S. A. Haque and M. Moreno Maza. Plain polyno-
mial arithmetic on GPU. In J. of Physics: Conf.
Series, volume 385, page 12014. IOP Publishing,
2012.
[10] Y. He, C. E. Leiserson, and W. M. Leiserson. The
Cilkview scalability analyzer. In Proc. of SPAA,
pages 145–156, 2010.
[11] C. E. Leiserson. The cilk++ concurrency plat-
form. The Journal of Supercomputing, 51(3):244–
257, 2010.
[12] W. Liu, W. Muller-Wittig, and B. Schmidt. Per-
formance predictions for general-purpose computa-
tion on GPUs. In Proc. of ICPP, page 50, 2007.
[13] L. Ma, K. Agrawal, and R. D. Chamberlain. A
memory access model for highly-threaded many-
core architectures. In Proc. of ICPADS, pages 339–
347, 2012.
[14] L. Ma and R. D. Chamberlain. A performance
model for memory bandwidth constrained applica-
tions on graphics engines. In Proc. of ASAP, pages
24–31. IEEE Computer Society, 2012.
[15] L. Mirsky. A dual of Dilworth’s decomposition the-
orem. The American Math. Monthly, 78(8):876–
877, 1971.
[16] J. Nickolls, I. Buck, M. Garland, and K. Skadron.
Scalable parallel programming with CUDA.
Queue, 6(2):40–53, 2008.
[17] C. NVIDIA. NVIDIA next generation CUDA com-
pute architecture: Kepler GK110, 2012.
[18] A. D. Robison. Composable parallel patterns with
Intel Cilk Plus. Computing in Science & Engineer-
ing, 15(2):0066–71, 2013.
[19] N. Satish, M. Harris, and M. Garland. Designing
efficient sorting algorithms for manycore GPUs. In
Proc. of IPDPS 2009, pages 1–10. IEEE, 2009.
[20] J. Shin. Introducing control flow into vectorized
code. In Proceedings of the 16th International Con-
ference on Parallel Architecture and Compilation
Techniques, PACT ’07, pages 280–291, Washing-
ton, DC, USA, 2007. IEEE Computer Society.
[21] L. J. Stockmeyer and U. Vishkin. Simulation of
parallel random access machines by circuits. SIAM
J. Comput., 13(2):409–422, 1984.
[22] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A
parallel programming standard for heterogeneous
computing systems. Computing in science & engi-
neering, 12(3):66, 2010.
13
Algorithm 11: OptGcdKer(a, b, s st)
Input: a, b ∈ K[X], an integer s > 1 and st[] stores
the current degree of a and b.
1 Let sAc, sBc, sA, sB be local arrays of size
s, s, 2s, 2s respectively with coefficients in K;
2 local integers u = v = w = e = 0;
3 j =blockID·blockDim + threadID; t = threadID;
/* copying from global memory */
4 if t < s then
5 sAc[t] = a[st[0]− t];
6 sBc[t] = b[st[1]− t];
7 if t ≥ s then
8 sA[t− s] = a[st[0]− s blockID−t]
sB[t− s] = b[st[1]− s blockID−t];
9 for (k = 0; k < s; k = k + 1) do
10 if (st[0] ≥st[1]∧st[1] ≥ 0) then
11 if (u+ t < s) ∧ (v + t < s) then
12 sAc[u+ t] -= sBc[v + t]·sAc[u]·sBc[v]−1;
13 if (u+ t ≥ s) ∧ (v + t ≥ s) then
14 sA[w + t− s] -=
sB[e+ t− s]·sAc[u]·sBc[v]−1;
15 if t == 0 then
16 while sAc[u] = 0 do
17 u = u+ 1; w = w + 1;
st[0] =st[0]− 1;
18 if (st[1] ≥st[0]) ∧ (st[0] ≥ 0) then
19 if (u+ t < s) ∧ (v + t < s) then
20 sBc[v + t] -= sAc[u+ t]·sBc[v]·sAc[u]−1;
21 if (u+ t ≥ s) ∧ (v + t ≥ s) then
22 sB[e+ t− s] -=
sA[w + t− s]·sBc[v]·sAc[u]−1;
23 if t == 0 then
24 while sBc[v] = 0 do
25 v = v + 1; e = e+ 1;
st[1] =st[1]− 1;
26 if t ≥ s then
/* writing to global memory */
27 a[st[0]− s blockID−t] = sA[t− s];
28 b[st[1]− s blockID−t] = sB[t− s] ;
29 if j == min(st[0], st[1]) then
30 Update st array with the new degree of a and b;
n m s = 2 s = 4 s = 8 s = 16
4000 4000 0.004107 0.002775 0.003727 0.004811
5000 1000 0.001684 0.001204 0.001432 0.003645
5000 5000 0.008007 0.005846 0.007808 0.011715
6000 1000 0.001830 0.001298 0.001500 0.004129
6000 6000 0.010359 0.007614 0.010166 0.014352
7000 1000 0.001972 0.001381 0.001614 0.004551
7000 7000 0.013068 0.008821 0.012166 0.016938
8000 1000 0.002111 0.001456 0.001739 0.005015
8000 8000 0.016029 0.010853 0.014877 0.019740
Table 1: Running time (secs) of the polynomial multi-
plication algorithm with polynomials a (deg(a) = n−1)
and b (deg(b) = m− 1) and the parameter s
n m s = 1 s = 512
2000 1500 0.058 0.024
3000 2500 0.108 0.039
4000 3500 0.158 0.053
5000 4500 0.203 0.069
6000 5000 0.235 0.056
7000 6000 0.282 0.066
8000 7000 0.324 0.076
9000 8000 0.367 0.087
10000 9000 0.411 0.097
Table 2: Running time (secs) of the Euclidean algorithm
for polynomials a (deg(a) = n − 1) and b (deg(b) =
m− 1) with the parameter s
14
