Simultaneous Solving of Batched Linear Programs on a GPU by Gurung, Amit & Ray, Rajarshi
Received <day> <Month>, <year>; Revised <day> <Month>, <year>; Accepted <day> <Month>, <year>
DOI: xxx/xxxx
ARTICLE TYPE
Simultaneous Solving of Batched Linear Programs on a GPU
Amit Gurung*1 | Rajarshi Ray2
1Department of Computer Science &
Engineering, National Institute of
Technology Meghalaya, Shillong, India
2Department of Computer Science &
Engineering, National Institute of
Technology Meghalaya, Shillong, India
Correspondence
*Amit Gurung, Department of Computer
Science & Engineering, National Institute of
Technology Meghalaya. Email:
amitgurung@nitm.ac.in
Present Address
Department of Computer Science &
Engineering, National Institute of
Technology Meghalaya, Shillong - 793003,
India
Abstract
Linear Programs (LPs) appear in a large number of applications and offloading them
to a GPU is viable to gain performance. Existing work on offloading and solving an
LP on a GPU suggests that there is performance gain generally on large sized LPs
(typically 500 constraints, 500 variables and above). In order to gain performance
from a GPU, for applications involving small to medium sized LPs, we propose
batched solving of a large number of LPs in parallel. In this paper, we present the
design and implementation of a batched LP solver in CUDA, keeping memory coa-
lescent access, low CPU-GPU memory transfer latency and load balancing as the
goals. The performance of the batched LP solver is compared against sequential
solving in the CPU using the open source solver GLPK (GNU Linear Programming
Kit) and the CPLEX solver from IBM. The evaluation on selected LP benchmarks
from the Netlib repository displays a maximum speed-up of 95× and 5×with respect
to CPLEX and GLPK solver respectively, for a batch of 1푒5 LPs. We demonstrate
the application of our batched LP solver to enhance performance in the domain of
state-space exploration of mathematical models of control systems design.
KEYWORDS:
Linear programming; Batched linear programs; GPU; CUDA; Simplex method
1 INTRODUCTION
Traditional CPU computations are now increasingly being carried out on CPU-GPU heterogeneous systems, by offloading
data parallel tasks to a GPU for accelerating performance. Some of the application domains where GPUs have been used for
performance acceleration include medical image processing (1, 2), weather research and forecasting (WRF) (3), Proteomics
(to speed-up peptide spectrum matching (4)), signal processing for radio astronomy (5), simulation of various physical and
mechanical systems (using variants of Monte Carlo algorithm) (6, 7) and large scale graph processing (8). However, drawing
performance from a GPU requires insights on its architecture. Some of the common challenges of computing in a heterogeneous
ar
X
iv
:1
80
2.
08
55
7v
1 
 [c
s.D
C]
  2
1 F
eb
 20
18
2 Amit Gurung ET AL
architecture are load balancing, efficient memory access and an effective mapping of computations in the SIMD paradigm of
computing.
Linear Programming is a method of maximizing or minimizing a linear objective function subject to a set of linear constraints.
Linear programs (LPs) appear extensively in a large number of application domains such as business process modeling to
maximize profit, economics to design optimized demand-supply model (9), transport assignment (10), job scheduling (11) and
packets routing in computer networks, to name just some. Our work is motivated from applications that require solving a large
number of small-size independent LPs. We propose a hybrid CPU-GPU LP solver which can solve a batch of many independent
problems simultaneously. Our work assumes the setting that computations begin in a CPU where LPs are created, batched and
then offloaded to a GPU for an accelerated solution. The solutions are transferred back to the CPU from the GPU for further
processing. As an example, we show an application of our batched LP solver in model-based design of control systems. In model-
based design, state-space exploration is a standard technique to analyze models of control systems. State of the art methods and
tools for state-space exploration, heavily rely on solving many independent LPs (12, 13). Moreover, the LPs are generally of
small size (the size of an LP is the number of constraints × number of variables). We show that using our batched LP solver, the
state-space exploration tools can improve the performance significantly.
Prior work on solving an LP on a GPU and on multi-GPU architectures are many. The focus of almost all such works has been
on methods to improve the performance of algorithms to solve one LP at a time. Performance gain is reported generally when
offloading large LPs of size 500 (500 constraints, 500 variables) and above (14, 15, 16, 17, 18, 19). It is seen that for small size
LPs, the time spent in offloading the problems from CPU to GPUmemory is more than the time gained with parallel execution in
the GPU. Our work in this paper target applications that involves solving small to medium size LPs, but many of them. Although
modern LP solvers like CPLEX (20) andGLPK (21) are very efficient in solving small LPs, solvingmany of them one by onemay
consume considerable time. Note that using any of the prior work to solve an LP at a time in GPUwill not provide acceleration in
such applications since they perform well only on large LPs. We show that with batched computation, performance acceleration
can be achieved even for small LPs (e.g. LPs of size 5) for a considerably large batch size, where batch size refers to the number
of LPs in a batch. We present a CUDA C/C++ implementation of a solver which implements the simplex method (22), with
an effort to keep coalescent memory accesses, efficient CPU-GPU memory transfer and an effective load balancing. To the best
of our knowledge, this is the first work in the direction of batched LP solving on a GPU. Batched computations on a GPU to
draw performance on small problem instances is recently adopted in (23, 24, 25, 26, 27, 28). The solver source and necessary
instructions for repeatability evaluation can be found at https://bitbucket.org/rajgurung777/simplexprojects. We report solutions
of LPs of dimension up to 511 × 511 (511 variables, 511 constraints) with our solver. Beyond a sufficiently large batch size,
our implementation shows significant gain in performance compared to solving them sequentially in the CPU using the GLPK
library (21), an open source LP solver and the CPLEX solver from IBM. The evaluation on selected LP benchmarks from the
Amit Gurung ET AL 3
Netlib repository displays a maximum speed-up of 95× and 5× with respect to CPLEX and GLPK solver respectively, for a
batch of 1푒5 LPs. In addition, we consider a special class of LPs having the feasible region as an hyper-rectangle and exploit
the fact that these can be solved cheaply without using the simplex algorithm. We implement this special case LP solver as part
of the solver.
The rest of the paper is organized as follows. Related works are discussed in Section 2. A motivating application of batched
solving of small LPs is discussed in Section 3. In Section 4, we discuss the simplex method that is needed to appreciate the
rest of the paper. Section 5 illustrates our CUDA implementation for solving batched LPs on a GPU, with memory coalescence,
effective load balancing and efficient CPU-GPU memory transfer using CUDA streams. Section 6 presents an evaluation of our
solver on batched solving of LP benchmarks in comparison to solving the same sequentially with GLPK and CPLEX. Section
7 demonstrates the performance enhancement in state-space exploration of models of control systems, on using our solver. We
conclude in Section 8.
2 RELATEDWORK
A multi-GPU implementation of the simplex algorithm in (14) reports a speed-up of 2.93× on dense randomly generated LPs
of dimension 1000 × 1000. An average speed-up of 12.7× has been reported for problems of dimension 8000 × 8000 or higher
on a single GPU as compared to a sequential CPU implementation. An implementation of the revised simplex method using
the OpenGL graphics library is reported in (15). An average speed-up of 18× is reported, compared to the GLPK library, for
randomly generated dense problems of size 2700 × 2700 or higher. The paper also reports that the number of iterations in the
simplex procedure is considerably less when the “projected steepest-edge” pivoting rule is used as against Dantzig’s “maximum
reduced cost” rule. A GPU implementation of the revised simplex algorithm is also reported in (16) with a speed-up of 2× to
2.5× in comparison to a serial ATLAS-based CPU implementation, for LPs of dimension 1400 to 2000. Automatically Tuned
Linear Algebra Software (ATLAS(29)) is a library providing BLAS (Basic Linear Algebra Subprograms) APIs for C and Fortran.
BLAS (30) is a specification that prescribes routines for basic vector and matrix operations. A GPU-based implementation of the
Revised Simplex algorithm and a Primal-Dual Exterior Point Simplex Algorithm (PDEPSA) is presented in (17). Pre-processing
techniques on LPs have been proposed in order to gain performance. The LP solving algorithms use the steepest-edge pivoting
rule which the authors claim to show the best performance in comparison to the other pivoting rules (31). Experimental results
on Netlib (32), Kennington and Mèszàros benchmarks shows that, on average the PDEPSA is 2.30× and 32.25× faster than the
MATLAB’s multi-threaded built-in function linprog, which implements the interior point method, and a sequential CPU-based
PDEPSA respectively. ACPU-GPU hybrid implementation of the Simplex algorithm is proposed in (33). The conditional control
instructions are executed in the CPU and the GPU execute the compute intensive tasks. Experimental result display a maximum
4 Amit Gurung ET AL
speed-up of 120× for LPs of size 7000. A single and a multi-GPU CUDA implementation of the standard simplex method with
the steepest edge pivoting rule is proposed in (18). A comparison of their implementation with an open source solver, CLP is
provided. It is shown empirically, that the performance of the revised simplex implementation (CLP) degrades with an increase
in the density of the LP. However, the performance of their simplex implementation remains stable even when the LP density
increases. Another implementation of the standard simplex method is presented in (19) for solving dense LP problems with
double precision arithmetic. An average speed-up of 12.61× is shown for LPs of size greater than 3000 as compared to the
sequential CPU counterpart. We observe that almost all prior works report speed-up when the LP size is generally 500 or above.
(33) is an exception which shows a marginal speed-up on LPs of size 100.
3 MOTIVATING APPLICATION
In model-based design of control systems, a standard technique of analysis is to compute the state-space of the model using
exploration algorithms. Properties of the control system such as safety and stability are analyzed by observing the computed
state-space. In this section, we consider two open-source tools that perform state-space exploration of control systems with linear
dynamics, namely SpaceEx (12) and XSpeed (13). These tools can analyze systems modeled using a mathematical formalism
known as hybrid automaton (34). A conservative over-approximation of the exact state space is computed by both the tools.
The state of the art state-space exploration algorithm in these tools compute the state-space as a union of convex sets, each
having a symbolic representation, known as the support function representation (35). The algorithm requires a conversion of
these convex sets from its symbolic support function representation to concrete convex polytope representation, in order to
have certain operations efficient. This conversion involves solving a number of linear programs. Moreover, the precision of the
conversion and consequently, the precision of the computed state-space depends on the number of LPs solved. Table 1 shows
the number of LPs and its dimension that these tools solve for a fairly accurate state-space computation over a time horizon of
just 100 seconds, on some standard control systems benchmarks.
Benchmark LP Dimension Number of LPs
Fourth Order Filtered Oscillator 6 7.2푒7
Eight Order Filtered Oscillator 10 2.0푒8
Helicopter Controller 28 1.568푒9
TABLE 1 A large number of LP solving is required for a fairly accurate state-space computation.
We see that the number of LPs to be solved in the above examples is in the order of 1푒9which cannot be solved in practical time
limits even by the fast modern LP solvers like GLPK or CPLEX, when solved sequentially. Figure 1 shows the computed state-
space of a Filtered Oscillator model using the tool XSpeed. This computation required solving 7.2푒7 LPs. Note that although a
Amit Gurung ET AL 5
solver like CPLEX take approximately 0.001 seconds to solve an LP of dimension 6 in a modern CPU, it will take nearly 7 hours
to solve 7.2푒7 LPs sequentially. Therefore, we believe that there is a need to accelerate applications where such bulk LP solving
is necessary. A brief description of the control systems benchmarks of Table 1 and the performance improvement when the
tool XSpeed use our batched LP solver is reported in Section 7.
−0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
x
y
FIGURE 1 State-space of a fourth order filtered oscillator model
4 LINEAR PROGRAMMING
A linear program in standard form is maximizing an objective function under the given set of linear constraints, represented as
follows:
푚푎푥푖푚푖푧푒
푛∑
푗=1
푐푗푥푗 (1)
subject to the constraints
푛∑
푗=1
푎푖푗푥푗 ≤ 푏푖 푓표푟 푖 = 1, 2, ..., 푚 (2)
and
푥푗 ≥ 0 푓표푟 푗 = 1, 2, ..., 푛 (3)
In Expression (1),∑푛푗=1 푐푗푥푗 is the objective function to bemaximized and Inequality (2) shows the푚 constraints over 푛 variables.
Inequality (3) shows the non-negativity constraints over 푛 variables. An LP in standard form can be converted into slack form
6 Amit Gurung ET AL
by introducing 푚 additional slack variables (푥푛+푖), one for each inequality constraint, to convert it into an equality constraint,
as shown below:
푥푛+푖 = 푏푖 −
푛∑
푗=1
푎푖푗푥푗 , 푓표푟 푖 = 1, ..., 푚 (4)
An algorithm that solves LP problems efficiently in practice is the simplex method described in (22). The variables on the left-
hand side of the Equation (4) are referred as basic variables and those on the right-hand side are non-basic variables. The initial
basic solution of an LP is obtained by assigning its non-basic variables to zero. The initial basic solution may not be always
feasible (when one or more of the 푏푖’s are negative, resulting in the violation of the non-negativity constraint). For such LPs,
the simplex method employs a two-phase algorithm. In the first phase, a new auxiliary LP is formed by having a new objective
function 푧, which is the sum of the newly introduced artificial variables. The simplex algorithm is employed on this auxiliary
LP and it is checked if the optimal solution to the objective function is zero. If a zero optimal is found then it implies that the
original LP has a feasible solution and the simplex method initiates the second phase. In the second phase, the feasible slack
form obtained from the first phase is considered and the original objective function is restored with appropriate substitutions
and elimination of the artificial variables. The simplex algorithm is then employed to solve the LP.
Prior to the simplex method, many LP solvers apply preconditioning techniques such as a simple geometric mean scaling in
combination with equilibration to reduce the condition number of the constraint matrix in order to decrease the computational
effort for solving an LP (36, 37, 38, 39). In this work, we do not apply any pre-conditioning on the LP for simplicity and use the
simplex algorithm described in the following section.
4.1 The Simplex Algorithm
The simplex algorithm is an iterative process of solving a LP. Each iteration of the simplex algorithm attempts to increase the
value of the objective function by replacing one of the basic variables (also known as the leaving variable), by a non-basic
variable (called the entering variable). The exchange of these two variables is obtained by a pivot operation. The index of the
leaving and the entering variables are called the pivot row and pivot column respectively. The simplex algorithm iterates on a
tabular representation of the LP, called the simplex tableau. The simplex tableau stores the coefficients of the non-basic, slack
and artificial variables in its rows. It contains auxiliary columns for storing intermediate computations. In our implementation,
we consider a tableau of size 푝 × 푞, where 푝 = 푚 + 1 and 푞 = 푛 + sum of slack and artificial variables + 2. The (푚 + 1)th
row stores, the best solution to the objective function found until the last iteration, along with the coefficients of the non-basic
variables in the objective function.
There are two auxiliary columns, the first column stores the index of the basic variables and the second stores 푏푖’s of inequality
(2). Figure 2 shows a schematic of the simplex tableau.
Amit Gurung ET AL 7
Index b x1      x2   . . .    xn xn + 1  xn + 2  . . .  xn + m a1 a2  . . .  as
Index of
basic
variables
Bounds of
the
constraints
Coefficients of
non-basic
variables
Coefficients of
slack variables
Coefficients of
artificial variables
unused OptimalSolution 
Coefficients of non-basic variable in objective function
(used to determine entering variable)
FIGURE 2 Formation of the Simplex Tableau.
Step 1: Determine the entering variable.
At each iteration, the algorithm identifies a new entering variable from the non-basic variables. It is called an entering variable
since it enters the set of basic variables. The choice of the entering variable is with the goal that increasing its value from 0
increases the objective function value. The index of the entering variable is referred to as the pivot column. The most common
rule for selecting an entering variable is by choosing the index 푒 of the maximum in the last row of the simplex tableau (excluding
the current optimal solution).
Step 2: Determine the leaving variable.
Once the pivot column is determined (say 푒), the algorithm identifies the row index with the minimum positive ratio (푏푖∕−푎푖,푒),
say 퓁, called the pivot row. The variable 푥퓁 is called the leaving variable because it leaves the set of basic variables. This ratio
represents the extend to which the entering variable 푥푒 (in Step 1) can be increased without violating the constraints.
Step 3: Obtain the new improved value of the objective function.
The algorithm then performs the pivot operation which updates the simplex tableau such that the new set of basic variables are
expressed as a linear combination of the non-basic ones, using substitution and rewriting. An improved value for the objective
function is found after the pivot operation.
The above steps are iterated until the halt condition is reached. The halt condition is met when either the LP is found to be
unbounded or the optimal solution is found. An LP is unbounded when no new leaving variable can be computed, i.e., when
the ratio (푏푖∕ − 푎푖,푒) in Step 2 is either negative or undefined for all 푖. An optimal solution is obtained when no new entering
variable can be found, i.e., the coefficients of the non-basic variables in the last row of the tableau are all negative values
5 SIMULTANEOUS SOLVING OF BATCHED LPS ON A GPU
We present our CUDA implementation that solves batched LPs in parallel on a GPU. In this discussion, we shall refer a CPU by
host and a GPU by device. The LP batching is performed on the host and transferred to the device. Our solver implementation
8 Amit Gurung ET AL
assumes that all the LPs in a batch are of the same size. The batch size is adjustable, depending on the device memory size and
LP size. Our batching routine considers the maximum batch size that can be accommodated in the device memory.
5.1 CPU-GPU Memory Transfer and Load Balancing
First, we allocate device memory (global memory) from the host, that is required for creating a simplex tableau for the LPs in
the batch. The maximum number of LPs that can be batched depends on the size of the device global memory in the device.
The tableau for every LP in the batch is populated with all the coefficients and indices of the variables in the host side, before
transferring to the device. To speed-up populating the tableau in the host, we initialize the tableau in parallel using OpenMP
threads. Once initialized, the Simplex tableaux are copied from the host to the device memory (referred to as H2D-ST in Figure
6 ). The LP batching routine is shown in Algorithm 1. The GPU kernel modifies the tableau to obtain solutions using the simplex
method for every LP in the batch and copies back from the device to the host memory (referred to as D2H-res in Figure 6 ). We
discuss further on our CPU-GPU memory transfer using CUDA streams for efficiency in Section 5.4.
Algorithm 1 Batching Routine: 푁 – the number of LP problems present in the data structure 푙푖푠푡퐿푃 . Computed results are
returned in 푅
1: procedure BATCHING(푁, 푙푖푠푡퐿푃 ,푅)
2: 푔푝푢푀푒푚 ← 푔푒푡푀푒푚푆푖푧푒() ⊳ get GPU’s global memory size
3: 푙푝푆푖푧푒 ← 푔푒푡퐿푃푆푖푧푒() ⊳ get memory requirement per LP
4: 푡ℎ푟푒푎푑푆푖푧푒 ← 푔푒푡푇 ℎ푟푒푎푑푆푖푧푒() ⊳ computes the appropriate thread size based on LP dimension
5: 푏푎푡푐ℎ푆푖푧푒 = 푔푝푢푀푒푚 ÷ 푙푝푆푖푧푒
6: if 푁 > 푏푎푡푐ℎ푆푖푧푒 then
7: 푡표푡퐵푎푡푐ℎ = 푐푒푖푙(푁 ÷ 푏푎푡푐ℎ푆푖푧푒)
8: for 푖 = {0, ..., (푡표푡퐵푎푡푐ℎ − 1)} do
9: 푠푡푎푟푡 = 푖 ∗ 푏푎푡푐ℎ푆푖푧푒
10: if 푖 == (푡표푡퐵푎푡푐ℎ − 1) then
11: 푒푛푑 = 푁 − 1
12: else
13: 푒푛푑 = 푠푡푎푟푡 + 푏푎푡푐ℎ푆푖푧푒 − 1
14: end if
15: 푑푒푣퐿푃 ← copy 푙푖푠푡퐿푃 from index (푠푡푎푟푡 to 푒푛푑)
16: 푏푎푡푐ℎ퐾푒푟푛푒푙 <<< 푏푎푡푐ℎ푆푖푧푒, 푡ℎ푟푒푎푑푆푖푧푒 >>> (푑푒푣퐿푃 ,푅)
17: end for
18: else
19: 푑푒푣퐿푃 ← copy 푙푖푠푡퐿푃 from index (1 to푁)
20: 푏푎푡푐ℎ퐾푒푟푛푒푙 <<< 푏푎푡푐ℎ푆푖푧푒, 푡ℎ푟푒푎푑푆푖푧푒 >>> (푑푒푣퐿푃 ,푅)
21: end if
22: end procedure
Amit Gurung ET AL 9
Load Balancing
We assign a CUDA block of threads to solve an LP in the batch. Since blocks are scheduled to Streaming Multiprocessors
(SMs), this ensures that all SMs are busy when there are sufficiently large number of LPs to be solved in the batch. As CUDA
blocks execute asynchronously, such a task division emulates solving many LPs independently in parallel. Moreover, each block
is made to consist of 푗 (≥ 푞) threads, which is a multiple of 32, as threads in GPU are scheduled and executed as warps. The
block of threads is utilized in manipulating the simplex tableaux in parallel, introducing another level of parallelism. In Figure
3 , we show a block diagram of the logical threads in CUDA, of our GPU kernel.
FIGURE 3 Visualization of how threads are mapped to solve푁 LPs in GPU. Each block is mapped to an LP and 푗 threads are
assigned to parallelize a single LP.
.
5.2 Implementation of the Simplex Algorithm
Finding the pivot column in Step 1 of the simplex algorithm above requires to determine the index of the maximum value from
the last row of the tableau. We parallelize Step 1 by utilizing 푛 (out of 푗) threads in parallel to determine the pivot column using
parallel reduction described in (40). A parallel reduction is a technique applied to achieve data parallelism in GPU when a
single result (e.g. min, max) is to be computed from an array of data. We have implemented a parallel reduction by using two
auxiliary arrays, one for storing the data and the other for storing the array indices of the corresponding data. The result of a
parallel reduction provides us the maximum value in the data array and its corresponding index in the indices array.
We also apply parallel reduction in Step 2 by utilizing 푚 (out of 푗) threads in parallel to determine the pivot row (푚 being
the row-size of the simplex tableau). It involves finding a minimum positive value from an array of ratios (as described in
Step 2 above) and therefore ratios which are not positive needs to be excluded from the minimum computation. This leads to
a conditional statement in the parallel reduction algorithm and degrades the performance due to warp divergence. Even if we
10 Amit Gurung ET AL
re-size the array to store only the positive values, the kernel still contains conditional statements to check the threads that need
to process this smaller size array. To overcome performance degradation with conditional statements, we substituted a large
positive number in place of ratios that are negative or undefined. This creates an array that is suitable for parallel reduction in
our kernel implementation.
Data parallelism is also employed in the pivot operation in Step 3, involving substitution and re-writing, using the (푚 − 1)
threads (out of 푗 threads in the block).
5.3 Coalescent Memory Access
In this section, we discuss our efforts on keeping a coalescent access to global memory to reduce performance loss due to cache
misses. When threads in a warp access contiguous locations in the memory, the access is said to be coalescent. A coalescent
memory access results in performance benefits due to an increased cache hit rate.
As discussed earlier, we use global memory to store the simplex tableaux of the LPs in a batch as described in Section 5.1
(Since the global memory in a GPU is of the maximum size in the memory hierarchy, it can accommodate many tableaux). We
store the simplex tableau in memory as a two-dimensional array. High level languages like C and C++ use the row-major order
by default for representing a two-dimensional array in the memory. CUDA is an extension to C/C++ and also use the row-major
order. The choice of row or column major order representation of two-dimensional arrays plays an important role in deciding
the efficiency of the implementation, depending on whether the threads in a warp access the adjacent rows or adjacent columns
of the array and what is the offset between the consecutive rows and columns.
We use the term column-operation, when elements of all rows from a specific column are accesses simultaneously by each
thread in a warp. If the array is in a row-major order, then this operation is not a coalesced memory access, as each thread access
elements from the memory separated by the size equal to the column-width of the array. When elements of a specific row are
accessed simultaneously by threads of a warp, we called this a row-operation. Note that for a two dimensional array stored in
row-major order, a row-operation is coalesced since each thread access data from contiguous locations in the memory.
We now show that in the simplex algorithm described above, there are more column-operations than row operations and thus,
storing our data (i.e. simplex tableau) in a column-major order ensure more coalesced memory access in comparison to having
a row-major storage.
Step 1 of the simplex algorithm determines the entering variable (also known as the pivot column), which requires finding
the index of the maximum positive coefficient from the last row. This requires a row-operation and as mentioned in Section 5.2,
we use parallel reduction using two auxiliary arrays, Data and Indices as shown in Figure 4 . Although accessing from the last
row of the simplex tableau is not coalesced (due to our column-major ordering) but copying into the Data (and Indices) array is
Amit Gurung ET AL 11
coalesced and so is the parallel reduction algorithm on the Data (and Indices) array. We use the technique of Parallel Reduction:
Sequential Addressing in (40), a technique that ensures coalesced memory access.
Index b x1 x2 . . . xn xn + 1 xn + 2 . . . xn + m a1 a2 . . . as
Index of
basic
variables
Bounds of
the
constraints
unused Optimal
Solution 
Coefficients of non-basic variable in objective function
(used to determine entering variable)
 
MAX MAX
  
0 1 2 3 4 5 6 . . . . . . n-1
    threadIdx.x
Data (parallel copy)
Indices
FIGURE 4 Simplex Tableau along with two separate arrays,Data to store the coefficients of the objective function and Indices
to keep track of the indices of the corresponding values in the Data array.
Step 2 of the simplex algorithm determines the leaving variable (also called the pivot row) by computing the row index with
the minimum positive ratio (푏푖∕ − 푎푖,푒), as described in Section 4.1. This requires two column-operations involving the access
to all elements from columns 푏 and 푥푒 as shown in Figure 5 . To compute the row index with the minimum positive ratio, we
use parallel reduction as described above in Section 5.2. Our tableau being stored in a column-major order, access to columns
푏 and 푥푒 are both coalesced. The ratio and its corresponding indices (represented by the thread ID) are stored in the auxiliary
arrays, Data and Indices which is also coalesced. Like in Step 1, we use the same technique of Parallel Reduction: Sequential
Addressing in (40) for coalesced memory access.
Step 3 performs the pivot operation that updates the elements of the simplex tableau and is the most expensive of the three
steps. It first involves a non-coalescent row-operation which computes the new modified pivot row (denoted by the index 퓁) as
{푁푒푤푃 푖푣표푡푅표푤퓁 = 푂푙푑푃 푖푣표푡푅표푤퓁 ÷ 푃퐸}, where PE is the element in cell at the intersection of the pivot row and the pivot
column for that iteration, known as the pivot element. The modified row (푁푒푤푃 푖푣표푡푅표푤퓁) is then substituted to update each
element of all the rows of the simplex tableau, using the formula 푁푒푤푅표푤푖푗 = 푂푙푑푅표푤푖푗 − 푃 푖푣표푡퐶표푙푖푒 ∗ 푁푒푤푃 푖푣표푡푅표푤퓁푗
(see code Listing 1). The elements of the pivot column are first stored in an array named 푃 푖푣표푡퐶표푙 which is a column-operation,
and so is coalesced, due to the column-major representation of the tableau. The crucial operation is updating each 푗푡ℎ element
for every 푖푡ℎ row (except the pivot row 퓁) of the simplex tableau, which requires a nested for-loop operation. We parallelize the
12 Amit Gurung ET AL
Index b x1 x2 . . . . . . xe . . . . . . . . . xn+m+s
1 0
i bi aie
7 7
unused Optimal
Solution 
Coefficients of non-basic variable in objective function
(used to determine entering variable)
 
.
.
.
th
re
ad
Id
x.
x
bi/-aie i
0
1
.
.
.
MAX
MAX
MAX n-1
Data Indices
FIGURE 5 Simplex Tableau along with two separate arrays, Data to store the positive ratio and Indices to keep track of the
indices of the corresponding values in the Data array. Ratios that reduces to negative or undefined are replaced by a large value
denoted by MAX.
outer for-loop that maps the rows of the simplex tableau. Our data being represented in a column-major order, parallel access to
all rows for each element in the 푗푡ℎ column of the inner for-loop is coalesced.
Listing 1: Code fragment for step 3 that updates the simplex tableau.
f o r ( i n t i =0; i <rows ; i ++) { / / P a r a l l e l i z e d o u t e r loop t o map each i w i t h t h e t h r e ad
→ ID
f o r ( i n t j =0; j <c o l s ; j ++) {
NewRow[ i ] [ j ] = OldRow [ i ] [ j ] − P i vo tCo l [ i ] ∗ NewPivotRow [ l ] [ j ] ; / / l : p i v o t row
→ i n d e x
}
}
To verify the performance gain due to coalesced memory access, we experiment with Step 3 which is the most expensive
of the three steps in the simplex algorithm, by modifying it to have non-coalesced memory accesses. In the code Listing 1, we
interchange the inner for-loop with the outer loop (loop interchange, a common technique to improve cache performance (41)).
This loop interchanging converts the Step 3 to have non-coalesced memory access since our simplex tableau is represented in a
column-major order. Table 2 presents the experimental results to show the gain in performance when the accesses to memory
Amit Gurung ET AL 13
are coalesced, as compared to having non-coalesced memory accesses. The results show a significant gain in performance on a
Tesla K40c GPU, for LPs with the initial basic solution as feasible.
LP Dim Batch-size Speed-up
10 1000 0.193 0.016 12.06
50 1000 0.286 0.033 8.67
100 1000 0.947 0.105 9.02
200 1000 4.739 0.397 11.94
300 1000 14.482 0.921 15.72
400 1000 30.320 2.109 14.38
500 1000 43.416 2.844 15.27
Non-coalesced 
Access Time 
(seconds)
Coalesced 
Access Time 
(seconds)
TABLE 2 Time taken to solve batched LP due to coalesced and non-coalesced memory access on a GPU, for LPs with the
initial basic solution as feasible.
We observe that Step 1 has a row-operation, Step 2 has two column-operations and Step 3 has a row and a column operation
each with a nested for-loop which can be expressed both in row as well as column operations. We see that there are more column
operations than row operations.
5.4 Overlapping data transfer with kernel operations using CUDA Streams
The memory bandwidth of host-device data copy is a major bottleneck in CUDA applications. We use nvprof (42) to profile
time for memory transfer and kernel operation of our implementation discussed above in Section 5. The result of profiling in a
Tesla K40c GPU, for LPs with an initial basic solution as feasible, is reported in Table 3 . We observe that, for a small batch
size (e.g. 10 in the Table 3 ), the memory copy operation takes a maximum of 5.75% of the execution time, whereas for larger
batch size, the memory copy operation takes 10% to 15% of the execution time.
A standard technique to improve the performance in CPU-GPU heterogeneous applications is by using CUDA streams. CUDA
streams allow overlappingmemory copy operationwith kernel execution. A stream in CUDAconsists of a sequence of operations
executed on the device in the order in which they are issued by the host procedure. These operations can not only be executed
in an interleaved manner, but also be executed concurrently in order to gain performance (43).
A GPU in general, has a separate kernel and a copy engine. All kernel operations are executed using the kernel engine
and memory copy operations to and from the device are performed by the copy engine. However, some GPUs have two copy
engines, one each for copying data from host to device and from device to host, for better performance. Figure 6 illustrates the
overlapping of kernel executions with memory copy operation, when the GPU has one kernel and one copy engine. Streaming
14 Amit Gurung ET AL
Time %
Kernel
MemCpy
Total %
H2D D2H
10 10 98.79 0.63 0.59 100
10 90000 93.93 6.06 0.01 100
50 10 99.22 0.71 0.07 100
50 90000 84.74 15.26 0.00 100
200 10 98.40 1.59 0.01 100
200 9000 84.93 15.07 0.00 100
500 10 94.25 5.75 0.00 100
500 900 86.04 13.96 0.00 100
LP 
Dimension
Batch-
size
TABLE 3 Profiling report obtained using nvprof tool for LPs with an initial basic solution as feasible. H2D - stands for host
to device and D2H indicates device to host memory copies respectively.
FIGURE 6 Performance gain due to overlapping kernel execution with data transfer compared to sequential data transfer and
kernel execution. The time required for host-to-device (H2D), device-to-host (D2H) and kernel execution are assumed to be the
same.
by batching similar operations causes more overlap of copies with kernel executions, as depicted in the figure. Adding all host-
to-device copy to the CUDA streams followed by all kernel launches and device-to-host data copies, can result in a significant
overlap of memory copy operations with kernel executions, resulting in a performance gain. When there are two copy engines,
looping the operations in the order of a host-to-device copy followed by kernel launch and device-to-host copy, for all streams
may result in a better performance than the former method. For GPUs with compute capability 3.5 and above, both the methods
result in the same performance due to the Hyper-Q (44) feature.
Amit Gurung ET AL 15
Although a large number of CUDA streams achieves more concurrency and interleaving among operations, it incurs stream
creation overhead. The number of CUDA streams that gives optimal performance is found by experimentation. From our exper-
imental observations, we conclude that with varying the batch size and the LP dimension, the optimal number of streams also
varies. In this paper, we report results with 10 streams for batch size more than 100 LPs. We use a single stream when the batch
size is less than 100 (for LPs of any dimension).
5.5 Limitations of the Implementation
The memory required for an LP (i.e., a tableau) in our implementation can be approximately computed as:
푌 = {(푚 + 1) × 푐표푙푠 × 푑푎푡푎푆푖푧푒 + 푥} (5)
푐표푙푠 = (푣푎푟 + 푠푙푎푐푘 + 푎푟푡푖 + 2)
푑푎푡푎푆푖푧푒 = 푠푖푧푒표푓 (퐷푇푦푝푒)
푥 = 2 × (푐표푙푠 × 푑푎푡푎푆푖푧푒)
where (푚 + 1) and 푐표푙푠 are the sizes of rows and columns of the simplex tableau respectively, 푣푎푟 is the number of variables
(dimension of the LP problem), 푠푙푎푐푘 is the number of slack variables and 푎푟푡푖 is the number of artificial variables in the given
LP. The size of each LP is 푌 bytes, where 퐷푇푦푝푒 is the data type of the LP coefficients and 푥 is the size of array used for
performing parallel reduction operation, the number 2 in the equation 푥 = 2×(푐표푙푠×푑푎푡푎푆푖푧푒) signify the use of two auxiliary
arrays. The size of the 푐표푙푠 is described in Subsection 4.1. Thus, if 푆 is the size of total global memory (in bytes) available in
the GPU, then our threshold limit or the number of LPs that can be solved at a time is determined by the equation푁 = ⌊푆
푌
⌋. As
the current limit on threads per block is 1024 in CUDA, our implementation limits the size of LPs having initial basic solution
as feasible to 511× 511. The size limit for LPs having initial basic solution as infeasible is 340× 340. This limit is derived from
(6). We intend to address this limitation in future work.
(푣푎푟 + 푠푙푎푐푘 + 푎푟푡푖 + 2) ≤ 1024 (6)
16 Amit Gurung ET AL
5.6 Solving a Special Case of LP
The feasible region of an LP given by its constraints defines a convex polytope. We observe that when the feasible region is a
hyper-rectangle, which is a special case of a convex polytope, the LP can be solved cheaply. Equation (7) shows that maximizing
the objective function is the sum of the results on 푛 dot products.
maximize
푥∈ (퓁.푥) =
푛∑
푖=1
퓁푖.ℎ푖,where ℎ푖 =
⎧⎪⎪⎨⎪⎪⎩
푎푖 if 퓁푖 < 0
푏푖 otherwise
(7)
where 퓁 ∈ ℝ푛 is the sampling directions over the given hyper-rectangle  = {푥 ∈ ℝ푛|푥 ∈ [푎1, 푏1] × ... × [푎푛, 푏푛]}.
An implementation for solving this special case of LPs is incorporated in our solver. In order to solve many LPs in parallel,
we organize CUDA threads in a one-dimensional block of threads with each block used to solve an LP. Each block is made to
consist of only 32 threads, the warp size. Within each block, we used only a single thread to perform the operations of the kernel.
A preliminary introduction to this technique is introduced in the paper (13).
6 EVALUATION
We evaluate our solver on two set of LPs. The first is a set of randomly constructed LPs. The LPs in this set are constructed by
randomly selecting the coefficients of the constraint matrix 퐴 from a range of [1...1000], the bounds of the constraints 푏 from a
range of [1...1000] and the coefficients of the objective functions 푐 from the range of [1...500] respectively. The second set of LPs
are selected from the Netlib repository. On both the set of LPs, we evaluate the performance of our batched solver on varying
batch sizes. In the text that follows, we refer to our solver on a GPU as BLPG, abbreviating Batched LP solver on a GPU. The
solver using CUDA streams is referred as BLPG-SM. We perform our experiment in Intel Xeon E5-2670 v3 CPU, 2.30 GHz,
12 Core (without hyper-threading), 62 GB RAM with Nvidia’s Tesla K40c GPU. The reported performance is an average over
10 runs. A performance comparison of our solver to GLPK is shown in Figure 7 , for the first set of randomly generated LPs
of various sizes and various batch sizes. We observe a maximum speed-up of 16× for LPs having the initial basic solution as
feasible, for a batch of 20k LPs of dimension 100. For the same type of LPs, we see a maximum speed-up of 18×, for a batch
of 50푘 LPs of dimension 100, using BLPG-SM. We observe that for LPs of large size, BLPG performs better even with a few
LPs in parallel (e.g., batch size=50 for a 500 dimensional LP). However, for small size LPs, BLPG out-performs GLPK only
for larger batch sizes (e.g. a batch size of 100 for a 5 dimensional LP).
Table 4 shows a performance comparison of BLPG with GLPK on LPs having the initial basic solution as infeasible. In-
spite of the fact that BLPG executes the kernel twice due to the two-phase simplex algorithm as discussed in Section 4 (an extra
Amit Gurung ET AL 17
Batch-size
5 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.0000 0.0000 0.0000 0.00 0.00
50 0.0010 0.0000 0.0002 0.00 5.00
100 0.0012 0.0001 0.0011 12.00 1.09
500 0.0094 0.0010 0.0021 9.40 4.48
1000 0.0163 0.0030 0.0033 5.43 4.94
1500 0.0310 0.0090 0.0040 3.44 7.75
2000 0.0410 0.0050 0.0060 8.20 6.83
5000 0.1040 0.0130 0.0140 8.00 7.43
10000 0.1660 0.0250 0.0250 6.64 6.64
20000 0.3170 0.0490 0.0480 6.47 6.60
50000 0.7750 0.1220 0.1200 6.35 6.46
100000 1.6630 0.2420 0.2390 6.87 6.96
(a) 5-Dimension
Batch-size
28 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.0000 0.0010 0.0034 0.00 0.00
50 0.0080 0.0020 0.0020 4.00 4.00
100 0.0180 0.0020 0.0030 9.00 6.00
500 0.0860 0.0270 0.0280 3.19 3.07
1000 0.1350 0.0160 0.0160 8.44 8.44
1500 0.2170 0.0250 0.0270 8.68 8.04
2000 0.2870 0.0320 0.0310 8.97 9.26
5000 0.6850 0.0770 0.0740 8.90 9.26
10000 1.3210 0.1540 0.1450 8.58 9.11
20000 2.5420 0.3070 0.3010 8.28 8.45
50000 6.4190 0.7580 0.7210 8.47 8.90
100000 13.1350 1.5170 1.4460 8.66 9.08
(b) 28-Dimension
Batch-size
50 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.001 0.002 0.002 0.50 0.50
50 0.025 0.004 0.004 6.25 6.25
100 0.051 0.005 0.005 10.20 10.20
500 0.196 0.016 0.016 12.25 12.25
1000 0.377 0.033 0.029 11.42 13.00
1500 0.539 0.045 0.043 11.98 12.53
2000 0.711 0.059 0.055 12.05 12.93
5000 1.797 0.148 0.135 12.14 13.31
10000 3.462 0.293 0.269 11.82 12.87
20000 6.914 0.588 0.534 11.76 12.95
50000 17.359 1.444 1.329 12.02 13.06
(c) 50-Dimension
Batch-size
100 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.002 0.003 0.004 0.67 0.50
50 0.099 0.013 0.040 7.62 2.48
100 0.178 0.018 0.018 9.89 9.89
500 0.681 0.070 0.050 9.73 13.62
1000 1.628 0.105 0.095 15.50 17.14
1500 2.400 0.153 0.153 15.69 15.69
2000 3.283 0.200 0.184 16.42 17.84
5000 7.900 0.486 0.435 16.26 18.16
10000 15.695 0.956 0.860 16.42 18.25
20000 31.283 1.904 1.714 16.43 18.25
50000 78.280 4.778 4.277 16.38 18.30
(d) 100-Dimension
Batch-size
300 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.015 0.025 0.025 0.60 0.60
50 0.677 0.074 0.074 9.15 9.17
100 1.337 0.124 0.122 10.78 10.96
500 6.621 0.472 0.424 14.03 15.62
1000 13.410 0.921 0.832 14.56 16.12
1500 20.373 1.567 1.226 13.00 16.62
2000 25.834 2.335 1.921 11.06 13.45
(e) 300-Dimension
Batch-size
500 Dimension
Time (Sec) Speed-up w.r.t. GLPK
GLPK BLPG BLPG-SM Vs BLPG Vs BLPG-SM
1 0.045 0.074 0.046 0.61 0.98
50 2.133 0.234 0.172 9.12 12.40
100 4.242 0.339 0.274 12.51 15.48
500 21.395 1.480 1.309 14.46 16.34
1000 43.084 2.844 2.572 15.15 16.75
1500 62.819 4.295 3.846 14.63 16.33
2000 75.288 5.730 5.143 13.14 14.64
(f) 500-Dimension
FIGURE 7 Time taken to compute a batch of LPs of dimensions 5, 28, 50, 100, 300 and 500 respectively, for the type of LPs
with initial basic solution as feasible.
overhead of data exchange between the two kernels), we observe a better performance. We gain a maximum speed-up of nearly
12×, for a batch of 10푘 LPs of dimension 200.
On profiling BLPG-SM, we observe that for small sized LPs, the processing time of the kernel is much larger than the data
transfer time, as shown in Table 3 . As a result, the gain in performance due to overlapping data transfer with kernel execution
is negligible. This is evident from the results in Figures 7 a,7 b and 7 c. As the LP size increases, the volume of data transfer
also significantly increases. Hence, the operation of data transfer for all the streams (except the first) can be overlapped while the
18 Amit Gurung ET AL
Batch-size
5 Dimension 28 Dimension 50 Dimension 100 Dimension 200 Dimension
Time(Sec)
Speed-up
Time(Sec)
Speed-up
Time(Sec)
Speed-up
Time(Sec)
Speed-up
Time(Sec)
Speed-up
GLPK BLPG GLPK BLPG GLPK BLPG GLPK BLPG GLPK BLPG
50 0.001 0.001 1.00 0.004 0.001 4.00 0.014 0.004 3.50 0.056 0.009 6.22 0.195 0.021 9.29
100 0.001 0.001 1.00 0.009 0.002 4.50 0.028 0.005 5.60 0.113 0.012 9.42 0.371 0.035 10.60
500 0.009 0.002 4.50 0.049 0.007 7.00 0.140 0.017 8.24 0.499 0.046 10.85 1.786 0.154 11.60
1000 0.017 0.004 4.25 0.093 0.012 7.75 0.260 0.032 8.13 0.975 0.093 10.48 3.540 0.295 12.00
1500 0.030 0.006 5.00 0.135 0.018 7.50 0.381 0.047 8.11 1.465 0.130 11.27 5.303 0.454 11.68
2000 0.043 0.008 5.38 0.184 0.025 7.36 0.504 0.063 8.00 1.942 0.176 11.03 7.092 0.603 11.76
5000 0.089 0.018 4.94 0.410 0.059 6.95 1.224 0.153 8.00 4.812 0.428 11.24 17.638 1.639 10.76
10000 0.163 0.036 4.53 0.783 0.111 7.05 2.454 0.303 8.10 9.689 0.859 11.28 35.498 2.969 11.96
20000 0.333 0.069 4.83 1.575 0.223 7.06 4.899 0.609 8.04 19.225 1.872 10.27 70.805 6.110 11.59
TABLE 4 Performance comparison between GLPK and BLPG for LPs with initial basic solution as infeasible
first kernel is in execution, thereby saving the time for data transfer in the rest of the streams. This results in 2−3% performance
gain for LPs of large sizes, as evident from Figure 7 d, 7 e and 7 f.
Batch-size - -> 1 10 100 1000 5000 10000 50000 100000
Sl No Benchmarks Rows Cols Gflop/s
1 ADLITTLE 71 97 0.09 0.92 7.51 8.27 8.76 8.81 8.84 8.84
2 AFIRO 35 32 0.10 0.98 7.64 10.01 11.08 11.19 11.26 11.27
3 BLEND 117 83 0.15 1.34 6.17 9.90 10.40 10.47 10.49 10.48
4 ISRAEL 174 142 0.57 4.46 12.46 15.35 15.58 15.61 15.62 15.61
5 SC105 150 103 0.48 3.73 13.87 15.48 15.66 15.72 15.73 15.74
6 SC205 296 203 0.95 7.18 13.67 15.11 15.25 16.94 16.94 16.93
7 SC50A 70 48 0.22 2.20 13.14 14.69 15.43 15.52 15.55 15.56
8 SC50B 70 48 0.22 2.17 13.04 14.52 15.31 15.40 15.47 15.47
TABLE 5 Performance of batched LP solver on a GPU
Table 6 shows a comparative evaluation of our solver to CPLEX and GLPK, on a set of selected LPs from the Netlib
repository. The experimental platform is a 4 Core Intel Xeon CPU E5-1607 v4, 3.10 GHz, 63 GB RAM with Nvidia’s Tesla
K40m GPU. The LP benchmarks in the Netlib repository are present in MPS format. We use the MATLAB’s built-in function
mpsread to read the benchmarks and then convert them into the standard form denoted by Equations 1 and 2. The converted
sizes of the benchmarks is shown in Table 5 with column heading “Rows” indicating the number of converted constraints
and “Cols” indicating the size of the benchmark. The table also shows the number of floating point operations per second (in
Giga flops), giving an estimation of the floating point computations in the Simplex algorithm and the utilization of GPU by our
proposed batched LP solver. We use the visual profiler (nvvp) available in the CUDA Toolkit (45). The batched LP solver gives
a maximum of 16.93 Gflops/s for a batch size of 100퐾 LPs on a Nvidia Tesla K40m GPU that has a theoretical peak of 1.43
Tflops/s, for double precision arithmetic.
Amit Gurung ET AL 19
Batch-size - -> 1 5 10 100 500 1000 5000 7000 10000 50000 70000 100000
Sl No Benchmarks Solver Time (seconds)
1 ADLITTLE 0.004 0.021 0.041 0.412 2.073 4.160 20.349 30.598 44.606 206.700 290.597 417.344
2 AFIRO 0.003 0.015 0.032 0.302 1.615 3.218 15.751 22.699 33.064 170.798 221.479 317.248
3 BLEND CPLEX could not read the input file (SIF format) available in the Netlib repository
4 ISRAEL 0.006 0.030 0.058 0.590 2.967 5.914 29.581 41.956 59.328 296.644 415.485 593.782
5 SC105 0.003 0.020 0.033 0.330 1.777 3.512 17.852 24.637 35.336 179.058 250.131 355.747
6 SC205 0.004 0.024 0.049 0.451 2.371 4.638 25.262 32.681 52.666 264.870 373.853 533.821
7 SC50A 0.003 0.016 0.033 0.334 1.641 3.124 16.825 24.141 32.390 164.437 231.550 331.341
8 SC50B 0.003 0.016 0.033 0.320 1.942 3.223 16.214 24.086 33.883 162.142 228.364 329.694
1 ADLITTLE
GLPK
0.003 0.012 0.019 0.131 0.632 1.244 6.234 8.692 12.410 62.161 87.436 124.501
2 AFIRO 0.000 0.001 0.002 0.013 0.054 0.092 0.482 0.680 1.008 4.795 6.675 9.634
3 BLEND 0.004 0.007 0.016 0.154 0.727 1.443 7.216 10.094 14.383 72.137 101.153 143.507
4 ISRAEL 0.006 0.026 0.047 0.430 2.152 4.331 21.545 30.169 42.484 212.950 299.988 427.886
5 SC105 0.004 0.014 0.022 0.168 0.805 1.594 7.981 11.194 15.975 79.423 111.737 159.471
6 SC205 0.007 0.029 0.056 0.603 2.786 6.077 30.505 42.704 60.662 301.120 421.265 595.468
7 SC50A 0.001 0.005 0.004 0.045 0.190 0.388 1.903 2.654 3.810 18.971 26.231 37.958
8 SC50B 0.001 0.005 0.004 0.048 0.197 0.406 2.003 2.791 4.040 19.907 27.861 39.531
1 ADLITTLE
BLPG
0.071 0.069 0.069 0.096 0.453 0.821 3.870 5.383 7.664 37.896 53.067 75.716
2 AFIRO 0.003 0.003 0.003 0.008 0.022 0.043 0.180 0.253 0.344 1.727 2.375 3.332
3 BLEND 0.073 0.073 0.080 0.174 0.543 1.072 5.097 7.116 10.136 50.579 72.430 103.297
4 ISRAEL 0.065 0.083 0.090 0.339 1.386 2.728 13.385 18.539 26.557 136.794 191.440 273.770
5 SC105 0.020 0.021 0.028 0.073 0.321 0.620 3.043 4.254 6.067 32.067 44.799 64.041
6 SC205 0.096 0.123 0.130 0.677 3.026 6.034 29.876 41.849 49.362 246.488 345.092 492.852
7 SC50A 0.005 0.005 0.005 0.009 0.035 0.069 0.331 0.462 0.655 3.262 4.564 6.529
8 SC50B 0.006 0.006 0.006 0.011 0.042 0.085 0.403 0.562 0.800 3.980 5.569 7.956
IBM 
CPLEX
(a) Execution time of IBM CPLEX, GLPK and the proposed batched GPU implementation using double precision
Batch-size - -> 1 5 10 100 500 1000 5000 7000 10000 50000 70000 100000
Sl No Benchmarks Speed-up
1 ADLITTLE 0.06 0.31 0.60 4.29 4.58 5.07 5.26 5.68 5.82 5.45 5.48 5.51
2 AFIRO 0.91 4.93 10.76 37.76 73.43 74.84 87.51 89.72 96.12 98.90 93.25 95.21
3 BLEND CPLEX could not read the input file (SIF format) available in the Netlib repository
4 ISRAEL 0.09 0.36 0.64 1.74 2.14 2.17 2.21 2.26 2.23 2.17 2.17 2.17
5 SC105 0.16 0.94 1.18 4.52 5.53 5.66 5.87 5.79 5.82 5.58 5.58 5.55
6 SC205 0.04 0.19 0.38 0.67 0.78 0.77 0.85 0.78 1.07 1.07 1.08 1.08
7 SC50A 0.64 3.10 6.56 37.16 46.88 45.28 50.83 52.25 49.45 50.41 50.73 50.75
8 SC50B 0.51 2.71 5.53 29.06 46.23 37.92 40.23 42.86 42.35 40.74 41.01 41.44
1 ADLITTLE
GLPK
0.04 0.17 0.28 1.36 1.40 1.52 1.61 1.61 1.62 1.64 1.65 1.64
2 AFIRO 0.00 0.33 0.67 1.63 2.45 2.14 2.68 2.69 2.93 2.78 2.81 2.89
3 BLEND 0.05 0.10 0.20 0.89 1.34 1.35 1.42 1.42 1.42 1.43 1.40 1.39
4 ISRAEL 0.09 0.31 0.52 1.27 1.55 1.59 1.61 1.63 1.60 1.56 1.57 1.56
5 SC105 0.20 0.67 0.79 2.30 2.51 2.57 2.62 2.63 2.63 2.48 2.49 2.49
6 SC205 0.07 0.24 0.43 0.89 0.92 1.01 1.02 1.02 1.23 1.22 1.22 1.21
7 SC50A 0.20 1.00 0.80 5.00 5.43 5.62 5.75 5.74 5.82 5.82 5.75 5.81
8 SC50B 0.17 0.83 0.67 4.36 4.69 4.78 4.97 4.97 5.05 5.00 5.00 4.97
BLPG 
Vs
IBM 
CPLEX
(b) Performance speed-up
TABLE 6 Performance evaluation on selected benchmarks from Netlib
Due to the LP size limitation of BLPG discussed earlier in Section 5.5, we choose benchmarks that satisfy this limitation. Table
6 a shows the execution time (in seconds) on the Netlib benchmarks by the three solvers and Table 6 b displays the performance
comparison of the two LP solvers with that of BLPG. Note that Netlib benchmarks are highly sparse in nature and LP solvers
such as IBM CPLEX and GLPK are optimized for sparse LPs. Our implementation is the original Simplex method proposed by
Dantzig and does not have any optimization for sparse LPs. We observe that in some benchmarks such as SC205 in Table 6
(and some others, not included in the table) CPLEX performs better than GLPK, whereas in other benchmark instances, GLPK
20 Amit Gurung ET AL
outperforms CPLEX. Our proposed BLPG achieves a maximum of 95× and nearly 6× speed-up on the Netlib benchmarks w.r.t.
CPLEX and GLPK respectively.
7 PERFORMANCE ONMOTIVATIONAL APPLICATION
In this section, we demonstrate the performance enhancement of state-space exploration of models of control systems, using our
batched LP solver. As discussed in Section 3, the state-space exploration routines in the state of the art tools requires solving
a large number of LPs. We consider two benchmarks, the Helicopter controller and a Five dimensional dynamical system for
its state-space computation using the tools SPACEEX-LGG and XSPEED. The Helicopter controller benchmark is a model of
a twin-engined multi-purpose military helicopter with 8 continuous variable modeling the motion and 20 controller variables
that governs the various controlling actions of the helicopter (46, 12). The Five dimensional dynamical system benchmark is a
model of a five dimensional linear continuous system as defined in (47). We direct the reader to the paper (47) for details on the
dynamics of the model.
In order to show the impact of our solver BLPG, we evaluate the performance of the tools by solving the resulting LPs
generated from the state-space exploration routines on these benchmarks. The LPs are sequentially solved using GLPK and in
parallel using BLPG. The performance comparison is shown in Table 7 , in an experimental setup of Intel Q9950 CPU, 2.84
Ghz, 4 Core (no hyper-threading), 8 GB RAM with a GeForce GTX 670 GPU. We observe a maximum speed-up of 12× and
9× in the tools performance with parallel solving of the LPs in BLPG w.r.t. sequential solving in GLPK. When compared to the
tool SPACEEX-LGG, we observe a maximum of 54× and 39× speed-up in XSpeed using our solver. Note that the LPs generated
by the tool on these benchmarks have the property that their feasible region is an hyper-rectangle. Therefore, BLPG solves these
using the technique mentioned in Section 5.6.
Benchmark
Time (in Secs) Speed-up w.r.t. XSpeed (GPU)
SpaceEx Vs SpaceEx
20010 0.133 0.345 0.018 7.4 19.2
100050 0.717 1.399 0.060 12.0 23.3
1000500 6.695 24.171 0.576 11.6 42.0
2001000 13.128 59.996 1.121 11.7 53.5
56056 1.400 4.399 0.172 8.1 25.6
1569568 39.089 123.794 4.246 9.2 29.2
2002000 50.367 187.825 5.397 9.3 34.8
3003000 75.087 311.652 8.055 9.3 38.7
Nos. of 
LPs XSpeed 
(Seq)
XSpeed 
(GPU)
Vs XSpeed 
(Seq)
Five 
Dimensional
System
Helicopter 
Controller
TABLE 7 Performance Speed-up in XSpeed using Hyperbox LP Solver
Amit Gurung ET AL 21
8 CONCLUSION
Solving a linear program on a GPU for an accelerated performance on a CPU-GPU heterogeneous platform has been extensively
studied. To the best of our knowledge, all such work report a performance gain only on linear programs of large size. We present
a solver implemented in CUDA that can accelerate applications having to solve small to medium size LPs, but a large number of
them. Our solver batches the LPs in an application and solves them in parallel on a GPU using the simplex algorithm. We report
significant performance gain on benchmarks in comparison to solving them in CPU using GLPK and CPLEX solvers. We show
the utility of our solver in an application of state-space exploration of models of control systems which involves solving many
small to medium size LPs, by showing significant performance improvement.
ACKNOWLEDGEMENTS
This work was supported by the National Institute of TechnologyMeghalaya, India and by the the DST-SERB, GoI under project
grant No. YSS/2014/000623. The authors thank Santibrata Parida, for his help in experimental evaluations.
References
[1] Birk Matthias, Dapp Robin, Ruiter Nicole V, Becker Jürgen. GPU-based iterative transmission reconstruction in 3D ultrasound computer tomography.
Journal of Parallel and Distributed Computing. 2014;74(1):1730–1743.
[2] Yan Guorui, Tian Jie, Zhu Shouping, Dai Yakang, Qin Chenghu. Fast cone-beam CT image reconstruction using GPU hardware. Journal of X-Ray Science
and Technology. 2008;16(4):225–234.
[3] Michalakes John, Vachharajani Manish. GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letter. 2008;:1–18.
[4] Zhang Jian, McQuillan Ian, Wu FangXiang. Speed Improvements of Peptide - Spectrum Matching Using SIMD Instructions. IEEE International
Conference on Bioinformatics and Biomedicine Workshops (BIBMW). 2010;:1–22.
[5] Clark Michael A, La Plante PC, Greenhill Lincoln J. Accelerating radio astronomy cross-correlation with graphics processing units. International Journal
of High Performance Computing Applications. 2012;:1094342012444794.
[6] Karimi Kamran, Dickson Neil, Hamze Firas. High-Performance physics simulations using multi-core CPUs and GPGPUs in a volunteer computing
context. International Journal of High Performance Computing Applications. 2011;25(1):61–69.
[7] Lim Rone Kwei, Pro J William, Begley Matthew R, Utz Marcel, Petzold Linda R. High-performance simulation of fracture in idealized âĂŸbrick and
mortarâĂŹcomposites using adaptive Monte Carlo minimization on the GPU. International Journal of High Performance Computing Applications.
2015;:1094342015593395.
[8] Shirahata Koichi, Sato Hitoshi, Suzumura Toyotaro, Matsuoka Satoshi. A GPU Implementation of generalized graph processing algorithm GIM-V.
Proceedings - 2012 IEEE International Conference on Cluster Computing Workshops, Cluster Workshops 2012. 2012;:207–212.
[9] Chandra Jeya, Schall Susan O. Economic justification of flexible manufacturing systems using the Leontief input-output model. The Engineering
Economist. 1988;34(1):27–50.
[10] Munkres James. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and AppliedMathematics. 1957;5(1):32–
38.
[11] Drozdowski Maciej, Lawenda Marcin, Guinand Frédéric. Scheduling multiple divisible loads. International Journal of High Performance Computing
Applications. 2006;20(1):19–30.
22 Amit Gurung ET AL
[12] Frehse Goran, Le Guernic Colas, Donzé Alexandre, et al. SpaceEx: Scalable Verification of Hybrid Systems. In: Ganesh Gopalakrishnan Shaz Qadeer,
ed. Proc. 23rd International Conference on Computer Aided Verification (CAV), LNCSSpringer; 2011.
[13] Ray Rajarshi, Gurung Amit, Das Binayak, Bartocci Ezio, Bogomolov Sergiy, Grosu Radu. XSpeed: Accelerating Reachability Analysis on Multi-core
Processors. In: Piterman Nir, ed. Hardware and Software: Verification and Testing - 11th International Haifa Verification Conference, HVC 2015, Haifa,
Israel, November 17-19, 2015, Proceedings, Lecture Notes in Computer Science, vol. 9434: :3–18Springer; 2015.
[14] Lalami Mohamed Esseghir, El-Baz Didier, Boyer Vincent. Multi GPU implementation of the simplex algorithm. In: ; 2011.
[15] Bieling Jakob, Peschlow Patrick, Martini Peter. An efficient GPU implementation of the revised simplex method. In: :1–8IEEE; 2010.
[16] Spampinato Daniele G., Elster Anne C.. Linear optimization on modern GPUs. In: :1–8IEEE; 2009.
[17] Ploskas Nikolaos, Samaras Nikolaos. Efficient GPU-based implementations of simplex type algorithms. Applied Mathematics and Computation.
2015;250:552–570.
[18] Meyer Xavier, Albuquerque Paul, Chopard Bastien. A multi-GPU implementation and performance model for the standard simplex method. In: :312–319;
2011.
[19] Lalami Mohamed Esseghir, Boyer Vincent, El-Baz Didier. Efficient implementation of the simplex method on a CPU-GPU system. In: :1999–2006IEEE;
2011.
[20] CPLEX IBM ILOG. V12. 1: UserâĂŹs Manual for CPLEX. International Business Machines Corporation. 2009;46(53):157.
[21] Makhorin Andrew. GNU Linear Programming Kit, v.4.37. http://www.gnu.org/software/glpk; 2009.
[22] Dantzig George B., Thapa Mukund N.. Linear Programming 1: Introduction. Springer; 1997.
[23] Haidar Azzam, Dong Tingxing, Luszczek Piotr, Tomov Stanimire, Dongarra Jack. Batched matrix computations on hardware accelerators based on GPUs.
The International Journal of High Performance Computing Applications. 2015;29(2):193–208.
[24] Abdelfattah Ahmad, Haidar Azzam, Tomov Stanimire, Dongarra Jack. Performance, design, and autotuning of batched GEMM for GPUs. In:
:21–38Springer; 2016.
[25] Jhurani Chetan, Mullowney Paul. A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices. Journal of Parallel and
Distributed Computing. 2015;75:133–140.
[26] Anderson Michael J, Sheffield David, Keutzer Kurt. A predictive model for solving small linear algebra problems in gpu registers. In: :2–13IEEE; 2012.
[27] Villa Oreste, Gawande Nitin, Tumeo Antonino. Accelerating subsurface transport simulation on heterogeneous clusters. In: :1–8IEEE; 2013.
[28] Wainwright Ian, Sweden High Performance Consulting. Optimized LU-decomposition with full pivot for small batched matrices. In: ; 2013.
[29] Whaley R Clint. ATLAS (Automatically Tuned Linear Algebra Software). In: Springer 2011 (pp. 95–101).
[30] Geijn Robert, Goto Kazushige. BLAS (Basic Linear Algebra Subprograms). In: Springer 2011 (pp. 157–164).
[31] Ploskas Nikolaos, Samaras Nikolaos. GPU accelerated pivoting rules for the simplex algorithm. Journal of Systems and Software. 2014;96:1–9.
[32] Dongarra J, Grosse E, others . Netlib repository. 2006.
[33] Li Jianming, Lv Renping, Hu Xiangpei, Jiang Zhongqiang. A GPU-based parallel algorithm for large scale linear programming problem. In: Springer
2011 (pp. 37–46).
[34] Alur Rajeev, Courcoubetis Costas, Henzinger Thomas A, Ho Pei-Hsin. Hybrid automata: An algorithmic approach to the specification and verification of
hybrid systems. In: Springer 1993 (pp. 209–229).
[35] Girard Antoine, Le Guernic Colas. Efficient Reachability Analysis for Linear Systems using Support Functions. In: ; 2008.
[36] Tomlin John A. On scaling linear programming problems. Computational practice in mathematical programming. 1975;:146–166.
[37] Larsson T. On scaling linear programs–Some experimental results. Optimization. 1993;27(4):355–373.
[38] Elble Joseph M, Sahinidis Nikolaos V. Scaling linear optimization problems prior to application of the simplex method. Computational Optimization and
Applications. 2012;52(2):345–371.
[39] Ploskas Nikolaos, Samaras Nikolaos. A computational comparison of scaling techniques for linear optimization problems on a graphical processing unit.
International Journal of Computer Mathematics. 2015;92(2):319–336.
Amit Gurung ET AL 23
[40] Harris Mark. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology. 2007;.
[41] Hennessy John L, Patterson David A. Computer architecture: a quantitative approach. Elsevier; 2011.
[42] Nvidia . CUDA C Programming Guide. 2015.
[43] Harris Mark. How to Overlap Data Transfers in CUDA C/C++ .
[44] Bradley Thomas. Hyper-Q Example. 2012.
[45] Nvidia . CUDA Toolkit Documentation v8.0 (2017). Nvidia Corporation. 2017;.
[46] Skogestad Sigurd, Postlethwaite Ian.Multivariable Feedback Control: Analysis and Design. John Wiley & Sons; 2005.
[47] Girard Antoine. Reachability of Uncertain Linear Systems Using Zonotopes. In: Morari Manfred, Thiele Lothar, eds. HSCC, Lecture Notes in Computer
Science, vol. 3414: :291-305Springer; 2005.
AUTHOR BIOGRAPHY
Amit Gurung is a Ph.D. student at the National Institute of Technology Meghalaya, India. He received
an MCA (Master in Computer Applications) degree in 2010 and B.Sc. in Computer Science in 1999 from
North Eastern Hill University, Shillong, India. His research includes High Performance Computing, Parallel
Programming and Reachability Analysis of Hybrid Systems.
Rajarshi Ray is an Assistant Professor at the department of Computer Science and Engineering, National
Institute of Technology Meghalaya. He received his B.Sc. in Computer Science from University of Pune in
2005 andM.Sc. in Computer Science from the ChennaiMathematical Institute in 2007. He received his Ph.D.
in Computer Science from Verimag, University of Grenoble in 2012. His research interests are in formal
methods in systems verification, cyber-physical systems and approximate computing.
24 Amit Gurung ET AL
