Linear Programs (LPs) appear in a large number of applications. Offloading the LP solving tasks to a GPU is viable to accelerate an application's performance. Existing work on offloading and solving an LP on a GPU shows that performance can be accelerated only for large LPs (typically 500 constraints, 500 variables and above). This paper is motivated from applications having to solve small LPs but many of them. Existing techniques fail to accelerate such applications using GPU. We propose a batched LP solver in CUDA to accelerate such applications and demonstrate its utility in a use case -state-space exploration of models of control systems design. A performance comparison of The batched LP solver against sequential solving in CPU using the open source solver GLPK (GNU Linear Programming Kit) and the CPLEX solver from IBM is also shown. The evaluation on selected LP benchmarks from the Netlib repository displays a maximum speed-up of 95× and 5× with respect to CPLEX and GLPK solver respectively, for a batch of 1e5 LPs.
INTRODUCTION
A linear program (LP) is an optimization problem with a linear cost function subject to a search space defined by a conjunction of linear constraints. The size of a LP is measured by the number of decision variables and the number of linear constraints that it contains. LPs appear in a variety of applications. This work is motivated from applications that require solving a large number of LPs of small size (less than 500 variables and 500 constraints). Traditional CPU computations are now increasingly being carried out on CPU-GPU heterogeneous systems, by offloading data parallel tasks to a GPU for accelerating performance. We propose a hybrid CPU-GPU LP solver which can solve a batch of many LPs simultaneously. Our work assumes the setting that computations begin in a CPU where LPs are created, batched and then offloaded to a GPU for an accelerated solution. The solutions are transferred back to the CPU ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. ICPE '19, April 7-11, 2019, Mumbai, India © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6239-9/19/04. . . $15.00 https://doi.org/10.1145/3297663.3310308 from the GPU for further processing. As an example, we show an application of our batched LP solver in an algorithm for state-space exploration of models of control systems. In model-based design, state-space exploration is a standard analysis technique. State of the art methods and tools for state-space exploration heavily rely on solving many independent LPs [8, 24] . Moreover, the LPs are generally of small size. We show that using our batched LP solver, the state-space exploration tools can improve their performance significantly. Prior work on solving a LP on a GPU and on multi-GPU architectures are many [3, 16, 17, 23, 26] . The focus of all such works has been on methods to improve the performance of algorithms to solve one LP. Performance gain is reported generally when offloading large LPs of size 500 (500 constraints, 500 variables) and above [3, 17, 26] . It is seen that for small LPs, the time spent in offloading the problems from CPU to GPU memory is more than the time saved with parallel execution in the GPU. Although modern LP solvers like CPLEX [5] and GLPK [19] are very efficient in solving small LPs, solving many of them one by one may consume considerable time. Note that using any of the prior work to solve a LP on GPU will not accelerate our target applications since they perform well only on large LPs. We show that with batched computation, performance acceleration can be achieved even for small LPs (e.g. LPs of size 5) for a considerably large batch size, where batch size refers to the number of LPs in a batch. We present a CUDA C++ implementation of a solver which implements the simplex method [6] , with an effort to keep coalescent memory accesses, efficient CPU-GPU memory transfer and an effective load balancing. To the best of our knowledge, this is the first work in the direction of batched LP solving on a GPU. The solver source can be found at https://bitbucket.org/rajgurung777/simplexprojects. Beyond a sufficiently large batch size, our implementation shows significant gain in performance compared to solving them sequentially in the CPU using the GLPK library [19] , an open source LP solver and the CPLEX solver from IBM. The evaluation on selected LP benchmarks from the Netlib repository displays a maximum speed-up of 95× and 5× with respect to CPLEX and GLPK solver respectively, for a batch of 1e5 LPs. In addition, we consider a special class of LPs with feasible region as an hyper-rectangle and exploit the fact that these can be solved cheaply without using the simplex algorithm. We implement this special case LP solver as part of the solver.
MOTIVATING APPLICATION
In model-based design of control systems, a standard technique of analysis is to compute the state-space of the model using exploration algorithms. Properties of the control system such as safety and stability are analyzed by observing the computed state-space. In this section, we consider two open-source tools that perform state-space exploration of control systems with linear dynamics, namely SpaceEx [8] and XSpeed [24] . These tools can analyze systems modeled using a mathematical formalism known as hybrid automaton [2] . A conservative over-approximation of the exact state space is computed by both the tools. The state of the art state-space exploration algorithm in these tools compute the state-space as a union of convex sets, each having a symbolic representation, known as the support function representation [10] . The algorithm requires a conversion of these convex sets from its symbolic support function representation to concrete convex polytope representation, in order to have certain operations efficient. This conversion involves solving a number of linear programs. Moreover, the precision of the conversion and consequently, the precision of the computed state-space depends on the number of LPs solved. Table 1 shows the number of LPs and its dimension that these tools solve for a fairly accurate state-space computation over a time horizon of just 100 seconds, on some standard control systems benchmarks. We see that the number of LPs to be solved in the above examples is in the order of 1e9 which cannot be solved in practical time limits even by the fast modern LP solvers like GLPK or CPLEX, when solved sequentially. For instance, to compute the state-space of a Filtered Oscillator model using the tool XSpeed, it requires solving 7.2e7 LPs [11, 24] . Note that although a solver like CPLEX take approximately 0.003 seconds to solve an LP of dimension 32 in a modern CPU, it will take nearly 60 hours to solve 7.2e7 LPs sequentially. Therefore, we believe that there is a need to accelerate applications where such bulk LP solving is necessary.
LINEAR PROGRAMMING
A linear program in standard form is maximizing an objective function under the given set of linear constraints. The objective function is denoted as n j=1 c j x j and the set of linear constraints is given by n j=1 a i j x j ≤ b i and x j ≥ 0, for i = 1, 2, ...,m and j = 1, 2, ...,n. The inequality n j=1 a i j x j ≤ b i is the set of m constraints over n variables and x j ≥ 0 is the non-negativity constraints over n variables. An LP in standard form can be converted into slack form by introducing m additional slack variables (x n+i ), one for each inequality constraint, to convert it into an equality constraint, as shown below:
An algorithm that solves LP problems efficiently in practice is the simplex method described in [6, 15] . The variables on the left-hand side of the Equation (1) are referred as basic variables and those on the right-hand side are non-basic variables. The initial basic solution of an LP is obtained by assigning its non-basic variables to zero. The initial basic solution may not be always feasible (when one or more of the b i 's are negative, resulting in the violation of the non-negativity constraint). For such LPs, the simplex method employs a two-phase algorithm. In the first phase, a new auxiliary LP is formed by having a new objective function z, which is the sum of the newly introduced artificial variables. The simplex algorithm is employed on this auxiliary LP and it is checked if the optimal solution to the objective function is zero. If a zero optimal is found then it implies that the original LP has a feasible solution and the simplex method initiates the second phase. In the second phase, the feasible slack form obtained from the first phase is considered and the original objective function is restored with appropriate substitutions and elimination of the artificial variables. The simplex algorithm is then employed to solve the LP.
Prior to the simplex method, many LP solvers apply preconditioning techniques such as a simple geometric mean scaling in combination with equilibration to reduce the condition number of the constraint matrix in order to decrease the computational effort for solving an LP [7, 18, 22, 27] . In this work, we do not apply any pre-conditioning on the LP for simplicity and use the simplex algorithm described in the following section.
The Simplex Algorithm
The simplex algorithm is an iterative process of solving a LP. Each iteration of the simplex algorithm attempts to increase the value of the objective function by replacing one of the basic variables (also known as the leaving variable), by a non-basic variable (called the entering variable). The exchange of these two variables is obtained by a pivot operation [1] . The index of the leaving and the entering variables are called the pivot row and pivot column respectively. The simplex algorithm iterates on a tabular representation of the LP, called the simplex tableau. The simplex tableau stores the coefficients of the non-basic, slack and artificial variables in its rows. It contains auxiliary columns for storing intermediate computations. In our implementation, we consider a tableau of size p × q, where p = m + 1 and q = n + sum of slack and artificial variables + 2. The (m + 1)th row stores the best solution to the objective function found until the last iteration, along with the coefficients of the non-basic variables in the objective function. 
Index of basic variables

Bounds of the constraints
Coefficients of non-basic variables
Coefficients of slack variables
Coefficients of artificial variables unused
Optimal Solution
Coefficients of non-basic variable in objective function (used to determine entering variable) Figure 1 : Formation of the Simplex Tableau.
There are two auxiliary columns, the first column stores the index of the basic variables and the second stores b i 's of equation (1) . Figure 1 shows a schematic of the simplex tableau.
Step 1) Determine the entering variable: At each iteration, the algorithm identifies a new entering variable from the non-basic variables. It is called an entering variable since it enters the set of basic variables. The choice of the entering variable is with the goal that increasing its value from 0 increases the objective function value. The index of the entering variable is referred to as the pivot column. The most common rule for selecting an entering variable is by choosing the index e of the maximum in the last row of the simplex tableau (excluding the current optimal solution).
Step 2) Determine the leaving variable: Once the pivot column is determined (say e), the algorithm identifies the row index with the minimum positive ratio (b i / − a i,e ), say ℓ, called the pivot row. The variable x ℓ is called the leaving variable because it leaves the set of basic variables. This ratio represents the extend to which the entering variable x e (in Step 1) can be increased without violating the constraints.
Step 3) Obtain the new improved value of the objective function: The algorithm then performs the pivot operation which updates the simplex tableau such that the new set of basic variables are expressed as a linear combination of the non-basic ones, using substitution and rewriting. An improved value for the objective function is found after the pivot operation.
The above steps are iterated until the halt condition is reached. The halt condition is met when either the LP is found to be unbounded or the optimal solution is found. An LP is unbounded when no new leaving variable can be computed, i.e., when the ratio
Step 2 is either negative or undefined for all i. An optimal solution is obtained when no new entering variable can be found, i.e., the coefficients of the non-basic variables in the last row of the tableau are all negative values
SIMULTANEOUS SOLVING OF BATCHED LPS ON A GPU
We present our CUDA implementation that solves batched LPs in parallel on a GPU. In this discussion, we shall refer a CPU by host and a GPU by device. The LP batching is performed on the host and transferred to the device. Our solver implementation assumes that all the LPs in a batch are of the same size. The batch size is adjustable, depending on the device memory size and LP size. Our batching routine considers the maximum batch size that can be accommodated in the device memory.
Memory Transfer and Load Balancing
First, we allocate device memory (global memory) from the host, that is required for creating a simplex tableau for the LPs in the batch. The maximum number of LPs that can be batched depends on the size of the device global memory in the device. The tableau for every LP in the batch is populated with all the coefficients and indices of the variables in the host side, before transferring to the device. To speed-up populating the tableau in the host, we initialize the tableau in parallel using OpenMP threads. Once initialized, the Simplex tableaux are copied from the host to the device memory (referred to as H2D-ST in Figure 3 ). The LP batching routine is shown in Algorithm 1. Lines 2 to 5 computes the basic operations such as obtaining the available size of global memory in GPU, memory requirement of an LP, number of threads for an LP and the number of LPs in a batch. Line 7 determines the number batches required to execute the GPU kernel (line 15). Lines 8 to 14 shows the computation of appropriate indices from the data-structure listLP, so that the device memory devLP can load appropriate batches of LP. The GPU kernel modifies the tableau to obtain solutions using the simplex method for every LP in the batch and copies back from the device to the host memory (referred to as D2H-res in Figure 3 ).
We discuss further on our CPU-GPU memory transfer using CUDA streams for efficiency in Section 4.4.
Algorithm if N > batchSize then 7: batches = ceil (N ÷ batchSize) 8 :
devLP ← copy listLP from index (start to end) 15: batchKernel (batchSize,threadSize,devLP,R) 16 :
devLP ← copy listLP from index (1 to N ) 18: batchKernel (batchSize,threadSize,devLP,R)
Load Balancing. We assign a CUDA block of threads to solve an LP in the batch. Since blocks are scheduled to Streaming Multiprocessors (SMs), this ensures that all SMs are busy when there are sufficiently large number of LPs to be solved in the batch. As CUDA blocks execute asynchronously, such a task division emulates solving many LPs independently in parallel. Moreover, each block is made to consist of j (≥ q) threads, which is a multiple of 32, as threads in GPU are scheduled and executed as warps. The block of threads is utilized in manipulating the simplex tableaux in parallel, introducing another level of parallelism.
Implementation of the Simplex Algorithm
Finding the pivot column in Step 1 of the simplex algorithm above requires to determine the index of the maximum value from the last row of the tableau. We parallelize Step 1 by utilizing n (out of j) threads in parallel to determine the pivot column using parallel reduction described in [13] . A parallel reduction is a technique applied to achieve data parallelism in GPU when a single result (e.g. min, max) is to be computed from an array of data. We have implemented a parallel reduction by using two auxiliary arrays, one for storing the data and the other for storing the array indices of the corresponding data. The result of a parallel reduction provides us the maximum value in the data array and its corresponding index in the indices array.
We also apply parallel reduction in Step 2 by utilizing m (out of j) threads in parallel to determine the pivot row (m being the rowsize of the simplex tableau). It involves finding a minimum positive value from an array of ratios (as described in Step 2 above) and therefore ratios which are not positive needs to be excluded from the minimum computation. This leads to a conditional statement in the parallel reduction algorithm and degrades the performance due to warp divergence. Even if we re-size the array to store only the positive values, the kernel still contains conditional statements to check the threads that need to process this smaller size array. To overcome performance degradation with conditional statements, we substituted a large positive number in place of ratios that are negative or undefined. This creates an array that is suitable for parallel reduction in our kernel implementation.
Data parallelism is also employed in the pivot operation in Step 3, involving substitution and re-writing, using the (m − 1) threads (out of j threads in the block).
Coalescent Memory Access
In this section, we discuss our efforts on keeping a coalescent access to global memory to reduce performance loss due to cache misses. When threads in a warp access contiguous locations in the memory, the access is said to be coalescent. A coalescent memory access results in performance benefits due to an increased cache hit rate.
As discussed earlier, we use global memory to store the simplex tableaux of the LPs in a batch as described in Section 4.1 (Since the global memory in a GPU is of the maximum size in the memory hierarchy, it can accommodate many tableaux). We store the simplex tableau in memory as a two-dimensional array. High level languages like C and C++ use the row-major order by default for representing a two-dimensional array in the memory. CUDA is an extension to C/C++ and also use the row-major order. The choice of row or column major order representation of two-dimensional arrays plays an important role in deciding the efficiency of the implementation, depending on whether the threads in a warp access the adjacent rows or adjacent columns of the array and what is the offset between the consecutive rows and columns.
We use the term column-operation, when elements of all rows from a specific column are accesses simultaneously by each thread in a warp. If the array is in a row-major order, then this operation is not a coalesced memory access, as each thread access elements from the memory separated by the size equal to the column-width of the array. When elements of a specific row are accessed simultaneously by threads of a warp, we called this a row-operation. Note that for a two dimensional array stored in row-major order, a row-operation is coalesced since each thread access data from contiguous locations in the memory.
We now show that in the simplex algorithm described above, there are more column-operations than row operations and thus, storing our data (i.e. simplex tableau) in a column-major order ensure more coalesced memory access in comparison to having a row-major storage.
Step 1 of the simplex algorithm determines the entering variable (also known as the pivot column), which requires finding the index of the maximum positive coefficient from the last row. This requires a row-operation and as mentioned in Section 4.2, we use parallel reduction using two auxiliary arrays, Data and Indices. Although accessing from the last row of the simplex tableau is not coalesced (due to our column-major ordering) but copying into the Data (and Indices) array is coalesced and so is the parallel reduction algorithm on the Data (and Indices) array. We use the technique of Parallel Reduction: Sequential Addressing in [13] , a technique that ensures coalesced memory access.
Step 2 of the simplex algorithm determines the leaving variable (also called the pivot row) by computing the row index with the minimum positive ratio (b i / − a i,e ), as described in Section 3.1. This requires two column-operations involving the access to all elements from columns b and x e as shown in Figure 2 . To compute the row index with the minimum positive ratio, we use parallel reduction as described above in Section 4.2. Our tableau being stored in a column-major order, access to columns b and x e are both coalesced. The ratio and its corresponding indices (represented by the thread ID) are stored in the auxiliary arrays, Data and Indices which is also coalesced. Like in Step 1, we use the same technique of Parallel Reduction: Sequential Addressing in [13] for coalesced memory access.
Step 3 performs the pivot operation that updates the elements of the simplex tableau and is the most expensive of the three steps. It first involves a non-coalescent row-operation which computes the new modified pivot row (denoted by the index ℓ) as {N ewPivotRow ℓ = OldPivotRow ℓ ÷ PE}, where PE is the element in cell at the intersection of the pivot row and the pivot column for that iteration, known as the pivot element. The modified row (N ewPivotRow ℓ ) is then substituted to update each element of all the rows of the simplex tableau, using the formula N ewRow i j = OldRow i j −PivotCol ie * N ewPivotRow ℓj . The elements of the pivot column are first stored in an array named PivotCol which is a column-operation, and so is coalesced, due to the column-major representation of the tableau. The crucial operation is updating each j t h element for every i t h row (except the pivot row ℓ) of the simplex tableau, which requires a nested for-loop operation. We parallelize the outer for-loop that maps the rows of the simplex tableau. Our data being represented in a column-major order, parallel access to all rows for each element in the j t h column of the inner for-loop is coalesced.
Data Indices
To verify the performance gain due to coalesced memory access, we experiment with Step 3 which is the most expensive of the three steps in the simplex algorithm, by modifying it to have noncoalesced memory accesses. We interchange the inner for-loop with the outer loop (loop interchange, a common technique to improve cache performance [14] ). This loop interchanging converts the Step 3 to have non-coalesced memory access since our simplex tableau is represented in a column-major order. Table 2 presents the experimental results to show the gain in performance when the accesses to memory are coalesced, as compared to having noncoalesced memory accesses. The results show a significant gain in performance on a Tesla K40c GPU, for LPs with the initial basic solution as feasible. We observe that Step 1 has a row-operation, Step 2 has two column-operations and Step 3 has a row and a column operation each with a nested for-loop which can be expressed both in row as well as column operations. We see that there are more column operations than row operations.
LP Dim Batch-size
Overlapping data transfer with kernel operations using CUDA Streams
The memory bandwidth of host-device data copy is a major bottleneck in CUDA applications. We use nvprof [20] to profile time for memory transfer and kernel operation of our implementation discussed above in Section 4. The result of profiling in a Tesla K40c GPU, for LPs with an initial basic solution as feasible, is reported in Table 3 . We observe that, for a small batch size (e.g. 10 in the Table  3 ), the memory copy operation takes a maximum of 5.75% of the execution time, whereas for larger batch size (rows in gray colour), the memory copy operation takes 6% to 15% of the execution time. A standard technique to improve the performance in CPU-GPU heterogeneous applications is by using CUDA streams. CUDA streams allow overlapping memory copy operation with kernel execution. A stream in CUDA consists of a sequence of operations executed on the device in the order in which they are issued by the host procedure. These operations can not only be executed in an interleaved manner, but also be executed concurrently in order to gain performance [12] . A GPU in general, has a separate kernel and a copy engine. All kernel operations are executed using the kernel engine and memory copy operations to and from the device are performed by the copy engine. However, some GPUs have two copy engines, one each for copying data from host to device and from device to host, for better performance. Figure 3 illustrates the overlapping of kernel executions with memory copy operation, when the GPU has one kernel and one copy engine. Streaming by batching similar operations causes more overlap of copies with kernel executions, as depicted in the figure. Adding all host-to-device copy to the CUDA streams followed by all kernel launches and device-to-host data copies, can result in a significant overlap of memory copy operations with kernel executions, resulting in a performance gain. When there are two copy engines, looping the operations in the order of a host-to-device copy followed by kernel launch and deviceto-host copy, for all streams may result in a better performance than the former method. For GPUs with compute capability 3.5 and above, both the methods result in the same performance due to the Hyper-Q [4] feature.
Although a large number of CUDA streams achieves more concurrency and interleaving among operations, it incurs stream creation overhead. The number of CUDA streams that gives optimal performance is found by experimentation. From our experimental observations, we conclude that with varying the batch size and the LP dimension, the optimal number of streams also varies. In this paper, we report results with 10 streams for batch size more than 100 LPs. We use a single stream when the batch size is less than 100 (for LPs of any dimension).
Limitations of the Implementation
As the current limit on threads per block is 1024 in CUDA, our implementation limits the size of LPs having initial basic solution as feasible to 511 × 511. The size limit for LPs having initial basic solution as infeasible is 340 × 340. This limit is derived from (2):
where var is the number of variables (dimension of the LP problem), slack is the number of slack variables and arti is the number of artificial variables in the given LP. The number 2 indicates the use of two auxillary columns described in Section 3.1.
Solving a Special Case of LP
The feasible region of an LP given by its constraints defines a convex polytope. We observe that when the feasible region is a hyper-rectangle, which is a special case of a convex polytope, the LP can be solved cheaply. Equation (3) shows that maximizing the objective function is the sum of the dot products of n terms.
where ℓ ∈ R n is the sampling directions over the given hyper-
An implementation for solving this special case of LPs is incorporated in our solver. In order to solve many LPs in parallel, we organize CUDA threads in a one-dimensional block of threads with each block used to solve an LP. Each block is made to consist of only 32 threads, the warp size. Within each block, we used only a single thread to perform the operations of the kernel. A preliminary introduction to this technique is introduced in the paper [24] .
EVALUATION
We evaluate our solver on two set of LPs. The first is a set of randomly constructed LPs. The LPs in this set are constructed by randomly selecting the coefficients of the constraint matrix A from a range of [1...1000], the bounds of the constraints b from a range of [1...1000] and the coefficients of the objective functions c from the range of [1...500] respectively. The second set of LPs are selected from the Netlib repository. On both the set of LPs, we evaluate the performance of our batched solver on varying batch sizes. In the text that follows, we refer to our solver on a GPU as BLPG, abbreviating Batched LP solver on a GPU. The solver using CUDA streams is referred as BLPG-SM. We perform our experiment in Intel Xeon E5-2670 v3 CPU, 2.30 GHz, 12 Core (without hyper-threading), 62 GB RAM with Nvidia's Tesla K40c GPU. The reported performance is an average over 10 runs. A performance comparison of our solver to GLPK is shown in Figure 4 , for the first set of randomly generated LPs of various sizes and various batch sizes. We observe a maximum speed-up of 16× for LPs having the initial basic solution as feasible, for a batch of 2K LPs of dimension 100. For the same type of LPs, we see a maximum speed-up of 18×, for a batch of 5K LPs of dimension 100, using BLPG-SM. We observe that for LPs of large size, BLPG performs better even with a few LPs in parallel (e.g., batch size=50 for a 500 dimensional LP). However, for small size LPs, BLPG out-performs GLPK only for larger batch sizes (e.g. a batch size of 100 for a 5 dimensional LP). Figure 5 shows a performance comparison of BLPG with GLPK on LPs having the initial basic solution as infeasible. In spite of the fact that BLPG executes the kernel twice due to the two-phase simplex algorithm as discussed in Section 3 (an extra overhead of data exchange between the two kernels), we observe a better performance. We gain a maximum speed-up of nearly 12×, for a batch of 10K LPs of dimension 200.
On profiling BLPG-SM, we observe that for small sized LPs, the processing time of the kernel is much larger than the data transfer time. As a result, the gain in performance due to overlapping data transfer with kernel execution is negligible. This is evident from the results in Figure 4a . As the LP size increases, the volume of data transfer also significantly increases. Hence, the operation of data transfer for all the streams (except the first) can be overlapped while the first kernel is in execution, thereby saving the time for data transfer in the rest of the streams. This results in performance gain up to 2% to 3% for LPs of large sizes, as evident from Figures  4b and 4c . Table 5 shows a comparative evaluation of our solver to CPLEX and GLPK, on a set of selected LPs from the Netlib repository. The experimental platform is a 4 Core Intel Xeon CPU E5-1607 v4, 3.10 GHz, 63 GB RAM with Nvidia's Tesla K40m GPU. The LP benchmarks in the Netlib repository are present in MPS format. We use the MATLAB's built-in function mpsread to read the benchmarks and then convert them into the standard form. The converted sizes of the benchmarks is shown in Table 4 with column heading "Rows" indicating the number of converted constraints and "Cols" indicating the size of the benchmark. The table also shows the number of floating point operations per second (in Giga flops), giving an estimation of the floating point computations in the Simplex algorithm and the utilization of GPU by our proposed batched LP solver. We use the visual profiler (nvvp) available in the CUDA Toolkit [21] . The batched LP solver gives a maximum of 16.93 Gflops/s for a batch size of 100K LPs on a Nvidia Tesla K40m GPU that has a theoretical peak of 1.43 Tflops/s, for double precision arithmetic.
Due to the LP size limitation of BLPG discussed earlier in Section 4.5, we choose benchmarks that satisfy this limitation. Table 5 shows the performance comparison of the two LP solvers with that of BLPG, on the Netlib benchmarks. Note that Netlib benchmarks are highly sparse in nature and LP solvers such as IBM CPLEX and GLPK are optimized for sparse LPs. Our implementation is the original Simplex method proposed by Dantzig and does not have any optimization for sparse LPs. We observe that in some benchmarks such as SC205 in Table 5 (and some others, not included in the table) CPLEX performs better than GLPK, whereas in other benchmark instances, GLPK outperforms CPLEX. Our proposed BLPG achieves a maximum of 95× and nearly 6× speed-up on the Netlib benchmarks w.r.t. CPLEX and GLPK respectively.
Motivational Application
In this section, we demonstrate the performance enhancement of state-space exploration of models of control systems, using our batched LP solver. As discussed in Section 2, the state-space exploration routines in the state of the art tools requires solving a large number of LPs. We consider two benchmarks, the Helicopter controller and a Five dimensional dynamical system for its statespace computation using the tools SpaceEx-LGG and XSpeed. The Helicopter controller benchmark is a model of a twin-engined multipurpose military helicopter with 8 continuous variable modeling the motion and 20 controller variables that governs the various controlling actions of the helicopter [8, 25] . The Five dimensional dynamical system benchmark is a model of a five dimensional linear continuous system as defined in [9] . We direct the reader to the paper [9] for details on the dynamics of the model.
In order to show the impact of our solver BLPG, we evaluate the performance of the tools by solving the resulting LPs generated from the state-space exploration routines on these benchmarks. The LPs are sequentially solved using GLPK and in parallel using BLPG. The performance comparison is shown in Table 6 , in an experimental setup of Intel Q9950 CPU, 2.84 Ghz, 4 Core (no hyperthreading), 8 GB RAM with a GeForce GTX 670 GPU. We observe a maximum speed-up of 12× and 9× in the tools performance with parallel solving of the LPs in BLPG w.r.t. sequential solving in GLPK. When compared to the tool SpaceEx-LGG, we observe a maximum of 54× and 39× speed-up in XSpeed using our solver. Note that the LPs generated by the tool on these benchmarks have the property that their feasible region is an hyper-rectangle. Therefore, BLPG solves these using the technique mentioned in Section 4.6. 
Benchmark
Future Work
As the current limit on threads per block is 1024 in CUDA, our implementation limits the size of LPs having initial basic solution as feasible to 511 × 511 and 340 × 340 for LPs having initial basic solution as infeasible. We intend to address this limitation in future work. Another limitation of the solver is the LPs in the batch have to be the same size. Although this limitation is not a concern for our motivational application in particular, but the solver will be useful in more general applications without this limitation.
CONCLUSION
Solving a linear program on a GPU for an accelerated performance on a CPU-GPU heterogeneous platform has been extensively studied. To the best of our knowledge, all such work report a performance gain only on linear programs of large size. We present a solver implemented in CUDA that can accelerate applications having to solve small to medium size LPs, but a large number of them. Our solver batches the LPs in an application and solves them in parallel on a GPU using the simplex algorithm. We report significant performance gain on benchmarks in comparison to solving them in CPU using GLPK and CPLEX solvers. We show the utility of our solver in an application of state-space exploration of models of control systems which involves solving many small to medium size LPs, by showing significant performance improvement.
