This research aims to accelerate the computation module in max-plus algebra using CUDA technology on graphics processing units (GPUs) designed for high-performance computing. Our target is the Kleene star of a weighted adjacency matrix for directed acyclic graphs (DAGs). Using a inexpensive GPU card for our experiments, we obtained more than a 16-fold speedup compared with an Athlon 64 X2.
Introduction
This research aims to accelerate the computation of the Kleene star [1] of weighted adjacency matrices in max-plus algebra [1] , [2] , using computers equipped with high performance graphics processing units (GPUs). We implement a program using CUDA (compute unified device architecture) technology [3] , which is available on recent NVIDIA GPUs.
The Kleene star plays an essential role in max-plus algebra approaches to scheduling problems for repetitive discrete event systems (DESs). To be precise, the governing equation in max-plus algebra, referred to as the state equation, includes the Kleene star in the transition matrix [4] .
Hereafter, we focus on DESs whose behavior can be described by a directed acyclic graph (DAG). Let the number of nodes and arcs in the system be n and m, respectively. If we compute the Kleene star based on the most efficient algorithm known thus far, the time complexity is O(n · (n + m)) [4] , [5] . On the other hand, the state equation includes other addition and multiplication operations, the worst time complexity of which is O(n 2 ). Thus, the bottleneck in computing the state equation lies in the Kleene star.
In the field of high performance computing, on the other hand, much attention has been paid to the concept of general-purpose computing on graphics processing units (GPGPU). In particular, recent GPU cards produced by NVIDIA Corporation provide substantial benefits for parallel computation, and the company itself supplies an easyto-implement environment for developers and researchers. Recently, the effectiveness and advantages of using GPUs for technical computations have been widely reported [6] - [8] .
In view of this, we aim to accelerate the computation of the Kleene star in max-plus algebra by implementing code for CUDA GPUs. Then, we measure the effective speedup using both single and multiple GPUs.
Target Algorithm
First, we introduce the specific notations and operation rules in max-plus algebra. Denoting the real field by R, we define a field R max = R ∪ {−∞}. Then, for x, y ∈ R max , we define operators and unit elements:
, and e (= 0).
For the unit matrices, ε is a matrix whose elements are all ε, while e is a matrix with diagonal elements set to e and off-diagonal elements to ε. Operator ⊗ has higher precedence than ⊕.
Let X ∈ R n×n max be a DAG weighted adjacency matrix. Our target algorithm is the computation of:
where [X] i j = {w i j : if there is an arc j → i, else ε}, and w i j is the weight of arc j → i. If we denote the number of nodes by n, there is an instance r that satisfies X ⊗(r−1) ε and X ⊗r = ε (1 ≤ r ≤ n). It is known that [X * ] i j gives the maximum value of the cumulative weights for paths from node j to node i.
Amongst the most efficient algorithms for computing the Kleene star in terms of time complexity, the method in [5] is attractive, since the work matrix can be partitioned into arbitrary column major blocks and each block can be processed independently. The essential procedures and time complexities are given below.
• Topological sort, O(m+n): sort the nodes in topological order based on a depth first search (DFS) algorithm [9] by inspecting the elements of X.
• Initialization, O(n 2 ): prepare and initialize a work matrix W ∈ R n×n max .
• Update, O(m · n): update the work matrix according to
for all succeeding nodes i of source nodel, wherel represents the original node number of sequence l in the topologically sorted graph. Then, repeat this for all l (1 ≤ l ≤ n − 1) in ascending order.
On completion of these procedures, the values in the resulting matrix are given by the elements in W. We note here
Copyright c 2011 The Institute of Electronics, Information and Communication Engineers that the third process corresponds to an elementary transformation in conventional algebra.
CUDA Architecture
The basic structure of a CUDA GPU is depicted in Fig. 1 . We suppose here that only a single GPU card is installed on the target PC. In CUDA terminology, the PC and GPU card are called the host and device, respectively. On the device side, there can be either a single or multiple processing units, referred to as the streaming multi-processor (SM). Each SM has eight scalar processors (SPs), a 16 KB shared memory, registers, and two types of caches. The number of SPs in an SM is not always eight; in recent high-end GPUs there are 32 SPs. The 16 KB shared memory is shared between the SPs and has small latency. Computational programs for the SPs are referred to as kernels, with each SP in charge of a task identified by a thread. In the video unit, depicted in the lower part of the figure, there are three types of memories: global, constant, and texture memories, which are shared between all SMs. To communicate data between the host and device, we must use the global memory, but its latency is quite large. On the other hand, the texture and constant memories are read-only for SMs and accessed data are cached. Thus, the latency can be significantly reduced by accessing the same or adjacent data multiple times. Owing to there being various types of memories in CUDA GPUs, we have to consider well in advance which memories to use, in order to exploit the benefits for computation speed.
Implementation
First, we improve the algorithm to reduce the required memory. In existing methods, the update process is performed using [X] il in the original adjacency matrix X. This implies allocating sufficient memory to store two n × n matrices: X and W. On the other hand, the number of non-ε (non-zero, in conventional algebra) elements, denoted by m, follows m ≤ n · (n − 1)/2 because we are focusing on DAGs. Thus, using a full matrix workspace is redundant for large scale systems.
In view of this, we first convert X to a compressed form and the remaining procedures are performed using the compressed data. It should be noted that the target algorithm occasionally needs the list of succeeding nodes for a given source node. As a format suited to this, we adopt the compressed column storage (CCS) format [10] . Let the integer field be denoted by Z, then the compression result yields the following three arrays.
• Val (∈ R m max ): stores the values of non-ε elements in X in column major order.
• Idx (∈ Z m max ): stores the corresponding row numbers of the elements in array 'Val'.
• Ptr (∈ Z Once the memory space for these arrays has been prepared, if the original matrix X is not needed after the Kleene star computation, this space can be reused for the work matrix W.
We now implement the code for CUDA. As shown in the next section, the bottleneck in the Kleene star computation lies in the update process. Thus, we optimize this part extensively.
In preparation, floating point memory storage for the work matrix W is prepared in global memory. This matrix is initialized to e, where we use (-FLT MAX) to represent ε. Moreover, we prepare two arrays for storing 'Val' and 'Ptr' in texture memory. As pointed out in the previous section, several alternative memories are available. In fact, we experimented with code that used the shared and constant memories, but the performance thereof was not good. Thus, we opted to use texture memory.
Then, W is updated sequentially in topological order from upstream source nodes to downstream ones. Figure 2 depicts the update process for source nodel. The list of succeeding nodes, in other words destination nodes, is obtained from Idx(S), where S = Ptr(l) : (Ptr(l + 1) − 1). Let an element from S and the number of elements of S be denoted as i k ∈ S and |S| = s, respectively. First, the values of [X] i k l and i k (1 ≤ k ≤ s), which are obtained from Val(S) and Idx(S), respectively, are transferred from host memory to texture memory. Next, the values of [W] l j (1 ≤ j ≤ n) are transferred to texture memory. Then, we invoke a kernel to update [W] 
On the kernel side, each invoked thread retrieves the value of the target element [W] if [W] l j = ε. As implied by the above, a huge number of conditional branches occur in max-plus algebra operations. Since there is no branch predictor in GPU processors, this feature may be disadvantageous; dissimilar to floating point computations in conventional algebra.
The kernel is invoked for every source node with one or more succeeding nodes. We illustrate the allocation of blocks and threads in the kernel in Fig. 3 . The kernel includes c × b two-dimensional blocks, with each block having C × B two-dimensional threads, where b = n/B and c = s/C . In current NVIDIA GPUs, B · C ≤ 512 must be followed, and B should be a multiple of 16 for efficient access to global memory, known as coalescing [3] . Thus, B and C should be set with care. We should also note here that the update location for W is continuous with respect to row order, but scattered with respect to column order. After all updates for the source nodesl (1 ≤ l ≤ n − 1) have been completed, the values of X * are stored in W, and the resulting array is transferred from device to host.
For simplicity, we assume that only a single GPU is available in the current explanation. However, if multiple GPUs are available simultaneously, the work matrix can be partitioned column-wise into an arbitrary number of different sized blocks, and the update process can be executed independently in parallel.
Performance Evaluation
The performance of the proposed algorithm is measured using a PC equipped with CUDA GPUs. We use a PC installed with (a) an AMD Athlon TM 64 X2 Dual Core 5600+ 2.90 GHz running Linux Fedora 13 for x86-64, and two NVIDIA GPU cards, namely (b) a GeForce GTS 250 and (c) a Quadro NVS 420. Table 1 shows the specifications of the two GPU cards, noting that cards (b) and (c) have one and two GPUs on a single card, respectively. The compilation and execution environments are: CUDA driver version: 195.36.31; software development kit (SDK) version: 3.0; gcc version 4.4.4 with '-march=athlon64-sse3 -O3' compilation options; and nvcc for CUDA version V0.2.1221 with the '-O3' option. Now we prepare an adjacency matrix X and compute the Kleene star X * . For X, we first attach arcs i → i+1 for all i (1 ≤ i ≤ n − 1), and then append arcs j → i (1 ≤ j ≤ n − 1, j + 1 ≤ i ≤ n) with 1/2 probability. The weights of these arcs obey a [0, 1] normal distribution. Then, we sort the indices of the nodes randomly, and swap the corresponding rows and columns. Each experiment is performed three times with the same random seed, and the median computation time is adopted.
First, we measure the performance using only the CPU (a). Table 2 shows the computation times in milliseconds with a varying number of nodes n = 500, 1000, 2000, and 4000. As the results clearly indicate, the procedure for updating W is the bottleneck and this part requires extensive fine-tuning. Recalling that the time complexity for the update is O(m · n) and noting m n 2 /4 holds in this experiment, the computation time would increase eight fold if n were doubled. This estimation actually holds true for larger n. Table 3 shows the computation times for updating W with varying block sizes B and C for n = 4000 using GPU (b) or (c). It appears that setting B = 64 or B = 32 is acceptable, although the times are not remarkably different for each B. Table 4 shows the computation times for computing X * and the speedup effect compared with CPU (a). The speedup is defined as (computation time using a CPU) / (computation time using GPU(s)). Cases 1 and 2 use a single GPU, whereas case 3 uses two GPUs simultaneously. In using multiple GPUs simultaneously, we must invoke multiple threads on the CPU side, known as the POSIX thread, and this requires some overhead for invocation and termination.
In case 1, the speedup is evident as n increases, but it seems to level off between 17 and 18. In cases 2 and 3, the speedup appears to level off around 1.2 and 2.3, respectively. From Table 1 , the theoretical computation performance can be estimated by multiplying the number of cores by the SP clock speed. In this context, the ratio of the computations speed for case 1 to case 2 should be approximately 17.1, but the actual ratio was approximately 13.7 for n = 4000. This difference may be due to the relatively higher time to convert X to the CCS format in case 1 compared with case 2. We note here that the results using GPUs matched those using the CPU exactly, which indicates that the computation precision of GPUs conforms to standard single precision.
Conclusion
In this research, we focused on accelerating the computation of the Kleene star in max-plus algebra using CUDA GPUs. The primary target was updating the work matrix, the process of which is similar to elementary transformations of a matrix in conventional algebra. Since the SPs do not have a branch predictor, they are not naturally efficient for max operations. Nevertheless, we accomplished a speedup greater than 16-fold with a reasonable GPU, compared with an Athlon 64 X2 including the SSE3 optimization option. Accordingly, it is expected that the speedup would be much more significant using high-end GPUs.
