This article presents a GPU-based single-unit deadlock detection methodology and its algorithm, GPU-OSDDA. Our GPU-based design utilizes parallel hardware of GPU to perform computations and thus is able to overcome the major limitation of prior hardware-based approaches by having the capability of handling thousands of processes and resources, whilst achieving real-world run-times. By utilizing a bit-vector technique for storing algorithm matrices and designing novel, efficient algorithmic methods, we not only reduce memory usage dramatically but also achieve two orders of magnitude speedup over CPU equivalents. Additionally, GPU-OSDDA acts as an interactive service to the CPU, because all of the aforementioned computations and matrix management techniques take place on the GPU, requiring minimal interaction with the CPU. GPU-OSDDA is implemented on three GPU cards: Tesla C2050, Tesla K20c, and Titan X. Our design shows overall speedups of 6-595X over CPU equivalents.
Introduction
Modern systems are becoming increasingly complex, with hundreds or thousands of concurrent processes and resources being utilized and sharing resources at any given instant. This increased level of complexity has led to the higher possibility of systems entering a deadlock state. A deadlock in our context is a situation in which two or more competing processes are each waiting for the other to finish, and thus neither ever does [1] .
In the past, many software-based deadlock detection algorithms [2] [3] [4] [5] [6] were written, but they lacked the speed necessary to make them viable in real world systems. As a result, researchers have developed hardware-based algorithms [7] [8] [9] [10] [11] that expanded upon the findings of these software algorithms. Hardware algorithms led to very fast and deterministic results but lacked the ability to handle an increasing amount of processes and resources (due to size constraints and hardware complexity), as would be seen in a real world system. In lieu of these findings though, we hypothesized that by adopting the methodologies found in the hardware approaches and exploiting their parallel nature on GPU, we may be able to devise a practical solution to the deadlock detection problem in systems with a large number of processes and resources (e.g., 4096 processes and 4096 resources). By applying the hardware algorithmic methodologies and performing creative GPU optimizations, we have been able to provide a deadlock detection approach applicable to real world systems (e.g., computation time of a less than a milisecond in a 4096 process ×4096 resource system).
For this reason, we propose GPU-OSDDA, a GPU-based single-unit deadlock detection algorithm, which keeps track of all resource allocation events on the GPU. It is the CPU's responsibility to pass resource allocation event information to the GPU for computation. In this way, GPU-OSDDA serves as an interactive service to the CPU. When we mention the words interactive service, we refer to the limited interaction that the CPU is required to have with our algorithm. GPU-OSDDA is meant to run in the background, receiving resource event information from the Operating System (OS) and then provide notification to the CPU in the event of any deadlock occurring in the system. In this way, our algorithm provides an unobtrusive notification to the CPU (or OS) regarding the state of its resource events.
A summary of our contribution is as follows: (1) Proposing novel algorithmic methods for GPU acceleration of deadlock detection, achieving two orders of magnitude speedup; (2) Utilizing a bit-vector technique for storing algorithm matrices, thus reducing memory usage dramatically; (3) Handling thousands of processes and resources, while achieving real-world run-times; (4) Offering as an interactive service to the OS, requiring minimal interaction with the CPU; (5) Bridging the gap between problem size and run-time of deadlock detection algorithms.
Background

Related Work
There have been a multitude of software-based deadlock detection algorithms proposed in the past that handle resource events in single-unit resource systems. In 1972, Holt [2] first introduced a resource allocation graph-based deadlock detection approach that had an O(m × n) run-time complexity, where m and n are the process and resource amounts, respectively. Following this development, Leibfried [6] designed an algorithm that utilized the adjacency matrix. Leibfried's approach used matrix multiplication in order to determine reachability information, which led to an algorithm with an O(m 3 ) run-time complexity. Later, Kim and Koh [4] devised a tree-based algorithm that improved upon the prior deadlock detection run-time. Their tree-based algorithm was able to detect deadlock in O(1) run-time; however, the caveat to this approach was that it required an O(m + n) run-time for the resource release phase of the algorithm. The completion of this release phase was required for the algorithm to handle the next invocation of deadlock detection.
In recent years, there has been a progression towards parallel hardware-based algorithms to detect deadlock. These algorithms are deterministic and have accomplished low run-time complexities in hardware. One of such algorithms, known as HDDU, was developed by Xiao and Lee [8] in 2007. This algorithm had a deadlock detection run-time of O(1) and a detection preparation run-time of O(min(m,n)). Later, Xiao and Lee developed a new approach to classifying resource events in a single-unit resource system. This development led to a new algorithm known as O(1) Single-Unit Deadlock Detection Algorithm (OSDDA) [9] . By utilizing the new classification of resource events, deadlock preparation was able to be completed in O(1) time. As a result, OSDDA was able to achieve an overall run-time complexity of O(1) in hardware. The basis of our algorithm is rooted in the methodology of OSDDA [9] . The core operation of OSDDA is based upon a classification of re-source events in the system which is further discussed in Section 2.4. Note that the problem of deadlock detection for multi-unit resources is out of scope of this article.
The RAG and its Adjacency Matrix Representation
The events occurring between processes and resources of a system are represented as a bipartite graph known as the Resource Allocation Graph (RAG). It contains two disjoint sets: a process set P and a resource set Q. There are two types of edges between these disjoint sets. The first edge type is known as a resource request edge. It is a directed edge from a process node p i in set P to a resource node q j in set Q that denotes process p i has requested resource q j . The second edge type is known as a resource grant edge. This edge is a directed edge running from a resource node q j in Q to a process node p i in P that denotes resource q j has been granted to process p i . The RAG can also be represented as two separate adjacency matrices: Adjacency Request (AR) and Adjacency Grant (AG), which hold the resource request and grant information, respectively [8] . AG[] and AR[] can be defined as follows:
where 1≤ i ≤ m and 1≤ j ≤ n.
Terms and Assumptions
For the sake of understanding the computations of deadlock detection in GPU-OSDDA, we here introduce relevant terms.
Definition 1.
An interactive service is an application or program that runs in the background with limited interaction with the operating system. Definition 2. A single-unit resource is a resource that serves at most one process at any given instant.
Example single-unit resources can be as simple as a USB port, a printer port, or a network port, or as complex as files, memory pages, or every computing node in a cluster of machines or cloud.
Definition 3. A single-unit request system is a system in which a process may request only one unit at a time and thus has at most one outstanding request [12] .
Definition 4.
A system is in an expedient state if any request for an available unit is granted immediately [2] .
The system under consideration is a single-unit resource system with m number of processes and n number of resources, which we refer to as an m × n system. It is also a single-unit request and expedient system. In such a system, a single-unit request also means requesting a single resource only per request (i.e., a request for multiple resources by one command is not valid).
Definition 5.
A sink process node is a non-blocked process node (no outgoing edge) with at least one granted resource (incoming edge) [2] . Definition 6. An active process is a process which has no pending resource request (no outgoing edge) but may have granted resources (incoming edges).
Definition 7.
A node v i is reachable from a node v j if and only if there exists a path that starts from v j and ends at v i [2] . Thus, v i is called a reachable node of v j . Figure 1 illustrates a Resource Allocation Graph (RAG) containing three processes and three resources. Solid arrows denote resource grant edges, and dotted arrows denote request edges. Black dots mean resource units. In the figure, p 0 is an active process and its node is a sink node, whereas p 1 and p 2 are blocked waiting for q 2 . At the moment, p 0 is a reachable sink node of all resources. The proposed algorithm adheres to the following assumptions:
(1) Each resource contains a single unit (see Definition 2) . Thus, a cycle in the RAG is a necessary and sufficient condition for deadlock [9] . (2) A process requests one resource at a time (see Definition 3). Thus, a process is blocked as soon as it requests an unavailable resource [2] . (3) A resource is granted to a process immediately if the resource is available.
As a result, the entire system is always in an expedient state (see Definition 4) [2] . (4) Resource events are managed centrally (e.g., by the OS).
These assumptions are very typical ones made in deadlock research. Note that the concerns of livelock, priority inversion, etc. are out of scope of this article.
Underlying Theory
OSDDA is truly unique because its overall algorithm run-time is O(1) in hardware. It is able to achieve this by performing parallel computations on a RAG based on the classification of resource events in the system. The three types of resource events in OSDDA are: granted resource requests, blocked resource requests, and resource release [9] . The resource events and deadlock detection capability of OSDDA are briefly discussed in Sections 2.4.1 and 2.4.2.
Resource Events
Let us first discuss the resource request granted event. For this event to occur, when a process p i requests a resource q j , q j has to be available (not granted to another process) and p i needs to be an active process. This means that resource q j must not have any incoming or outgoing edge and process p i may have incoming edges but no outgoing edge (since the system is in an expedient state (see Definition 4)). If these criteria are satisfied, resource q j may be granted to process p i . This event causes q j to change its reachable sink node.
Next we'll discuss the resource request blocked event. When a process p i requests a resource q j and there is no available unit of q j , the process p i is blocked. Prior to the request, process p i has to be an active process, and thus, p i has no outgoing edge when the request is made. By definition of an active process, p i could have already been granted resources and as a result have incoming edges. Furthermore, resource q j has an outgoing edge (as it is not available) and may have incoming edges (pending requests). As a result, two scenarios of the resource request blocked event exist:
(1) Block (i) -Before the request is blocked, p i has no incoming edges; q j has one outgoing edge; q j may or may not have incoming edges. After the request is blocked, only a request edge p i → q j is inserted in the RAG [9] . (2) Block (ii) -Before the request is blocked, p i has incoming edges; q j has one outgoing edge; q j may or may not have incoming edges. After the request is blocked, for all resources belonging to p i 's sub-tree, their sink nodes are set to q j 's sink node and their reachable processes and resources are also updated to include q j 's reachable nodes [9] .
Finally, we'll discuss the resource release event. For process p i to release its resource, it must be an active process by having no outgoing edge. While servicing the resource release event, the algorithm must determine if resource q j has any pending resource requests (incoming edges). If q j has a pending request from a process p t , q j is granted to p t after the release due to the system being in an expedient state. Depending on if resource q j has pending requests, two separate resource release scenarios exist.
(1) Release (i) -Before released, q j has no incoming edges; p i may have one or more incoming edges. After the resource is released, only a grant edge q j → p i is removed in the RAG, and thus, q j is no longer reachable to p i [9] . (2) Release (ii) -Before released, q j may have one or more incoming edges; p i may have one or more incoming edges; p t may or may not have incoming edges. After release, q j is assigned to p t . In this case, the sink nodes of all of q j 's sub-tree resources are changed to p t , and also they are no longer reachable to p i [9] .
Figure 2(a) shows an example RAG of a 3×3 system consisting of three processes (p 0 , p 1 , p 2 ) and three resources (q 0 , q 1 , q 2 ). Since this is a single-unit system, each resource has one unit. Accompanying the RAG in Figure 1 are the associated adjacency matrices AG and AR that are formed via the prior definitions. In the RAG, there exist three resource grant edges (q 0 → p 0 , q 1 → p 2 , and q 2 → p 0 ) and two resource request edges (p 1 → q 0 and p 2 → q 2 ). On each resource event, the operating system sends the event information to the GPU so that it may update its RAG (i.e., AG and AR) and initiate deadlock detection if necessary.
Furthermore, by looking at Figure 2 (b), it can be seen that under the resource release event where p 0 releases q 0 , q 0 is then granted to the blocked process p 1 . This is an example of the system being in an expedient state. Notice also that AG and AR have been updated accordingly. Lastly, in Figure 2 (c), process p 0 requests q 0 and is blocked due to q 0 having been granted to process p 1 .
O(1) Deadlock Detection
It is known that as long as the sink process node for every resource in the system has been identified, then deadlock can be detected in O(1) time as reported in [4] and [8] . We know the reachable sink process node of a resource q j is process p i if and only if p i is a sink process node and a path from resource q j to process p i exists. A cycle occurs in the system when the sink process node (say p i ) of a resource (say q j ) requests the resource. By our system assumptions, a cycle in the RAG is a necessary and sufficient condition for deadlock, and thus, under this scenario, a deadlock exists.
To achieve O(1) run-time of deadlock detection, OSDDA maintains the sink information for all resources in the system for use in upcoming invocations of deadlock detection. The sink information is stored in a matrix known as Sink.
Furthermore, for a release (ii) event, the OSDDA algorithm needs to identify resources on the sub-tree of the released resource (q j ) as well as those on the subtree of the process acquiring q j [8] . For this, OSDDA utilizes the ReachableResource or RR and the ReachableProcess or RP matrices to maintain information on what resources and processes are reachable from every resource, respectively [8] . The matrices are defined as follows:
1 if a path exists from resource q j to q k or k = j, 0 otherwise.
1 if a path exists from resource q j to p i , 0 otherwise.
Our Methodology
Introduction
During the development process of our preliminary version of GPU-OSDDA, we implemented two versions, one with characters and the other with integers to represent our matrix elements. For both the character and integer-based approaches, time was spent optimizing and tweaking GPU code to ensure that occupancy was high, coalesced memory accesses were occurring, and threads were kept busy, following GPU programming guidelines. This enabled us to maintain a high IPC ratio and maximize the memory bandwidth for our problem. We utilized the NVIDIA Compute Visual Profiler to gauge our results at each step and came to a point where we were satisfied with the optimizations. However, with all optimizations complete, we were only able to achieve 3-24X speedup over our CPU implementation, dubbed CPU-OSDDA. These speedups, while an improvement, did not grant us the kinds of speedups we were looking for. Using our initial approach as a baseline for measuring performance, we began rework on the algorithm. Our new approach took a drastically different approach on matrix storage and algorithm computation. We thought this different approach was necessary in order to maximize speedup and yield an algorithm that would be applicable to real world systems. Thus, we decided to implement our entire algorithm with integer length bit-vectors. We hypothesized that this approach would reduce our memory footprint by a factor of 32, thus allowing for an increasing amount of processes and resources the algorithm could handle, as well as simplify and accelerate the bit-wise computations of our algorithm. Advantages of this approach are discussed throughout the remaining subsections, as well as how GPU-OSDDA handles each resource event type. This kind of GPU-targeted bit-packed approach has never been well reported in the literature as far as we know.
Novel Bit-Vector Design
Since GPU-OSDDA is based on a single-unit system, all values that indicate the state of a process or resource in the system can be represented as binary values (0,1). In this case, instead of using an 8 or 32-bit variable to hold a 1-bit value, we bitpack 32 processes or resources into a single 32-bit unsigned integer. Figure 3 shows how we would create a 128 × 128 adjacency matrix using 32-bit unsigned integers, where each box in the image represents a 32-bit unsigned integer. Similarly, we could create an adjacency matrix where the rows/columns are reversed. Additionally, Table 1 describes variables used throughout the remaining algorithm descriptions using Figure 3 . By using left bit-shifting to achieve multiplication and right bit-shifting to achieve division in our algorithm, the LIPR and LINTBITS values are necessary to gain proper offsets when calculating matrix indices. This is due to the fact that each bit-shift (left or right) implies a change in magnitude by a power of two. Now that all algorithm critical information has been discussed, we present the overall kernel structure with the pseudo-code in Algorithm 1. The pseudo-code presents the kernels called upon each resource event type. In lines 4-5, GPU-OSDDA handles the resource request granted event. The Request Granted kernel launches a single block containing a single thread to perform the updates discussed in Section 3.3. Following the resource grant, we update the reachability information in the system by calling the Release Update Reachability kernel in line 27. The release resource event is discussed in detail in Section 3.5.
Handling a Resource Request Granted Event
To handle a resource request granted event, GPU-OSDDA launches a kernel with a single block containing a single thread. The computation involved in this event does not advocate parallelism; however, GPU-OSDDA manages and maintains the RAG on the GPU, which makes it necessary to launch this small kernel. Algorithm 2 shows the assignments made in our kernel. Since we utilize a bit-packing technique to represent all algorithm matrices, we have to use a special method of referencing the correct process/resource pair for assignment.
In order to find the correct bit corresponding to the process in the grant event, handled by the Request Granted computation, we perform in line 3 the modulo of p i by INT BITS. After obtaining the correct bit in the integer, we find the exact integer index to be altered in each adjacency matrix. This is computed in line 4 by left-shifting the row we want (q j ) by log 2 (INTS PER ROW). To this we add the process number right-shifted by log 2 (INT BITS). The sum of these two numbers yields the index of the integer we want to alter in the adjacency matrix. This computation is very similar to calculating the global thread ID of one dimensional multiblock grid of a GPU programming model. Bit-shifting (left and right) are used instead of multiplication and division, respectively, for efficiency. Here the size of the matrices should be powers of two, so that we can perform the bit-shift operations for index calculations. Note that M and N do not need to be equal in size, but they must be powers of two. If M and N are not equal, the INTS PER ROW value will change depending on which matrix we address. For ease of explanation in this article, we assume that M and N are equal. Note however that for those cases where the number of processes or the number of resources is not power of two, a common zero-padding method can be used to make the matrix size power of two. Nonetheless, as information is bit-packed, it will incur neither much space overhead nor much computation work as padded values are all zeros.
In order to perform assignments to the adjacency matrices, we perform bit-wise OR computations with the appropriate mask. The mask is created by shifting a 1 into the location specified by the bit variable we calculated in line 3. Upon performing the bit-wise OR operations in lines 8-10, our Request Granted kernel is complete.
Algorithm 2 Request Granted≪1,1≫
1: // Refer to Table 1 for variable As a summary, in the Request Granted kernel, it can be seen that the AG[] matrix is updated to reflect the assignment q j → p i . Similarly, by resource q j being granted to p i , p i becomes the new reachable sink node of resource q j , denoted by the Sink[] matrix. It follows that process p i is reachable from resource q j as denoted by the RP[] matrix assignment. Figure 4 summarizes the actions taken by the resource request granted kernel. threadIdx.x = 0 
Handling a Resource Request Blocked Event
GPU-OSDDA handles a resource request blocked event through several stages. The initial step, which we denote as DeadlockCheck Init (Algorithm 3), checks whether or not the requesting process is the current sink node of the requested resource. According to [9] , if a resource request event occurs where the requesting process is the current sink node of the resource being requested, a cycle forms in the RAG and a deadlock occurs. Otherwise, AR[] is updated to reflect the blocked request of p i → q j . The DeadlockCheck Init kernel utilizes a similar technique seen in the Request Granted kernel to determine the indices needed for its computations. First, we determine which bit needs to be checked and/or set in the kernel. We first determine the bit to be checked in the Sink[] matrix in line 5. Since we want to check a single bit in an integer, we perform the modulo of process p i by INT BITS. Similarly, we need a bit for the AR[] matrix. The reason for building two separate indices is that the Sink[] and AR[] matrices take the form of resource × process and process × resource, respectively. For the AR[] matrix, we gain the bit to check by performing the modulo of resource q j by INT BITS in line 6. As can be seen in Algorithm 3, we continue by constructing two separate global indices; sidx and aidx in lines 7-8. The sidx index yields the position of the integer we want to check in the Sink[] matrix, while the aidx index yields the position of the integer for assignment in the AR[] matrix. The combination of both the global integer index and the associated bit index enables us to check or alter a single bit in the appropriate adjacency matrix.
If a deadlock occurs (checked in line 12), then we update the deadlock detection flag in line 13 for the CPU to handle the deadlock event. Otherwise, the resource request is blocked in line 15 by updating the value corresponding to the request p i → q j in the AR[] matrix.
Algorithm 3 DeadlockCheck Init≪1,1≫
1: // Refer to Table 1 If a deadlock does not occur, reachability information of the RAG needs to be updated if the requesting process holds additional resources. Otherwise, the reachable sink nodes do not change, so no additional computation is necessary. The task of updating reachability information for a RAG is computationally expensive, unlike O(1) of OSDDA [9] . Nevertheless, our implementation of the reachability update computation benefits greatly from the parallelism offered by the GPU and is further accelerated by our bit-vector approach to the algorithm. Algorithm 4 shows the pseudo-code for our Request Blocked kernel, which performs the reachability update. As can be seen by the kernel overview in Algorithm 1, we launch N blocks with INTS PER ROW threads per block. This kernel structure allows us to perform all bit-wise computations in this kernel simultaneously, except for the serialization of the Sink[], RR[], and RP[] updates per thread. The bit-vector approach we implement allows us to perform computations for 32 processes or resources per integer index in an adjacency matrix. This approach granted us a dramatic speedup in the run-time of our algorithm, which will be depicted in the Experimentation and Results section of this article. The first step in our Request Blocked kernel (Algorithm 4) is to determine the indices in our adjacency matrices (lines 3-7).
We also allocate a temporary sink matrix, Sink [9] in lines 13-15. Line 13 makes the sink node of all resources on the subtree of p i equal to q j 's sink node. Then in lines 14-15, the resources on p i 's subtree include the reachable resources and processes of q j . Following Algorithm 4, a summary of the operations performed for the Request Blocked kernel is provided. 9: // For all the resources that belong to the subtree of pi, 10: // their sink nodes are now set to qj 's; their reachable 11: // resource and process nodes include qj 's. 
Algorithm 4 Request Blocked≪N,IPR≫
As in [9] , for all resources belonging to p i 's sub-tree, their sink nodes are set to q j 's sink node. Their RR[] and RP[] matrices are also updated to include q j 's reachable nodes. The biggest advantage we gain during this computation is that per each assignment or bit-wise OR operation, we effectively update 32 process/resource pairs per thread. More specifically, each row of a matrix is handled per block with each column (integer) being handled by a thread. Figure 5 depicts the operations taking place during the reachability computation, where resources q 1 and q 127 are updated to have a sink node of p i , and q j is assumed to be q 0 (i.e., the first row). 
Handling a Resource Release Event
In handling a resource release event, GPU-OSDDA first has process p i release resource q j by updating AG[], as the pseudo-code in Algorithm 5 depicts. We first determine the bit that represents the process p i releasing resource q j in line 3. Following a familiar procedure, we compute p i modulo INT BITS to represent our process bit. Then to determine the integer index into the AG[] matrix that we need to alter, we compute q j left-shifted by log 2 (INTS PER ROW) to give us the proper row of AG[] in line 4. We then add this to p i right-shifted by log 2 (INT BITS) to give us the column integer that we want to address. This sum then yields the index to address in AG[]. From there, the operation in line 7 performs a bit-wise AND operation to update AG[] reflecting that process p i released resource q j .
Algorithm 5 Release Resource≪1,1≫
1: // Refer to Table 1 for variable GPU-OSDDA then checks if a process is waiting on q j by performing a reduction on AR trans[q j ][] (transpose of the AR[] matrix). In the event that this reduction returns 0, it informs us that no process is waiting on q j and that it belongs to a release event (i), explained in Section 2.4.1 and detailed in [9] . From there, GPU-OSDDA updates the Sink[] and RP[] matrices to indicate that q j has no sink and that p i is no longer reachable from q j . Algorithm 6 depicts the update process of Sink[] and RP[]. Since both Sink[] and RP[] are resource × process matrices, we are able to utilize the same bit and index to update necessary information. Notice the procedure in lines 3-4 in Algorithm 6 is similar in terms of finding the appropriate bit and index. After computing this information, we perform a bit-wise AND operation in lines 7 and 9 to clear the corresponding bit in the adjacency matrices.
Algorithm 6 Update Sink RP≪1,1≫
1: // Refer to Table 1 for variable If the reduction of AR[][q j ] is not equal to 0, this indicates that the release event belongs to the release (ii) scenario, explained in Section 2.4.1 and detailed in [9] . In this case, GPU-OSDDA updates AG[] and AR[] to indicate that the released resource is granted to a waiting process. Algorithm 7 depicts the update process for AG[] and AR[]. Since the AG[] and AR[] matrices take the form of resource × process and process × resource respectively, they both need their own bit and global index variables to update the correct bit. As performed for all of our updates thus far, we find the correct bit by computing the modulo of the bit we want with INT BITS in lines 5-6. Following that, lines 7-8 perform familiar computations to find the integer index into the adjacency matrix that we want to update. Finally, we update AG[] and AR[] by performing bit-wise OR operations on the corresponding index and bit in lines 11-12.
Algorithm 7 Update AG AR≪1,1≫
1: // Refer to Table 1 for variable Following this step, reachability and sink information needs to be updated. // For pt's sub-tree resources that were able to reach qj To start, in line 3 we create an array in shared memory called newSink [], which we use to update sink information later in the kernel. Lines 8-9 assign the block and thread variables to row and tid, respectively, which are used for calculating the bit, column, and index variables. It can be seen that the same familiar process to find needed bits (line 10), columns (line 11), and indices (line 12) has been performed. This kernel, however, has every block handle a row in the matrices involved in computation. In lines 15-19, the newSink shared variable is populated to hold the new sink node, i.e., process p t . In line 22, we check the RR[] matrix to obtain all of q j 's subtree resources. After finding all of q j 's subtree resources, we assign them the new sink node of p t in line 24. Since p i released resource q j , p i is no longer reachable from q j and is removed from the RP[] matrix in line 30. In line 33, we check if p t 's subtree resources were able to reach q j . If yes, q j is no longer reachable from those resources so we remove q j from the RR[] matrix in line 35. Otherwise, for q j 's subtree resources that were not reachable to p t , p t becomes reachable and the RP[] matrix is updated to reflect the change in line 38.
As a summary, the Release Update Reachability kernel updates all sink nodes in q j 's sub-tree to p t . The process p i that released the resource is no longer reachable from q j and its sub-tree, so the RP[] matrix is updated accordingly. The final steps in our computation require that all p t 's sub-tree resources that were previously able to reach q j be removed from the RR[] matrix. Conversely, if q j 's sub-tree resources were not reachable to p t , p t now becomes reachable and the RP[] matrix is updated.
Supplementary Kernels
One may notice that there were three additional kernels in the overview code that were not discussed so far, the BitMatrix Tranpose, Tile Transpose, and Row Reduction kernels. The combination of the BitMatrix Transpose and Tile Transpose kernels enables us to transpose our bit-vector matrices, which ensures that coalesced global memory accesses occur in our Row Reduction kernels. This is why the BitMatrix Transpose and Tile Transpose kernels always precede the Row Reduction kernel.
While these kernels are supplementary to the GPU-OSDDA functionality, they provide substantial speedups with regard to the run-time of our algorithm. Without performing the bit-matrix transpose that advocates global memory coalescing in kernels, we would have seen a great loss in efficiency (since memory coalescing would not occur) and run-time.
Bit-vector Matrix Transpose Kernels
Performing the transpose of a bit-vector matrix can be a complicated task. Fortunately, this computation has been studied by [13] ; however, the transpose in [13] only works on a 32-bit × 32-bit matrix. Therefore, to make this solution fit to our problem (as our matrices are much larger) and to parallelize the computation, we sub-divided the transpose of our matrices into 32-bit × 32-bit tiles (seen as TILE DIM in Algorithm 1). Then we launch TILE DIM blocks and TILE DIM threads in the BitMatrix Transpose kernel. This ensures in the first step that each thread transposes a 32-bit × 32-bit tile of the matrix, as shown in Figure 6 . After the elements within each tile have been transposed, the Tile Transpose kernel performs an outer tile transpose, i.e., tile by tile as shown in Figure 7 , to place all elements in the correct positions. By performing the transpose of our entire matrix in tiles, we were not only able to parallelize the transpose but also able to enable coalesced global memory accesses for its following kernels, thereby leading to a fast bit-vector matrix transpose operation. Figure 7 . Outer tile bit-matrix transpose.
Bit-vector Row Reduction Kernel
Our Row Reduction kernel also greatly benefits from our bit-vector approach. All of the reductions in GPU-OSDDA are used to check to see if the row or column of a particular matrix is zero. This works well with the bit-vector approach. When we perform our reduction, we simply add each integer of a particular row or column together (an add reduction) and determine if the total is zero or not. This allowed us to compute the row or column reduction 32X faster than if we had not taken the bit-vector approach in storing our adjacency matrices. To optimize our reduction kernels even further, we perform the first add of the reduction when we populate shared memory, thus allowing us to launch half the number of threads required in a standard reduction. From that point, we perform the reduction while unrolling the last warp utilizing the warpReduce function found in [14] .
Experimentation and Results
All experiments were performed on an Intel Core i7 CPU @ 2.8 GHz with 12 GB RAM. The CUDA GPU-OSDDA implementation was tested with different GPUs: Tesla C2050, Tesla K20c, and Titan X. The Tesla C2050 has 14 SMs (448 CUDA Cores) with 3 GB Global Memory, the Tesla K20c has 13 SMXs (2496 CUDA Cores) with 5 GB Global Memory, and the Titan X has 24 SMs (3072 CUDA Cores) with 12 GB Global Memory.
A serial version of GPU-OSDDA is implemented using the C language, referred to as CPU-OSDDA. We attempted to create a multi-threaded version in CPU using OpenMP. However, there was no speedup because the algorithm is not computationally intensive but is bounded by memory read and write. In this case, the overhead of thread management makes the run-time slower. Therefore, we use the serial version CPU-OSDDA to compare the performance with parallel computation.
To verify the correctness of our algorithm, both CPU-OSDDA and GPU-OSDDA were tested using RAGs of different sizes and complexities. To create the RAGs, we prepare a list of events. CPU will send out each event sequentially and launch GPU kernels depending on the event type.
The response time for each type of events is measured. The Resource Request Granted Event has trivial response time due to its simple algorithm. For Resource Blocked Event, we created an event list of a worst-case scenario p 1 → q 1 → p 2 → q 2 . . . → p M → q M → p 1 (2M events in total, where M is the number of processes).
Then the time to update all the state matrices for the last blocked event p M → q M before deadlock happens is recorded in Table 2 . Using NVIDIA Visual Profiler [15] , we measured the communication time and the computation time of the GPUs. As shown in Figure 9 , the larger the input size is, the longer the computation time it takes out of total time. As the communication overhead has less effect, we were able to gain higher speedup for larger input sizes.
In a similar manner, for Resource Release Event, we tested the average-case scenario by modifying the length of resource-process chain which has the same sink to the process releasing the resource. The reason to test for average-case is to provide a big picture on the average speedup of using GPU under typical scenarios. The time is recorded in Table 3 , which shows less speedup than Resource Blocked Event because in CPU-OSDDA, the algorithm itself for Resource Release Event has linear complexity, whereas Resource Blocked Event has polynomial complexity. Figures 8 and 10 depict the associated speedups of each set size on each piece of target hardware. From the result, the increasing size of RAG dramatically increases CPU-OSDDA's run-time. However, GPU-OSDDA scales well with increasing process and resource amounts. Figure 10 . GPU-OSDDA speedup for released event (average-case) against CPU version
Conclusion
A new approach to deadlock detection for single-unit systems on GPU has been devised and developed using CUDA C. By leveraging facts about single-unit systems, we were able to devise a bit-vector technique for storing our matrices, which led to efficient algorithmic computations and drastically saved memory space on the GPU. These factors allow GPU-OSDDA to handle systems with increasing
