Abstract: Embedded electronic devices like mobile phones and automotive control units must perform under strict timing constraints. As such, schedulability analysis constitutes an important phase of the design cycle of these devices. Unfortunately, schedulability analysis for most realistic task models turns out to be computationally intractable (NP-hard). Consequently, different techniques have recently been proposed to accelerate schedulability analysis algorithms, including parallel computing on Graphics Processing Units (GPUs). However, applying traditional GPU programming methods in this context restricts the effective usage of on-chip memory and in turn limits how fully the inherent parallel processing capabilities of GPUs can be exploited. In this paper, we explore the possibility of accelerating schedulability analysis algorithms on GPUs while exploiting on-chip memory. Experimental results demonstrate up to 9× speedup of our GPU-based algorithms over sequential CPU implementations.
I. INTRODUCTION
Many embedded devices like mobile phones and automotive control units have to satisfy strict timing constraints. Hence, the design cycles of these devices involve a schedulability or timing analysis phase. Typically, a system designer would choose the values of certain parameters (e.g., processor frequency and task deadlines) and invoke a schedulability analysis tool to determine whether the timing constraints are met. If such an analysis returns a negative answer, then some of the parameters are modified and the analysis is invoked once again. This iterative adjustment of the system parameters is repeated until the performance constraints are satisfied. This process is illustrated in Figure 1. However, schedulability analysis for most realistic task models is computationally intractable (NP-hard) [7]. Hence, each iteration takes a long time to run, which critically impacts the usability of the tool in interactive design sessions. In order to reduce these long running times, in this paper we propose a technique for accelerating a schedulability analysis algorithm by implementing it on Graphics Processing Units (GPUs). In particular, we show that effectively exploiting the GPU on-chip shared memory can lead to larger speedups than a straightforward parallel implementation of the algorithm on the GPU.
Motivation for using GPU: Our work has been motivated by the recent trend of applying GPUs to accelerate non-graphics applications. There are many compelling reasons behind exploiting GPUs for such non-graphics related applications (in contrast to using, say, ASIC/FPGA-based accelerators). Firstly, modern GPUs are extremely powerful; high-end GPUs, such as the nVIDIA GeForce 8800 GTX, have peak rates of around 330 GigaFLOPS, whereas high-end general-purpose processors are only capable of around 25 GigaFLOPS. Secondly, GPUs are now commodity items as their costs have dramatically reduced over the last few years. Hence, the attractive price-performance ratios of GPUs give us an enormous opportunity to change the way computer-aided tools for embedded system design perform, with almost no additional cost. In fact, recent years have seen the increasing use of GPUs for different general-purpose computing tasks. These span numerical algorithms [11], computational geometry [1], database processing [2], and image processing [14]. Of late, there has also been a lot of interest in using GPUs to accelerate computationally expensive algorithms in the computer-aided design of electronic systems [10], [9], [12]. Our paper follows this line of work and proposes a novel technique to implement schedulability analysis algorithms on GPUs.
prespecified time interval. On the other hand, the interactive analysis scheme [5] exploits the repetitive nature of the iterative design cycle in order to achieve speedup. When the algorithm is invoked for the first time, the full algorithm is allowed to run, but certain data structures are created and stored. When the algorithm is invoked in successive iterations after modifying a small set of system parameters, these data structures are exploited to partially run the algorithm and still guarantee the correct result. However, this scheme allows only a small number of the system parameters to be changed in each iteration, and hence does not scale well to a large number of changes or to the case where a completely new design needs to be analyzed.
Recently, Graphics Processing Units (GPUs) were also utilized to improve the running times of the schedulability analysis engine [6]. Unlike the techniques mentioned above, the approach proposed in [6] always gives optimal results, and it is not restricted to a small number of parameter changes to achieve the speedup. However, this technique used a traditional GPU programming model with Cg [13] and OpenGL [16]. This model does not expose the on-chip memory to programmers and thus limits how fully the inherent parallel processing capabilities of GPUs can be exploited. In this context, Compute Unified Device Architecture (CUDA) [15], which is a new parallel computing architecture based on GPUs, seems to be a promising alternative. Unlike the traditional GPU programming model, CUDA GPUs can be programmed using an extension of C and require no previous expertise in graphics programming. In this paper, we explore the possibility of accelerating system-level design analysis algorithms with General Purpose GPU (GPGPU) programming with CUDA. In particular, we present the execution speedups achieved by implementing the schedulability analysis on CUDA as well as the performance enhancements corresponding to the usage of on-chip memory.
We demonstrate our technique using the schedulability analysis algorithm of the recurring real-time task model proposed by Baruah in [4] . We chose this task model because it is especially suited for accurately modeling conditional real-time code with recurring behavior, i.e., where code blocks have conditional branches and run in an infinite loop, as is the case in many embedded applications. We describe the recurring real-time task model and its schedulability analysis in the next section. Section III describes the CUDA architecture and programming model. Thereafter, our proposed technique of accelerating the schedulability analysis using CUDA is described in Section IV and the results are reported in Section V.
II. RECURRING REAL-TIME TASK MODEL
As mentioned above, in this paper we consider the recurring real-time task model. A recurring real-time task $T$ is represented by a task graph, which is a directed acyclic graph with a unique source (a vertex with no incoming edges) and a unique sink (a vertex with no outgoing edges). Associated with each vertex $v$ of this graph is its execution requirement $e(v)$ and deadline $d(v)$. Whenever a vertex $v$ is triggered, it generates a job which has to be executed for $e(v)$ amount of time within $d(v)$ time units from the triggering time. Each directed edge $(u, v)$ in the graph is associated with a minimum intertriggering separation $p(u, v)$, denoting the minimum amount of time that must elapse before vertex $v$ can be triggered after the triggering of vertex $u$. A sequence of vertices $v_1, v_2, \ldots, v_k$ triggered at time instants $t_1, t_2, \ldots, t_k$ is legal if there are directed edges $(v_i, v_{i+1})$, and $t_{i+1} - t_i \ge p(v_i, v_{i+1})$ for $i = 1, 2, \ldots, k-1$. The only exception is that $v_{i+1}$ can also be the source and $v_i$ the sink vertex. In this case, if there exists some vertex $v_j$ with $j < i$ in the sequence such that $v_j$ is also the source vertex, then $t_{i+1} - t_j \ge P(T)$ must additionally be satisfied, where $P(T)$ is the period of the task. The real-time constraints require that the job generated by triggering vertex $v_i$, where $i = 1, 2, \ldots, k$, be assigned the processor for $e(v_i)$ amount of time within the time interval $(t_i, t_i + d(v_i)]$.
Figure 2 illustrates an example of a recurring real-time task. In this task, vertex $v_3$, for instance, has an execution requirement $e(v_3) = 6$, which must be met within 10 time units (its deadline) from its triggering time. The edge $(v_1, v_3)$ has been labeled 10, which implies that the vertex $v_3$ can be triggered only after a minimum of 10 time units from the triggering of $v_1$ (i.e., the minimum intertriggering separation time). Edges $(v_1, v_2)$ and $(v_1, v_3)$ from vertex $v_1$ imply that either $v_2$ or $v_3$ can be triggered after $v_1$. The period of the task (the minimum time interval between two consecutive triggerings of the source vertex) is 50.
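The triggering rules can also be checked mechanically. The following Python sketch is illustrative only: the function name `is_legal`, the integer vertex numbering, the sink vertex 4, and every edge label other than the separation of 10 on the edge from vertex 1 to vertex 3 and the period of 50 are assumptions made for this example, not values taken from the figure.

```python
def is_legal(seq, p, source, sink, period):
    """Return True if seq, a list of (vertex, time) pairs, is a legal triggering sequence.

    p maps each directed edge (u, v) to its minimum intertriggering separation.
    """
    last_source_time = None
    for k, (v, t) in enumerate(seq):
        if k > 0:
            u, t_prev = seq[k - 1]
            if (u, v) in p:
                # Ordinary edge: respect the separation on (u, v).
                if t - t_prev < p[(u, v)]:
                    return False
            elif u == sink and v == source:
                # Wrap-around from sink to source: time must not go backwards,
                # and the period since the last source triggering must elapse.
                if t < t_prev:
                    return False
                if last_source_time is not None and t - last_source_time < period:
                    return False
            else:
                return False  # no edge from u to v
        if v == source:
            last_source_time = t
    return True


# Hypothetical graph: only p(1, 3) = 10 and the period 50 come from the text.
p = {(1, 2): 5, (1, 3): 10, (2, 4): 8, (3, 4): 12}
print(is_legal([(1, 0), (3, 10), (4, 22), (1, 50)], p, source=1, sink=4, period=50))
print(is_legal([(1, 0), (3, 9)], p, source=1, sink=4, period=50))
```

The first sequence respects every separation and the period, while the second triggers vertex 3 only 9 time units after vertex 1 and is therefore rejected.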
A. Task Sets and Schedulability Analysis
A task set $\mathcal{T} = \{T_1, T_2, \ldots, T_n\}$ consists of a collection of task graphs, the vertices of which can get triggered independently of each other. A triggering sequence for such a task set $\mathcal{T}$ is legal if and only if for every task graph $T \in \mathcal{T}$, the subset of vertices of the sequence belonging to $T$ constitutes a legal triggering sequence for $T$. In other words, a legal triggering sequence for $\mathcal{T}$ is obtained by merging together (ordered by triggering times, with ties broken arbitrarily) legal triggering sequences of the constituting tasks.
The schedulability analysis of a task set $\mathcal{T}$ is concerned with determining whether the jobs generated by all possible legal triggering sequences of $\mathcal{T}$ can be scheduled such that their associated deadlines are met. In this paper, we assume earliest deadline first (EDF) based preemptive uniprocessor schedules. However, all results presented here can be extended to other scheduling policies (e.g., fixed-priority) as well.
A demand-bound criterion-based schedulability analysis states that a task set $\mathcal{T}$ is schedulable if and only if $\sum_{T \in \mathcal{T}} T.\mathrm{dbf}(t) \le t$ for all $0 < t \le t_{\max}$. For any given task $T$, the function $T.\mathrm{dbf}(t)$ is referred to as the demand-bound function. It takes as an argument a positive real number $t$ and returns the maximum possible cumulative execution requirement of jobs that can be legally generated by $T$ and which have their ready-times and deadlines both within a time interval of length $t$. It can be proved that $t_{\max}$ is bounded by an expression involving $E(T)$, where $E(T)$ is the maximum cumulative execution requirement arising from a sequence of vertices on any path from the source to the sink vertex of the task graph $T$ (see [4] for details). The schedulability analysis algorithm therefore involves two steps:
1) computing $T.\mathrm{dbf}(t)$ for all $t \le t_{\max}$ and $T \in \mathcal{T}$, and
2) checking that $\sum_{T \in \mathcal{T}} T.\mathrm{dbf}(t) \le t$ for all $0 < t \le t_{\max}$.
For the recurring real-time task model, it turns out that computing $T.\mathrm{dbf}(t)$ for any $t$ is NP-hard (see [7]) and therefore forms the computationally intensive kernel of the schedulability analysis algorithm. In what follows, we outline a dynamic programming (DP) based algorithm for computing $T.\mathrm{dbf}(t)$ for any task graph $T$ and time interval length $t$. For details on this algorithm, we refer the reader to [8]. In Section IV, we describe our approach to reformulate this algorithm in order to implement it effectively using CUDA.
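Once the demand-bound functions are available, the checking step of the analysis is a simple loop. The Python sketch below is a minimal illustration under stated assumptions: the name `is_schedulable` is ours, interval lengths are restricted to integers (reasonable when all task parameters are integral), and the two example task sets use made-up step-shaped demand-bound functions rather than ones derived from real task graphs.

```python
def is_schedulable(dbfs, t_max):
    """Demand-bound test over integer interval lengths 1..t_max.

    dbfs is a list of callables, one demand-bound function per task.
    """
    for t in range(1, t_max + 1):
        if sum(dbf(t) for dbf in dbfs) > t:
            return False  # total demand exceeds the available time
    return True


# Hypothetical step-shaped demand-bound functions, for illustration only.
light = [lambda t: 2 * (t // 5), lambda t: 3 * (t // 10)]
heavy = light + [lambda t: 4 * (t // 5)]

print(is_schedulable(light, 100))
print(is_schedulable(heavy, 100))
```

In the `heavy` set the demand at interval length 5 is already 6, so the test fails at the first multiple of 5. The expensive part of the analysis is not this loop but the evaluation of each demand-bound function.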
B. Computing the demand-bound function
In this section we present a dynamic programming algorithm for computing the demand-bound function $T.\mathrm{dbf}(t)$ for any task graph $T$. For any task graph $T$, computing the value of $T.\mathrm{dbf}(t)$ for some (large) value of $t \le t_{\max}$ might involve multiple traversals (loops) through the task graph. It was shown in [4] that if, for a task graph $T$, $T.\mathrm{dbf}(t)$ is known for all "small values" of $t$, then it is possible to calculate, from these, the value of $T.\mathrm{dbf}(t)$ for any $t$. "Small values" of $t$ for a task graph $T$ are those for which the sequence of vertices that contribute towards computing $T.\mathrm{dbf}(t)$ contains the source vertex at most once. The value of $T.\mathrm{dbf}(t)$ for larger values of $t$ is made up of some multiple of $E(T)$ plus $T.\mathrm{dbf}(t')$, where $t'$ is "small" in the sense described above. It follows that $T.\mathrm{dbf}(t)$ for any $t$ can be computed from these quantities; for a more detailed description, refer to [4].
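To make the decomposition into "a multiple of $E(T)$ plus a small-value term" concrete, the following Python sketch extends a table of small-value results to arbitrary interval lengths. It is a sketch under assumptions: the two-candidate form of the relation used here is our assumption about how the decomposition is applied (the precise statement is in [4]), the names `dbf_any` and `small` are hypothetical, and the example table describes a made-up single-vertex task with execution requirement 5, deadline 4 and period 10.

```python
def dbf_any(t, small, E, P):
    """Extend a table of small-value demand bounds to any integer t.

    small[x] holds the demand bound for 0 <= x < 2 * P (assumed precomputed),
    E is the maximum demand of one source-to-sink traversal, P is the period.
    """
    if t < 2 * P:
        return small[t]
    k, r = divmod(t, P)
    # Assumed relation: either k full traversals plus a remainder shorter than P,
    # or k - 1 full traversals plus a remainder between P and 2 * P.
    return max(k * E + small[r], (k - 1) * E + small[P + r])


# Hypothetical single-vertex task: e = 5, d = 4, period 10.
small = [0] * 4 + [5] * 10 + [10] * 6
print(dbf_any(23, small, E=5, P=10))
print(dbf_any(24, small, E=5, P=10))
```

For this periodic example the results can be cross-checked by hand: an interval of length 23 can contain two complete jobs, and an interval of length 24 can contain three.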
To compute $T.\mathrm{dbf}(t)$ for "small" values of $t$, [4] constructs a new task graph by taking two copies of the task graph of $T$, adding an edge from the sink vertex of the first graph to the source vertex of the second, and finally replacing the source vertex of the first with a "dummy" vertex with execution requirement and deadline equal to zero. The intertriggering separations on all edges outgoing from this source vertex are also made equal to zero. $T.\mathrm{dbf}(t)$ for all values of $t$ is then calculated by enumerating all possible paths in this new graph. For arbitrary task graphs, this incurs a computation time which is exponential in the number of vertices in the task graph.
We first outline an algorithm for computing the demand-bound function of a task graph $T$ for "small values" of $t$. Using this, we then compute the demand-bound function for any value of $t$ as explained above.
Given a task graph $T$, let $T'$ denote the graph formed by joining two copies of $T$ by adding an edge from the sink vertex of the first graph to the source vertex of the second and replacing the source vertex of the first copy by a "dummy" vertex. The newly added edge is also labeled with an intertriggering separation. Now we give a pseudo-polynomial algorithm, based on dynamic programming, for computing $T'.\mathrm{dbf}(t)$ for values of $t$ that do not involve any looping through $T'$, i.e., we consider only "one-shot" executions of $T'$. Let there be $n$ vertices in $T'$ denoted by $v_1, v_2, \ldots, v_n$, and without any loss of generality we assume that there can be a directed edge from $v_i$ to $v_j$ only if $i < j$. Following our notation above, associated with each vertex $v_i$ is its execution requirement $e(v_i)$, which here is assumed to be integral, and its deadline $d(v_i)$. Associated with each edge $(v_i, v_j)$ is the minimum intertriggering separation $p(v_i, v_j)$.
Let $t_{i,e}$ be the minimum time interval within which the task $T'$ can have an execution requirement of exactly $e$ time units due to some legal triggering sequence, considering only a subset of vertices from the set $\{v_1, v_2, \ldots, v_i\}$.
III. CUDA
In this section, we provide a brief description of CUDA [15]. CUDA abstracts the GPU as a powerful multi-threaded coprocessor capable of accelerating data-parallel, computationally intense operations. The data-parallel operations, which are similar computations performed on streams of data, are referred to as kernels. Essentially, with its programming model and hardware model, CUDA makes the GPU an efficient streaming platform. Below, we discuss CUDA's programming and hardware model, followed by a short insight into the memory access latencies of the GPU.
A. Programming Model
In CUDA, threads execute the data-parallel computations of the kernel and are clustered into blocks of threads referred to as thread blocks. These thread blocks are further clustered into grids. During implementation, the designer can configure the number of threads that constitute a block as well as the number of blocks that constitute a grid. Each thread inside a block has its own registers and local memory. The threads in the same block can communicate with each other through a memory space shared among all the threads in the block and referred to as Shared Memory. However, explicit communication and synchronization between threads belonging to different blocks is only possible through GPU-DRAM. GPU-DRAM is the dedicated DRAM of the GPU, separate from the DRAM of the CPU. It is divided into Global Memory, Constant Memory and Texture Memory. We note that the Constant and Texture Memory spaces are read-only regions whereas Global Memory is a read-write region. Figure 3 illustrates the CUDA programming model described above. Note that, in contrast to the GPU-DRAM, the Shared Memory region is an on-chip memory space.
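The grid/block/thread hierarchy can be illustrated with a small sequential emulation. The Python sketch below is not CUDA code and does not run on a GPU; the names `launch` and `square_kernel` are hypothetical, and the per-block dictionary merely stands in for Shared Memory. It shows how each thread derives a global index from its block index and its index within the block, which is the usual way a kernel selects the data element it works on.

```python
def launch(kernel, num_cells, block_dim, *args):
    """Sequentially emulate a one-dimensional grid of thread blocks."""
    grid_dim = (num_cells + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(grid_dim):
        shared = {}  # stands in for the Shared Memory of this block
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, shared, *args)
    return grid_dim


def square_kernel(block_idx, block_dim, thread_idx, shared, out):
    g = block_idx * block_dim + thread_idx  # global thread index
    if g < len(out):  # threads past the end of the data do nothing
        out[g] = g * g


out = [0] * 10
blocks = launch(square_kernel, len(out), 4, out)
print(blocks, out)
```

With 10 data elements and 4 threads per block, three blocks are needed and the last two threads of the final block stay idle.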
B. Hardware Model
The CUDA hardware architecture is implemented as a set of SIMD (Single-Instruction-Multiple-Data) multiprocessors with on-chip memory. Each of these multiprocessors also contains a set of registers. The thread blocks (described in the previous subsection) are executed on these multiprocessors such that each multiprocessor executes one or more thread blocks through time slicing. However, each thread block is processed by a single multiprocessor in order to facilitate communication between different threads in a block through on-chip memory. Thus, the on-chip memory of a multiprocessor forms the Shared Memory space of the thread block; its size is typically on the order of kilobytes.
C. Memory Access Latencies
A typical memory instruction in CUDA issued by a multiprocessor consumes 4 clock cycles. However, depending upon the memory space in which the accessed location resides, there will be additional latencies. If the memory location resides in GPU-DRAM, i.e., in the Global, Texture or Constant Memory spaces, the memory instruction consumes an additional 400 to 600 cycles. On the other hand, if the memory location resides on-chip in the registers or Shared Memory, then there will be almost no additional latency in the absence of memory access conflicts. The additional GPU-DRAM latencies might offset the speedups that can be achieved through parallelization, and hence the on-chip Shared Memory must be judiciously exploited.
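A back-of-the-envelope calculation shows the size of this gap. The Python sketch below uses only the cycle counts quoted above; the access count of 1000 is an arbitrary illustrative figure, and the model is deliberately crude (it treats on-chip accesses as having zero additional latency and ignores any overlap of memory accesses with computation in other threads).

```python
ISSUE_CYCLES = 4          # cycles to issue a memory instruction
DRAM_EXTRA = (400, 600)   # additional cycles for a GPU-DRAM access (low, high)


def access_cycles(n, extra):
    """Total cycles for n serialized memory accesses with a given extra latency."""
    return n * (ISSUE_CYCLES + extra)


n = 1000  # hypothetical number of accesses
on_chip = access_cycles(n, 0)
dram_low = access_cycles(n, DRAM_EXTRA[0])
dram_high = access_cycles(n, DRAM_EXTRA[1])
print(on_chip, dram_low, dram_high)
print(dram_low // on_chip, dram_high // on_chip)
```

Under these simplifying assumptions, data that is read repeatedly costs roughly two orders of magnitude more cycles when it is fetched from GPU-DRAM every time than when it is served from on-chip memory.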
IV. SCHEDULABILITY ANALYSIS USING CUDA
In order to accelerate the schedulability analysis algorithm (Algorithm 1) described in Section II using CUDA, there are two broad challenges. Firstly, we need to identify and isolate the data-parallel computations of the algorithm so that they may be compiled as kernels. These kernels must then be mapped to CUDA thread blocks. Secondly, one has to efficiently exploit the on-chip Shared Memory to enhance the achievable speedups. In light of these two challenges, we now provide a systematic implementation of Algorithm 1.
As mentioned above, our first goal is to identify the data-parallel portions (kernels) that can be computed in a SIMD fashion using CUDA threads. The kernels must not have any data dependencies (on each other) because they will be executed by threads running in parallel. Towards this, we first identify the data dependencies in Algorithm 1. Algorithm 1 (lines 6−10) essentially builds a dynamic programming (DP) matrix. The $(i+1)$-th row in the matrix corresponds to vertex $v_{i+1}$ of the task graph described in Section II. Each of the cells in row $i+1$ consists of the $t^{i+1}_{i+1,e}$ and $t_{i+1,e}$ values, where $e = 1, 2, \ldots, E$. According to Algorithm 1 (line 8), the computation of these values in the cells of the $(i+1)$-th row depends only on the values present in the previously computed rows. This implies that the values of the cells of the same row in the DP matrix can be computed independently of each other by using different CUDA threads in a SIMD fashion. Therefore, we segregate this task (lines 8 and 9 of Algorithm 1) as the kernel of our CUDA implementation.
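The row-by-row dependency structure can be seen in a sequential sketch of such a DP. The Python code below is not the paper's Algorithm 1: the recurrence is an assumed form, chosen only to be consistent with the quantities the text names (execution requirements, deadlines, and intertriggering separations of predecessor vertices), and the names `build_dp`, `dbf_from_row` and the three-vertex example graph are hypothetical. Vertices are numbered from 0 in topological order.

```python
import math

INF = math.inf


def build_dp(e, d, p, preds, max_demand):
    """Fill the DP matrix row by row.

    t_end[i][x]: shortest interval for a sequence ending at vertex i with demand exactly x.
    t_min[i][x]: shortest such interval over sequences using only vertices 0..i.
    """
    n = len(e)
    t_end = [[INF] * (max_demand + 1) for _ in range(n)]
    t_min = [[INF] * (max_demand + 1) for _ in range(n)]
    for i in range(n):                      # rows are filled one after another
        for x in range(1, max_demand + 1):  # cells of a row are independent: one per thread
            best = d[i] if x == e[i] else INF
            if x > e[i]:
                for j in preds[i]:          # only previously computed rows are read
                    cand = t_end[j][x - e[i]] - d[j] + p[(j, i)] + d[i]
                    best = min(best, cand)
            t_end[i][x] = best
            above = t_min[i - 1][x] if i > 0 else INF
            t_min[i][x] = min(above, best)
    return t_end, t_min


def dbf_from_row(row, t):
    """Largest demand whose shortest interval fits within length t."""
    return max((x for x, length in enumerate(row) if length <= t), default=0)


# Hypothetical graph: vertex 0 has edges to vertices 1 and 2.
e = [2, 3, 1]
d = [5, 4, 6]
p = {(0, 1): 10, (0, 2): 7}
preds = [[], [0], [0]]
t_end, t_min = build_dp(e, d, p, preds, max_demand=5)
print(t_min[2])
print(dbf_from_row(t_min[2], 14))
```

The inner loop over `x` never reads a cell of the row being written, which is the property that allows each cell of a row to be assigned to its own thread while the rows themselves are processed in sequence.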
We store the DP-matrix in the Global Memory space, i.e., in GPU-DRAM. Note that we use Global Memory instead of Constant or Texture Memory because Constant and Texture Memory are read-only regions. During the computation of our DP-matrix we need to perform both read operations (to fetch values from previously computed rows) and write operations (to update the DP-matrix with the values of the row computed in the current iteration), which can only be done explicitly with Global Memory. Also note that we have not used the on-chip Shared Memory for this purpose, because the size of the Shared Memory is typically quite small (see Section III) and the entire DP-matrix cannot fit into it.
However, the on-chip Shared Memory can be exploited to store other frequently accessed data structures. To identify such data structures, we once again focus on the kernel operations of our algorithm (lines 8 and 9 in Algorithm 1). We note that the computation of the $t^{i+1}_{i+1,e}$ and $t_{i+1,e}$ values of vertex $v_{i+1}$ (i.e., the $(i+1)$-th row of the DP-matrix) needs certain values from the vertices $v_1, v_2, \ldots, v_k$, where $v_{i+1}$ has an incoming edge from each of these vertices. Let us denote the tuple $\{p(v_j, v_{i+1}), d(v_j)\}$ as $\Omega_{i,j}$. Thus, from lines 8 and 9 of Algorithm 1, the computation of the $(i+1)$-th row of the DP-matrix requires the values $\Omega_{i,1}, \Omega_{i,2}, \ldots, \Omega_{i,k}$.
The set $\{\Omega_{i,1}, \Omega_{i,2}, \ldots, \Omega_{i,k}\}$ is essentially a subset of the overall specification of the task graph. Also, in iteration $i+1$ of computing the DP-matrix, this set of required data remains constant, i.e., information about the other parts of the task graph is not required. The set changes only at the next iteration, because that iteration corresponds to a different vertex which might have a different set of incoming edges. This observation provides an opportunity to significantly reduce the GPU-based execution times by loading the values $\{\Omega_{i,1}, \Omega_{i,2}, \ldots, \Omega_{i,k}\}$ into the on-chip Shared Memory at the beginning of each iteration. Compared to the DP-matrix, this set of values is much smaller and can fit into the on-chip Shared Memory. Figure 4 illustrates our scheme of prefetching the required data from Global Memory to Shared Memory at the start of each iteration. The figure shows a thread block (which consists of 32 threads) fetching the required data from the Global Memory at the $(i+1)$-th iteration.
We recall from Section III that the on-chip memory is shared only between the threads within a single block. Hence, configuring the thread blocks to an appropriate size is also important to effectively exploit the GPU on-chip memory. On one hand, if we choose a very small thread block size, then the computation of each row in our DP-matrix will involve a lot of thread blocks. However, only the threads within a thread block share the same chunk of on-chip memory. This implies that data will have to be transferred from Global Memory to Shared Memory for a large number of thread blocks, in spite of the fact that all the threads in a single iteration need the same data structures, $\{\Omega_{i,1}, \Omega_{i,2}, \ldots, \Omega_{i,k}\}$, as described above. This in turn diminishes the speedups that can be achieved. On the other hand, one can choose a very large thread block size, with a lot of threads in each block. In this case, each thread block will be executed on a single multiprocessor via time slicing, because a multiprocessor typically has around 8 processors while there are thousands of threads running in parallel. This time slicing will also inhibit the acceleration that can be achieved. To strike the right balance, we perform experiments with different thread block sizes. Note that we consider only thread block sizes that are powers of 2. Hence the design space for exploration is not huge and it is possible to find the right size in reasonable time. Moreover, beyond a threshold size the performance deteriorates or remains constant, and hence it is not necessary to experiment with thread blocks larger than this threshold.
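The first half of this trade-off can be quantified with a simple count. The Python sketch below is an illustration under assumptions: the name `prefetch_traffic` is hypothetical, the row width of 10,000 cells and the three predecessor tuples are arbitrary example figures, and each block is assumed to copy every required tuple from Global Memory exactly once per iteration. It does not model the time-slicing cost of large blocks.

```python
def prefetch_traffic(row_width, block_size, num_preds):
    """Blocks needed for one DP row, and tuples copied into Shared Memory in total."""
    blocks = -(-row_width // block_size)  # ceiling division
    return blocks, blocks * num_preds     # every block fetches its own copy


for block_size in (32, 64, 128, 256, 512):
    print(block_size, prefetch_traffic(10000, block_size, 3))
```

Going from 32 to 512 threads per block reduces the number of redundant copies of the same tuples by more than an order of magnitude in this example, which is consistent with the qualitative argument that very small blocks waste Global Memory bandwidth.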
V. EXPERIMENTAL RESULTS
For our experiments, we compared three different implementations of the schedulability analysis algorithm: on the CPU (as described in Algorithm 1), on the GPU without using Shared Memory, and on the GPU while exploiting Shared Memory (as described in Section IV). We randomly generated synthetic task graphs consisting of 10, 20, 30, 40 and 50 vertices respectively. The maximum possible execution time for a vertex was set to 10,000. It may be noted that the execution requirement associated with any vertex of a graph is expressed in terms of time units. Such time units depend on the application at hand and might denote milliseconds, microseconds, or even the number of clock cycles of the processor on which the task graphs are required to execute. Hence, experiments with maximum execution times like 10,000 are realistic.
All the experiments were conducted on a machine with a 3.0 GHz Intel Pentium 4 CPU and 1 GB RAM running Windows XP. The machine was equipped with an nVIDIA GeForce 8800 GTX GPU, which was used for our GPU-based implementations. The code was implemented and compiled in release mode in Microsoft Visual Studio 2005, which had CUDA Toolkit 1.1 integrated into it. In our experiments, we measured the execution time of the three different implementations for each of the task graphs. Table I presents the exact values of these measurements for task graphs of different sizes. The first column corresponds to the execution time on the CPU, referred to as 'CPU Time'. The second column corresponds to the execution time on the GPU without any optimizations involving exploitation of the Shared Memory space and is referred to as 'GPU Time'. The third column, referred to as 'GPU-SM Time', also corresponds to the execution time on the GPU but with the exploitation of Shared Memory as described in Section IV. Figure 5 provides a visual representation of these speedups. As mentioned in Section IV, we experimented with different sizes of thread blocks. We observed a performance enhancement (i.e., speedups) as we increased the thread block size up to 512. However, with larger thread blocks (1024 and 2048) there were no further speedups. Thus, 512 was the best size for the thread blocks, and the 'GPU-SM Time' reported here corresponds to experiments with thread block size 512. From our results, we observe that as we progressively increase the task graph size from 10 to 50, the GPU implementation without Shared Memory provides a speedup of 5.2× on average and a maximum of 6× compared to the CPU. The GPU implementation with Shared Memory, i.e., GPU-SM, attains a maximum speedup of 9× (for a task graph with 50 vertices) and an average speedup of 7.4× compared to the execution times on the CPU.
It is noteworthy that the GPU-SM implementation provides up to 1.86× acceleration compared to a GPU implementation without Shared Memory utilization.
From these results it is evident that as the size of the task graph increases, memory access latencies can degrade the speedups that can be attained by a GPU implementation. This can be noticed from the measurements corresponding to task graphs of sizes 10 and 50. In the case of the graph with 10 vertices, the GPU and GPU-SM implementations provide 4.2× and 5.3× speedups respectively. However, in the case of the graph with 50 vertices, these speedup values are 4.8× and 9× for GPU and GPU-SM, respectively. These numbers, therefore, indicate that for GPU-based implementations, exploiting on-chip Shared Memory is an effective method to overcome performance degradations caused by memory access latencies.
It may be noted that all our implementations were stand-alone C codes and they did not make use of any graphical interfaces for specifying the task graphs. Instead, the code was specifically optimized for running the schedulability analysis. In practice, a design tool supporting schedulability analysis would be more involved. More specifically, the task graphs might be integrated with other application-specific data structures that are not optimized for the schedulability analysis algorithm. In such cases, the speedups obtained by our GPU-based schedulability analysis might be considerably higher compared to the results reported here.
VI. CONCLUSION
In this paper, we presented a technique to implement a computationally expensive schedulability analysis algorithm on GPUs. Our proposed method can effectively utilize the on-chip shared memory, which was not possible with previously proposed techniques.
