Abstract-Modern real-time embedded systems often need to switch between operating modes to adapt to dynamically changing environments. The development of such real-time multi-modal systems fundamentally relies upon effective schedulability analysis. Recently, researchers have proposed serial schedulability analysis algorithms for multi-modal systems that account for mode changes at both the software level (e.g., changing the set of executing tasks) and the hardware level (e.g., changing the operating speed of a processor). However, these algorithms have high runtime complexity, which limits their practical use for schedulability analysis in system design-space exploration. In this paper, we design a parallel algorithm as an efficient solution to the problem of determining the schedulability of uniprocessor multi-modal real-time systems scheduled by EDF. By emphasizing a balanced workload distribution and restricting the number of synchronizations, our parallel algorithm achieves a near-perfect speedup observable both theoretically and experimentally. Experimental results show that the runtime of our parallel algorithm is very low even for systems with a large number of modes, making it a tractable choice for design-space exploration of real-time multi-modal systems.
I. INTRODUCTION
Embedded system designs increasingly use multiple operating modes to allow a higher degree of dynamism. GPS receivers, smart phones, and wireless sensors use multi-mode execution to prolong battery life. These modes may be implemented in software, hardware, or both. Most modern processors have Dynamic Power Management (DPM) capabilities such as clock throttling, clock gating, or dynamic voltage and frequency scaling that provide multiple hardware execution modes. At the same time, a software system may change its timing requirements to adapt to different resource constraints (e.g., an adaptive video-streaming application adjusts its processing rate and decoding quality based upon processing/network availability). Schedulability analysis is an important design step for ensuring predictable timing behavior of real-time systems. Unfortunately, most prior research on multi-modal schedulability analysis focuses primarily on software mode changes, which makes it unsuitable for real-time control systems that require support for both hardware and software mode changes (e.g., the real-time thermal-control system by Hettiarachchi et al. [8]). For such real-time control systems, a larger number of modes is typically desirable to permit greater adaptability in unfavorable environments. Checking schedulability of multi-modal systems requires more computation time because of dependencies (due to transitions) between modes; therefore, a large number of modes poses a computational challenge for existing sequential schedulability analysis techniques. Thus, parallel schedulability analysis is a promising and practical alternative to traditional techniques.
An efficient parallel schedulability analysis can also significantly reduce the time for design-space exploration [22], which may use schedulability tests to determine optimized resource parameters of a multi-mode system. Schedulability analysis using parallel algorithms is a relatively unexplored area for multi-modal real-time applications. For uni-modal systems, there are solutions with well-defined sets of conditions where each condition must pass a set of test cases. In most scenarios, these test cases can be evaluated independently. From the perspective of parallel computing, this independence makes the problem of uni-modal schedulability less challenging. However, the schedulability analysis of multi-modal real-time systems is complex; the analysis depends not only on each mode itself, but also on the schedulability of all other modes along with the mode-change sequences. The recently developed pseudo-polynomial schedulability analysis by Fisher and Ahmed [6] may be a workable solution for checking the schedulability of a system (given a set of system parameters). However, a sequential implementation of their analysis may not scale sufficiently to be used as an effective tool for determining optimal system parameters. In this paper, we address a fundamental gap in the research literature on parallel schedulability analysis algorithms suitable for design-space exploration (e.g., minimizing the total aggregate hardware resources over all modes) for real-time multi-modal systems.
Contribution & Organization. Our proposed solution is the first non-trivial parallel schedulability analysis algorithm for real-time multi-mode systems. We provide a parallel algorithm for evaluating the schedulability conditions developed by Fisher and Ahmed [6]. Our contributions are as follows:
• To achieve an ideal workload distribution, we adopt the most suitable and relatively simple workload distribution policy considering the nature of the conditions (Section IV). A balanced distribution of workload (without introducing overhead) is the key to achieving better speedup.
• To achieve a near-ideal speed-up, we design the algorithm such that the communication and synchronization overhead is minimized.
• To characterize the effectiveness of the proposed algorithm, we determine parallel performance metrics (e.g., speedup, efficiency, and cost) and derive conditions to obtain better speedup (Section V).
• To substantiate the performance of the proposed algorithm, we perform experiments upon a cluster of AMD Opteron computers (Section VI). We obtain high parallel efficiency (over 90%) which establishes that the proposed algorithm can be used as an efficient schedulability test for design-space exploration of real-time multi-mode systems.
II. RELATED WORK
Research on uniprocessor multi-mode real-time systems can be divided into two categories: fixed-priority scheduling [11], [13], [20] and EDF scheduling [12], [17]. Tindell et al. [21] introduced a simple protocol where newly-added tasks wait after a mode change until the processor completes the old-mode tasks. In another publication, Tindell et al. [20] defined equations for calculating the waiting time after which a new-mode task can generate jobs. Pedro and Burns [11] and Real and Crespo [13] explored protocols where old-mode tasks may execute concurrently with new-mode tasks. Mechanisms such as constant-bandwidth server (CBS) [1], sporadic server [16], periodic resource model [15], elastic scheduling [4], and rate-based earliest deadline (RBED) [3] also permit a system to change its software/hardware requirements adaptively; however, each of these server-based mechanisms does not guarantee deadlines during transitions between modes. An adaptive hard-real-time extension of CBS, called variable-bandwidth server (VBS) [9], has been developed; however, VBS does not consider hardware-level mode changes.
Recently, real-time calculus (RTC) [19] has been increasingly used to analyze multi-mode real-time systems. Using RTC, Stoimenov et al. [17], [18] and Santinelli et al. [14] investigated schedulability of hardware modes and software modes separately; however, scenarios where both hardware and software must change modes were not addressed. One set of recent results by Phan et al. [12] has addressed multiple hardware and software-level modes and proposed a sequential algorithm for schedulability analysis. However, the proposed algorithm requires a reachability search of a state-space graph and does not scale efficiently with the number of modes. Fisher and Ahmed [6] developed invariants that a schedulable multi-mode system must satisfy; however, the proposed solution still requires pseudo-polynomial time. We exploit these schedulability conditions to develop an efficient parallel schedulability analysis of a multi-modal real-time system.
III. MODELS & DEFINITIONS
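We consider a multi-mode system M with q modes. Each mode is scheduled by the Earliest-Deadline-First (EDF) scheduling algorithm. A mode $M^{(i)}$, where $i \in \{1, \ldots, q\}$ is the mode index, is associated with a real-time workload and the minimum processor execution guaranteed by the processing resource. The real-time workload $\tau^{(i)}$ of mode $M^{(i)}$ is modeled by the sporadic task model [10], and the processing resource that guarantees the real-time execution of $\tau^{(i)}$ is modeled by the explicit-deadline periodic (EDP) resource model [5]. Throughout this paper, we assume that each system parameter is an integer.
Sporadic Task Model. A sporadic task system $\tau^{(i)} = \{\tau_1^{(i)}, \ldots, \tau_{n_i}^{(i)}\}$ is a collection of $n_i$ sporadic tasks where each task $\tau_\ell^{(i)} = (e_\ell^{(i)}, d_\ell^{(i)}, p_\ell^{(i)})$ is characterized by a worst-case execution requirement $e_\ell^{(i)}$, a relative deadline $d_\ell^{(i)}$, and a minimum inter-arrival separation $p_\ell^{(i)}$; each job of $\tau_\ell^{(i)}$ must complete within $d_\ell^{(i)}$ time units after its arrival. We consider constrained deadline tasks; that is, $d_\ell^{(i)} \leq p_\ell^{(i)}$. The task utilization is $u_\ell^{(i)} = e_\ell^{(i)} / p_\ell^{(i)}$ and the task system utilization is $U^{(i)} = \sum_{\ell=1}^{n_i} u_\ell^{(i)}$.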
To quantify the maximum workload over an interval of length t > 0, we consider the demand-bound function $\mathrm{dbf}(\tau_\ell^{(i)}, t)$, which quantifies the maximum cumulative execution requirement of all jobs of $\tau_\ell^{(i)}$ that could have both their arrival time and their deadline in any interval of length t. Baruah et al. [2] have shown that, for a sporadic task $\tau_\ell^{(i)}$ with execution requirement $e_\ell^{(i)}$, relative deadline $d_\ell^{(i)}$, and minimum inter-arrival separation $p_\ell^{(i)}$, the function can be calculated as follows:

$$\mathrm{dbf}(\tau_\ell^{(i)}, t) = \max\left(0, \left\lfloor \frac{t - d_\ell^{(i)}}{p_\ell^{(i)}} \right\rfloor + 1\right) \cdot e_\ell^{(i)} \qquad (1)$$

In Figure 1, the dotted line depicts $\mathrm{dbf}(\tau_1^{(i)}, t)$. The horizontal axis represents the interval length and the vertical axis represents the execution requirement. We denote by $\mathrm{dbf}(\tau^{(i)}, t) = \sum_{\ell=1}^{n_i} \mathrm{dbf}(\tau_\ell^{(i)}, t)$ the demand-bound function of a sporadic task system $\tau^{(i)}$.

Fig. 2. Execution pattern of a multi-modal real-time system. The shaded areas indicate times during which tasks of each mode execute on the processor.

Explicit-Deadline Periodic (EDP) Resource Model. The EDP resource model [5], [15] is a general resource model for
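characterizing the execution of a system upon a periodically-available, non-continuously-executing resource. The hardware processing resource available to each mode $M^{(i)}$ is represented by an EDP resource $\Omega^{(i)} = (\Pi^{(i)}, \Theta^{(i)}, \Delta^{(i)})$, which provides $\Theta^{(i)}$ units of execution within $\Delta^{(i)}$ time units of the start of every period of length $\Pi^{(i)}$. The supply-bound function $\mathrm{sbf}(\Omega^{(i)}, t)$ quantifies the minimum execution that a task system is guaranteed to receive from $\Omega^{(i)}$ over any interval of length t ≥ 0. The solid line in Figure 1 presents $\mathrm{sbf}(\Omega^{(i)}, t)$ for $\Omega^{(i)}$. Easwaran et al. [5] have quantified $\mathrm{sbf}(\Omega^{(i)}, t)$ as follows:

$$\mathrm{sbf}(\Omega^{(i)}, t) = \begin{cases} y\,\Theta^{(i)} + \max\left(0,\ t - x - y\,\Pi^{(i)}\right), & \text{if } t \geq \Delta^{(i)} - \Theta^{(i)} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $y = \left\lfloor \frac{t - (\Delta^{(i)} - \Theta^{(i)})}{\Pi^{(i)}} \right\rfloor$ and $x = \Pi^{(i)} + \Delta^{(i)} - 2\Theta^{(i)}$.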
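The two bound functions can be evaluated directly with integer arithmetic. The following is a minimal Python sketch, assuming the standard forms of the demand-bound function of Baruah et al. [2] and the EDP supply-bound function of Easwaran et al. [5]; the class and field names (`SporadicTask`, `EDPResource`, `e`, `d`, `p`, `period`, `capacity`, `deadline`) are illustrative choices and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SporadicTask:
    e: int  # worst-case execution requirement (assumed name)
    d: int  # relative deadline, with d <= p (constrained deadline)
    p: int  # minimum inter-arrival separation


@dataclass(frozen=True)
class EDPResource:
    period: int    # Pi
    capacity: int  # Theta
    deadline: int  # Delta, with Theta <= Delta <= Pi


def dbf_task(task: SporadicTask, t: int) -> int:
    """Maximum demand of one sporadic task over any interval of length t."""
    if t < task.d:
        return 0
    return ((t - task.d) // task.p + 1) * task.e


def dbf(tasks, t: int) -> int:
    """Demand-bound function of a task system: sum over its tasks."""
    return sum(dbf_task(task, t) for task in tasks)


def sbf(res: EDPResource, t: int) -> int:
    """Minimum supply of an EDP resource over any interval of length t."""
    offset = res.deadline - res.capacity
    if t < offset:
        return 0
    y = (t - offset) // res.period
    x = res.period + res.deadline - 2 * res.capacity  # longest blackout
    return y * res.capacity + max(0, t - x - y * res.period)


if __name__ == "__main__":
    tasks = [SporadicTask(e=2, d=5, p=10), SporadicTask(e=1, d=3, p=4)]
    omega = EDPResource(period=10, capacity=4, deadline=8)
    for t in (5, 12, 15, 24):
        print(t, dbf(tasks, t), sbf(omega, t))
```

With these helpers, a single-mode check amounts to comparing `dbf(tasks, t)` against `sbf(omega, t)` for every t in the relevant testing set.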
Real-Time Mode Change Model. We now describe the discrete hardware/software real-time multi-mode model [6]. Each mode is specified by a three-tuple $\langle \tau^{(i)}, \Omega^{(i)}, N^{(i)} \rangle$ which respectively characterizes the real-time workload generated by a sporadic task system, the minimum processor execution guaranteed by an EDP resource, and the minimum mode duration in terms of the "number of resource periods" $N^{(i)}$. The interpretation of $N^{(i)}$ is that the system remains in mode $M^{(i)}$ for at least $N^{(i)}$ resource periods. A mode-change request $mcr_k = (M^{(i)}, M^{(j)}, t_k)$ is a request at time $t_k$ for the system to leave mode $M^{(i)}$ and to have mode $M^{(j)}$ executing after $t_k$ (where $i, j \in \{1, \ldots, q\}$). We assume that if i < j then $mcr_i$ occurs prior to $mcr_j$. Mode-change request
Tasks may be divided into groups based on their importance at the time of a mode-change request $mcr_k$. Some important tasks may need to continue to execute without being affected by the mode-change request (MCR). We call these tasks unchanged tasks, denoted by $\tau^{(ij)}$. Some less important tasks may be removed from the system immediately at the time of an MCR. We call these tasks aborted tasks and denote them by $\alpha^{(ij)}$. For some tasks, immediate termination may leave the system in an inconsistent state; we call such tasks finished tasks and characterize them as being members of the set $\tau^{(i)} \setminus (\tau^{(ij)} \cup \alpha^{(ij)})$. We allow a job from a finished task at the time of an MCR to complete its remaining execution.
In order to facilitate quick changes of modes, the system designer may allow a transition period, called the offset, which we denote by $\delta_{ij}$ (see Figure 2). During the transition period after $mcr_k$, only the jobs from unchanged tasks and the last generated job from the finished tasks are permitted to execute. At $t_k + \delta_{ij}$ and after, task system $\tau^{(j)}$ may generate and execute jobs along with any remaining execution of jobs from $\tau^{(i)}$. Finally, there may be some tasks that are common to both modes but have some properties changed; we treat these tasks as finished tasks in the old mode. During the transition period after $mcr_k$, the system designer may provision a different resource $\Omega^{(ij)} = (\Pi^{(ij)}, \Theta^{(ij)}, \Delta^{(ij)})$ to achieve a quick mode-change response. We assume that the offset $\delta_{ij}$ is some multiple of $\Pi^{(ij)}$. Given the above definitions, we may distinguish three phases with respect to a mode-change
, t k ) (and the previous request
and old-mode jobs of τ
In addition, unchanged tasks ($\tau^{(ij)}$) act independently of a mode-change request during $[t_{k-1} + \delta_{hi}, t_{k+1})$. The above task classifications follow the taxonomy by Real and Crespo [13]. In this paper, we restrict the mode-change requests and transition intervals to occur only at period boundaries of the EDP model. This assumption is natural for control systems where modes change only at sampling periods. That is, for any mode-change request
, and for any two successive mode-change requests
we require that
for some a ∈ N + , where a ≥ N (i) . We also assume that a non-aborted job may span no more than one mode-change request. In this paper, we consider the following problem.
EDF-Multi-Mode-Sched Problem: Given modes
, and aborted tasks $\alpha^{(ij)}$ for all $i, j \in \{1, \ldots, q\}$ ($i \neq j$), determine whether all jobs, under all legal job arrival sequences and all possible legal mode-change requests according to Equation 3, are EDF-schedulable. To solve this problem, Fisher and Ahmed [6] developed schedulability conditions which guarantee that the maximum demand over any interval never exceeds the minimum supply received over that interval. To take mode changes into account, mode-change DBF and SBF functions should be defined with respect to any
In the rest of the paper, we make use of an indicator function $\mu_{\geq 0}(x)$ which is zero if x < 0 and is one otherwise; we will also use the notation $(x)_+ \stackrel{\mathrm{def}}{=} \max(0, x)$.
A. Mode-Change SBF
Definition 1 (Pre-MCR-SBF): The function $\mathrm{sbf}_{prior}(M^{(i)}, t)$ is the minimum execution guaranteed by
Definition 3 (Post-MCR-SBF):
is the minimum execution guaranteed by Ω (ij) and
For each of the above defined functions, Fisher and Ahmed [6] derived upper bounds as follows:
where a
B. Mode Change DBF
Definition 4 (Carry-In Execution): The carry-in execution $ci(M^{(i)}, M^{(j)})$ from mode $M^{(i)}$ to any other mode $M^{(j)}$ at time $t_k$ is an upper bound on the maximum possible remaining execution (over any legal sequence of MCRs) of non-aborted jobs from mode $M^{(i)}$ for tasks $\tau^{(i)} \setminus \{\tau^{(ij)} \cup \alpha^{(ij)}\}$ at time $t_k + \delta_{ij}$ that arrive prior to $t_k$, plus the maximum total execution of unchanged tasks (i.e., $\tau^{(ij)}$) that have arrival before
, and $M^{(j)}$ ($i \neq j$), the function $ci_g$ is inductively defined as follows (according to Fisher and Ahmed [6]):
and
The convergence of the above sequence occurs at the smallest g ∈ N such that
represents the maximum carry-in from M (i) to M (j) if g − 1 mode changes have previously occurred. The maximum carry-in may be bounded by the total execution of jobs that are not aborted or are unchanged at the mode change from
The maximum carry-in is also bounded by the demand generated by jobs of M (i) accounting for the maximum carry-in from some previous mode change from
The functions F ij and Ψ (Equations 9 and 10) are used to calculate the demand "carried-in" from M (i) , as formalized in the next definition.
at time t k and φ ∈ R ≥0 is the maximum remaining execution (over any legal sequence of MCRs prior to t k ) of jobs of tasks τ (i) \ α (ij) that arrive prior to t k (or prior to t k + δ ij for τ (ij) tasks) and have deadlines in the interval
It may be shown [6] 
over any interval of length φ.
C. Schedulability Conditions
Using the definitions provided in the previous sub-sections, Fisher and Ahmed [6] developed schedulability conditions for multi-mode systems as follows. Over any possible (legal) sequence of mode-change requests, the system is EDF-schedulable if the following five conditions hold for any two distinct modes $M^{(i)}$ and $M^{(j)}$,
SC 2 :
SC 3 :
SC 5 :
, and $T_{ij}$ are each a finite set of consecutive positive integers starting from one. Each of these sets is commonly referred to as a testing set. We use the generic notation $SC_Z(i, j, \phi)$, where $Z \in \{1, \ldots, 5\}$, for the superscript of the testing sets. For example, $SC_1(i, \emptyset, \emptyset)$ is the superscript of $T^{SC_1(i)}$, which is the testing set of $SC_1$ in Equation 12. The last two parameters in this example have the value ∅ because they are not used by $SC_1$. The largest integer of each testing set $T^{SC_Z(i,j,\phi)}$ and $T_{ij}$ is finite and was determined by Fisher and Ahmed [6].
We now provide intuitive explanations for each condition of Equation 12. Before an EDF schedule misses a deadline, the processor is continuously busy; this interval is known as a busy interval. In a busy interval that ends in a deadline miss, the resource demand is greater than the processing supply. The five conditions rule out busy intervals in which resource demand exceeds supply, taking mode changes into account. Fisher and Ahmed [6] identified five different kinds of busy intervals with respect to a mode-change request, which are depicted in Figure 3. $SC_1$ ensures the schedulability of an individual mode. $SC_2$ and $SC_3$ ensure that an individual mode is schedulable along with the demand from unchanged tasks of the old mode after a mode change. $SC_4$ ensures schedulability during the transition period while accounting for the carry-in demand from the non-aborted jobs and the mode-change supply function. Similarly, $SC_5$ ensures schedulability after a transition. The last two conditions account for the carry-in from all past mode-change requests through the ci function while analyzing the demand of each individual mode.
IV. PARALLEL ALGORITHM
We design a parallel algorithm for solving the EDF-Multi-Mode-Sched Problem. We first determine the complexity of a serial algorithm for checking all five conditions in order to gauge the size of the problem to be parallelized. The runtime complexity depends on the total aggregate size of all the testing sets. Furthermore, conditions $SC_4$ and $SC_5$ require evaluating the ci function to account for the carry-in execution. Fisher and Ahmed [6] showed that $ci(M^{(i)}, M^{(j)})$, for all $i, j \in \{1, \ldots, q\}$, can be calculated in a finite number of iterations, which is bounded by the summation of the execution requirements of all tasks. We define C as the maximum of the following three values: 1) the summation of the execution requirements of all tasks over all modes, 2) the maximum transition period (i.e., the maximum $\delta_{ij}$), and 3) the maximum of the sizes of the sets $T^{SC_Z(i,j,\phi)}$ and $T_{ij}$ for any i, j, φ, and Z. The calculation of $ci(M^{(i)}, M^{(j)})$ for each pair requires at most C inductive computations of $ci_g$, where each such computation invokes the $\Psi_x$ function at most C times. As there are q(q − 1) pairs of modes, a serial function for calculating the carry-in for all pairs requires $O(q^2 n C^2)$ time, where n is $\max_{i=1}^{q}\{n_i\}$; the term n is due to the calculation of demand (dbf) for every testing set element. The complexity of checking all five conditions is dominated by the complexity of checking condition $SC_5$, which is also $O(q^2 n C^2)$; therefore, the complexity of the serial schedulability analysis is $O(q^2 n C^2)$. This pseudo-polynomial complexity could be quite large, as C is potentially exponential in the size of the representation of the multi-modal system. Thus, it is desirable to decrease the analysis time by parallelizing the schedulability analysis.
A. Parallel Platform
We consider a parallel message-passing system composed of m identical processors, $P = \{P_1, P_2, \ldots, P_m\}$, where the subscript $i \in \{1, \ldots, m\}$ of each processor $P_i$ denotes its unique identifier (frequently called its rank) in the platform. The design of our parallel algorithm follows the data-parallel model, in which the total workload (testing-set elements) is statically mapped onto processors and each processor performs similar operations on different testing-set elements. For communication/synchronization, we use the parallel message-passing construct All-to-All-Reduction [7], in which all processors simultaneously participate in a communication/synchronization operation. The All-to-All-Reduction uses an associative operator (e.g., MAX, SUM, OR) to accumulate and combine the data from the buffer of each processor into a single piece of data, which is then replicated at all processors.
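The semantics of the reduction can be illustrated without any message passing. The sketch below is a sequential Python stand-in, assuming each list entry plays the role of one processor's local buffer; the function name `all_to_all_reduce` is hypothetical. In an MPI setting the same role is played by the Allreduce collective.

```python
from functools import reduce
import operator


def all_to_all_reduce(local_values, op):
    """Combine one value per processor with an associative operator
    and hand the combined result back to every processor."""
    combined = reduce(op, local_values)
    return [combined] * len(local_values)


if __name__ == "__main__":
    # One entry per simulated processor.
    print(all_to_all_reduce([True, False, True], operator.and_))
    print(all_to_all_reduce([3, 7, 5], max))
    print(all_to_all_reduce([False, True, False], operator.or_))
```

Because every processor ends up with the same combined value, each one can decide locally whether to continue or stop, with no further coordination.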
The performance of a parallel algorithm depends heavily on the underlying workload distribution policy. Balanced distribution of workload along with the minimal overhead due to communication/synchronization is indispensable to reduce the parallel execution time. In the next sub-section, we describe the workload distribution that allows us to obtain a completely balanced workload distribution without any communication/synchronization overhead. Then, we present the parallel algorithm for schedulability analysis and finally characterize its theoretical performance.
B. Workload Distribution
We develop policies to distribute the elements of each testing set $T^{SC_Z(i,j,\phi)}$ among the processors for the parallel algorithm. Our approach emphasizes a balanced distribution of workload among the processors to obtain a near-ideal speedup. Since the workload for the parallel algorithm is entirely dependent upon the testing sets for the schedulability conditions of Equation 12, a naive approach is to assign the elements of each testing set $T^{SC_Z(i,j,\phi)}$ in a round-robin fashion to the processors; i.e., processor $P_1$ would test schedulability condition $SC_Z$ for the first testing-set element, processor $P_2$ would test $SC_Z$ for the second testing-set element, and so on. In general, after evaluating a testing-set element, processor $P_k$ skips the next m − 1 testing-set elements, which implies that each processor works on a single element among every m consecutive members of $T^{SC_Z(i,j,\phi)}$. Thus, the total number of elements to be checked by each processor is at least

$$\left\lfloor \frac{|T^{SC_Z(i,j,\phi)}|}{m} \right\rfloor$$

and the maximum difference in workload between two processors is one testing element. Now consider the next testing set; again, if the first processor $P_1$ tests the first element, then, in the worst case, this processor may receive one more testing-set element than the other processors. Therefore, at the end of checking condition $SC_Z(i, j, \phi)$ of Equation 12, there may be a difference in workload of one testing element among processors. At the end of the execution, this difference could be equal to the total number of testing sets for all five conditions (which is on the order of $Cq^2$). This uneven workload will reduce the speedup of the parallel algorithm. Thus, in our approach we do not always let the first processor $P_1$ test the first element of the testing set; rather, we keep track of the processor $P_k$ that tests the last element of the previous testing set. Then, we let the next processor $P_{k+1}$ test the first element of the current testing set. To support the equal distribution of testing-set elements, each processor maintains a root distribution variable $r_k$, where $r_k \in \{1, 2, \ldots, m\}$. The variable $r_k$ indicates the starting element for $P_k$ in the next testing set. The set below represents the subset of elements of $T^{SC_Z(i,j,\phi)}$ for which $P_k$ is responsible for testing the condition $SC_Z$ while $P_k$'s root variable is $r_k$.

$$T^{SC_Z(i,j,\phi)}_{k,r_k} \stackrel{\mathrm{def}}{=} \left\{ x \in T^{SC_Z(i,j,\phi)} \;\middle|\; x = r_k + a \cdot m,\ a \in \{0, 1, 2, \ldots\} \right\} \qquad (13)$$
We now consider how to update the root variables to ensure a completely-balanced distribution of the testing-set elements. For processor $P_k$ with root variable $r_k$, we can determine the processor $P_\ell$ that has $r_\ell$ equal to one for some $\ell \in \{1, \ldots, m\}$. The expression $\ell \stackrel{\mathrm{def}}{=} ((k - r_k) \bmod m) + 1$ identifies the rank of this processor. When the elements of $T^{SC_Z(i,j,\phi)}$ are distributed in a round-robin fashion (as described in Equation 13), the processor that follows the one receiving the last element has rank $\ell' \stackrel{\mathrm{def}}{=} \left(\left(\ell + |T^{SC_Z(i,j,\phi)}| - 1\right) \bmod m\right) + 1$. Rank $\ell'$ identifies the processor that will receive the first element in the next testing-set distribution. Thus, for any other processor $P_k$ to determine its new root variable, we must calculate $((k - \ell') \bmod m) + 1$. Thus, we must use the following update rule:

$$r_k \leftarrow ((k - \ell') \bmod m) + 1 \qquad (14)$$
After distributing the entire workload of all testing sets according to this rule, the difference between any two processors with respect to the number of testing set elements assigned is at most one. In addition to a completely-balanced workload distribution, we observe that the testing set elements do not need to be distributed (via communication or initialization) to the processors. In fact, since the testing set simply consists of consecutive integers, each processor independently generates testing set elements as needed, according to the set defined in Equation 13 . Thus, the proposed distribution eliminates the communication overhead due to the workload distribution.
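The effect of rotating the root can be checked with a small sequential simulation. The following Python sketch assumes the round-robin set of Equation 13 and a root update of the form described above (find the rank holding root one, advance it by the testing-set size, and recompute each root relative to it); the function names and the example testing-set sizes are illustrative only.

```python
def local_elements(root, set_size, m):
    """Elements of {1, ..., set_size} assigned to a processor whose
    root variable is `root`: root, root + m, root + 2m, ..."""
    return list(range(root, set_size + 1, m))


def next_root(k, root, set_size, m):
    """Root of processor k (1-based rank) for the next testing set."""
    first = (k - root) % m + 1                 # rank whose root is 1
    nxt = (first + set_size - 1) % m + 1       # rank that starts next set
    return (k - nxt) % m + 1


def distribute(set_sizes, m, rotate=True):
    """Count how many testing-set elements each processor receives."""
    roots = list(range(1, m + 1))              # r_k initialised to rank k
    counts = [0] * m
    for size in set_sizes:
        for k in range(1, m + 1):
            counts[k - 1] += len(local_elements(roots[k - 1], size, m))
            if rotate:
                roots[k - 1] = next_root(k, roots[k - 1], size, m)
    return counts


if __name__ == "__main__":
    sizes = [5, 3, 7, 1, 1, 1]                 # hypothetical testing sets
    print(distribute(sizes, 4, rotate=True))   # rotating root
    print(distribute(sizes, 4, rotate=False))  # naive: P_1 always first
```

In this example the rotating root keeps every pair of processors within one element of each other, whereas always starting at the first processor lets the imbalance grow with the number of testing sets.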
C. Algorithm Description
We now present the pseudocode for ParallelSA, our proposed parallel algorithm for schedulability analysis, in Algorithm 1. The algorithm is designed to run concurrently on all available processors. ParallelSA uses two subroutines, Validate and MaxCarry. The algorithm starts with the initialization of the parallel execution. The rank of the processor (denoted by k) and the total number of processors are determined at this point. A data distribution root $r_k$, associated with each processor $P_k$, is initialized to the unique rank k of the processor in the platform. The algorithm then checks each testing set $T^{SC_Z(i,j,\phi)}$ of condition $SC_Z(i, j, \phi)$ for all legal values of i, j, and Z using the function Validate($r_k$, Z, i, j, φ, $C_i$) at each processor $P_k$. If Validate returns true for the current testing set, the algorithm proceeds to the next testing set; otherwise, the algorithm returns false; that is, the multi-mode real-time system M is not schedulable.
Algorithm 1 ParallelSA(M)
 1: {Processor k executes:}
 2: r_k ← Initialize()
 3: for i = 1 to q do
 4:   if Validate(r_k, 1, i, ∅, ∅, ∅) = false then
 5:     return false
 6:   end if
 7:   for j = 1 to q (j ≠ i) do
 8:     if Validate(r_k, 3, i, j, ∅, ∅) = false then
 9:       return false
10:     end if
11:     for s = 0 to δ_ij do
12:       if Validate(r_k, 2, i, j, s, ∅) = false then
13:         return false
14:       end if
15:     end for
16:   end for
17: end for
18: ζ ← MaxCarry(M, k, r_k)
19: for i = 1 to q do
20:   C_i ← max_{h ∈ {1,...,q}, h ≠ i} {ζ_hi}
21:   for j = 1 to q (j ≠ i) do
22:     for φ = 0 to δ_ij do

The evaluation of inequalities related to condition $SC_1$ of Equation 12 is performed in Lines 4 to 6. The condition $SC_1$ is evaluated for each of the q different modes. The remaining four conditions of Equation 12 are defined for pairs of modes; therefore, we use two nested for-loops to iterate through the testing sets associated with each such pair of modes. However, we separate the code segment related to conditions $SC_4$ and $SC_5$ (Lines 19 to 33) from the rest, as these two conditions require pre-computed carry-in executions calculated by the MaxCarry function (Line 18). The function MaxCarry could potentially be invoked at the beginning of the algorithm; in that case, all five conditions could be evaluated using a single block of nested loops. However, MaxCarry is a costly operation, and we invoke it only if it is required. An unschedulable system may already fail one of the first three conditions ($SC_1$, $SC_2$, or $SC_3$); for such systems there is no need to invoke the costly MaxCarry.
The algorithm Validate is a case-based implementation for evaluating each condition $SC_Z(i, j, \phi)$ using the condition variable Z. Depending on the value of $Z \in \{1, 2, 3, 4, 5\}$, the function selects the appropriate schedulability condition for each testing-set element $x \in T^{SC_Z(i,j,\phi)}_{k,r_k}$. The per-processor testing set $T^{SC_Z(i,j,\phi)}_{k,r_k}$ is decided by its current data-distribution root $r_k$. As shown in Equation 13, this is an ordered set of positive integers starting from $r_k$ and evenly separated by m; therefore, we allow each processor to generate the data set associated with each inequality on its own, which reduces the overhead related to data distribution. After validating the testing set, the data distribution root $r_k$ at processor $P_k$ is updated using Equation 14. The function Validate synchronizes schedulability results with all other executing processors using an All-to-All-Reduce operation with AND as the reduction operator. Validate returns false even if there is a single violation.

Algorithm 2 Validate(r_k, Z, i, j, φ, C_i)
 5:       result ← false; break
 6:     end if
 9:       result ← false; break
10:     end if
12:     if dbf(τ^(ij), x) > sbf(Ω^(ij), x) then
13:       result ← false; break
14:     end if
17:       result ← false; break
18:     end if
22:         result ← false; break
23:       end if
24:     end if
25:   end for
26: Update r_k using Equation 14.
27: return All-to-All-Reduce(result, AND)
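The control flow of a single Validate call (local loop with early exit, root update, AND-reduction) can be sketched sequentially. This Python fragment is an illustration under simplifying assumptions: the schedulability condition is abstracted into a caller-supplied predicate `condition_holds`, the m processors are simulated by a loop, the built-in `all` stands in for the AND-reduction, and the root update is written in a closed form that advances every root by the testing-set size modulo m. None of these names come from the paper.

```python
def validate(roots, set_size, m, condition_holds):
    """Simulate one Validate call over m processors.

    roots[k-1] is the root variable of processor k. Returns the
    common verdict and the updated roots for the next testing set.
    """
    local_results = []
    for k in range(1, m + 1):
        result = True
        # Processor k tests elements r_k, r_k + m, r_k + 2m, ...
        for x in range(roots[k - 1], set_size + 1, m):
            if not condition_holds(x):
                result = False
                break
        local_results.append(result)
    # Stand-in for All-to-All-Reduce(result, AND).
    verdict = all(local_results)
    # Closed-form rotation of the roots (assumed equivalent update).
    new_roots = [(r - 1 - set_size) % m + 1 for r in roots]
    return verdict, new_roots


if __name__ == "__main__":
    # Hypothetical condition that fails only at testing-set element 7.
    print(validate([1, 2, 3], 10, 3, lambda x: x != 7))
    print(validate([1, 2, 3], 10, 3, lambda x: True))
```

A single failing element on any one processor is enough for every processor to see a false verdict after the reduction, which matches the behaviour described for Validate.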
The function MaxCarry evaluates the sequence $ci_0(M^{(i)}, M^{(j)}), ci_1(M^{(i)}, M^{(j)}), \ldots$ for all pairs of modes and stores all the carry-in executions in a q × q matrix ζ. For all pairs, we calculate $ci_g(M^{(i)}, M^{(j)})$ at each step g from the value calculated at the (g − 1)-th step, and store the value in $\zeta_{ij}$ only if the new value is greater than the previous one. The function marks the change by setting the change flag to true. The newly calculated matrix ζ is synchronized using an All-to-All-Reduce operation with a MAX operator applied to each of the $q^2$ cells individually. The algorithm uses an All-to-All-Reduce operation with an OR operator to determine whether change was set to true by at least one processor. The function proceeds to the next step only if change is true after the synchronization. Otherwise, the function returns the current values stored in the q × q matrix ζ.
Algorithm 3 MaxCarry(M, k, r_k)
 8:     for j = 1 to q do
 9:       c ← 0
10:       for all x in T
13:       end for
14:       Update r_k using Equation 14.
17:       if min(c, E_ij) > ζ_ij then
18:         ζ_ij ← min(c, E_ij)
19:         change ← true
20:       end if
21:     end for
22:   end for
23:   All-to-All-Reduce(change, OR)
24:   All-to-All-Reduce(ζ, MAX)
25: until change = false
26: return ζ

Finally, if the execution of the algorithm reaches Line 34 of ParallelSA, we may safely declare that the multi-mode system M is EDF-schedulable.
V. PERFORMANCE METRICS
We investigate the asymptotic performance of our proposed parallel algorithm using well-known parallel performance metrics, which include parallel execution time, speedup, efficiency, and cost [7]. The parallel execution time, denoted by $T_m$, is the time elapsed between the start and the end of a parallel computation. The value of $T_m$ depends on the actual workload, the number of processors m, and the parallel overhead. $T_m$ generally decreases as m increases, although at a diminishing rate; moreover, the parallel overhead increases with m, so beyond a certain value of m, $T_m$ may not decrease noticeably. $T_m$ for our algorithm is given by

$$T_m = O\left(\frac{C^2 q^2 n}{m}\right) + O\left(C q^2 m\right) + O(n) \qquad (15)$$
The first term corresponds to the actual amount of work performed by each of the processors; it is obtained by dividing the total serial workload by the number of processors. The second term represents the overhead due to communication. The communication operation used in our algorithm is the All-to-All reduction among m processors, which has a complexity of O(mk), where k is the size of the message on which the reduction is performed [7]. Since the reduction operation is invoked O(C) times with a message size of $q^2$ (Line 24 of Algorithm 3), the overhead due to communication is $O(Cq^2 m)$. The third term represents the overhead due to workload imbalances; per the discussion after Equation 14, the difference between any two processors in the number of testing-set elements is at most one. This testing-set element requires O(n) time to evaluate any $SC_Z$.
The speedup S is defined as the ratio of the serial execution time $T_s$ of the best sequential algorithm to the parallel execution time; that is, $S \stackrel{\mathrm{def}}{=} T_s / T_m$. The perfect speedup for a parallel algorithm equals m, which may be difficult to achieve for algorithms that require communication/synchronization for correct operation. Due to overhead, S may decrease as m increases. The speedup of our algorithm is given by:

$$S = O\left(\frac{C^2 q^2 n}{\frac{C^2 q^2 n}{m} + C q^2 m + n}\right) \qquad (16)$$
Efficiency, denoted by $E$, measures the fraction of time for which the processors are usefully employed in solving the problem; it is defined as the ratio of the speedup to the number of processors, $E \stackrel{\text{def}}{=} S / m$. The efficiency accounts for the parallel overhead and usually decreases as $m$ increases. The efficiency of our proposed algorithm is given by:
$$E = \frac{O(C^2 q^2 n)}{O(C^2 q^2 n) + O(C q^2 m^2) + O(mn)}. \qquad (17)$$
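To see how these definitions behave, the following sketch evaluates speedup and efficiency on a unit-constant model. It assumes a serial time of $C^2 q^2 n$ and a parallel time made of the work, communication, and imbalance terms discussed above, with all hidden constants set to 1; the function names and the problem size are hypothetical.

```python
def model_speedup(C, q, n, m):
    """Speedup S = T_s / T_m under a unit-constant model of both times."""
    serial = C**2 * q**2 * n                    # assumed serial workload
    parallel = serial / m + C * q**2 * m + n    # work + communication + imbalance
    return serial / parallel


def model_efficiency(C, q, n, m):
    """Efficiency E = S / m."""
    return model_speedup(C, q, n, m) / m


if __name__ == "__main__":
    # Hypothetical problem size, chosen only for illustration.
    for m in (2, 8, 16, 24):
        s = model_speedup(C=1000, q=8, n=12, m=m)
        print(f"m={m:2d}  speedup={s:6.2f}  efficiency={s / m:.3f}")
```

Under these assumptions the speedup stays just below $m$ and the efficiency drifts downward as $m$ grows, which mirrors the qualitative statements in the text; the actual values depend on constants that the asymptotic analysis does not capture.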
As mentioned in the previous sections, the last two terms in the denominator of Equations 16 and 17 are due to the parallel overhead which increases with m. As long as the overhead is smaller than the time required to perform the actual computation, the parallel algorithm remains scalable. We now determine conditions to ensure the scalability of ParallelSA using the concept of cost and cost-optimality. The cost is the sum of the time that each processor spends solving the problem, including the time to perform the actual work and the overhead due to communication/synchronization. A parallel algorithm is cost-optimal [7] if the cost has the same growth as the execution time of the fastest known serial algorithm. The following theorem develops conditions to restrict the parallel overhead:
Theorem 1: ParallelSA is cost-optimal if $m = O(\sqrt{Cn})$.
Proof: By definition, the cost of a parallel algorithm is the product of the parallel execution time and the number of processors. The cost of our parallel algorithm is given by:
$$m\,T_m = O(C^2 q^2 n) + O(C q^2 m^2) + O(mn). \qquad (18)$$
For the problem considered in this paper, the execution time of the fastest known serial algorithm is $O(C^2 q^2 n)$. The first term of Equation 18 has the same growth as the execution time of the fastest serial algorithm, so our parallel algorithm is cost-optimal if the second and third terms grow no faster than $O(C^2 q^2 n)$; that is, $m^2 = O(Cn)$ (for the second term) and $m = O(C^2 q^2)$ (for the last term). The assumption $n \le C$ is reasonable, since $C$ is an upper bound on the execution time of tasks and each task has an execution time of at least one. Under this assumption, $\sqrt{Cn} \le C \le C^2 q^2$, so the first condition implies the second. The algorithm is therefore cost-optimal if $m = O(\sqrt{Cn})$; that is, as long as $m$ grows no faster than $\sqrt{Cn}$, ParallelSA remains cost-optimal, and the theorem follows.
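The two conditions used in the proof can be read off by dividing the cost by the serial running time. The step below is a sketch that drops the constants hidden by the O-notation and assumes the three cost terms are $C^2 q^2 n$, $C q^2 m^2$, and $mn$, as the conditions stated in the proof imply:

```latex
\[
\frac{m\,T_m}{T_s}
  = \frac{C^2 q^2 n + C q^2 m^2 + m n}{C^2 q^2 n}
  = 1 + \frac{m^2}{C n} + \frac{m}{C^2 q^2}.
\]
```

The ratio stays bounded by a constant exactly when both $m^2 = O(Cn)$ and $m = O(C^2 q^2)$ hold, which are the two conditions in the proof.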
From the analysis in the previous paragraph, it is evident that whenever $m$ grows no faster than $\sqrt{Cn}$, the overhead of the ParallelSA algorithm is bounded by $O(C^2 q^2 n)$. It can also be shown that the speedup of ParallelSA for a computationally large problem ($C \gg n$) is close to $m$ (near-perfect speedup), since the first term in the denominator of Equation 16 dominates for larger $C$.
VI. EXPERIMENTAL RESULTS
We performed experiments on a cluster of AMD Opteron computers that is part of the Wayne State University grid. Each computer has two 2.4 GHz dual-core processors and 4 or 16 GB of RAM. The computers are connected through Gbit Ethernet. We used MPICH-1.2.7 as the message passing interface implementation. Value ranges for the parameters of the multi-modal system are listed in Table I. We have previously established the efficacy of the schedulability analysis in [6] over the previous state-of-the-art (Phan et al. [12]); therefore, in this paper we measure the efficiency of ParallelSA.
For the simulation, we generate a set of 12 tasks from the parameters described in Table I. Of the generated tasks, three are unchanged tasks and two are aborted tasks. We select at least eight tasks from the set for each mode M (i). The resource parameter of a mode is set based on the parameters described in Table I. To evaluate performance, we considered multi-mode systems with a varying number of modes q ∈ {8, 12, 16, 20}. For each multi-mode system, the ParallelSA algorithm is executed at least five times to reduce the effect on the execution time of interference from other jobs in the grid, and we take the minimum execution time over these runs. While checking schedulability, the ParallelSA algorithm uses a total number of processors from the range [1, 24].
In Figure 4, we present the execution time of the ParallelSA algorithm for parallel systems with various numbers of processors. In this figure, the horizontal axis represents the total number of processors m, while the vertical axis represents the execution time. Clearly, ParallelSA requires a smaller execution time for a larger number of processors. Note that the decrease in execution time with a higher number of processors is not linear. This is due to the parallel overhead of our algorithm.
To calculate the overhead of the parallel execution, we consider the algorithm called SUBI (Schedulability Using Bounded Iteration), proposed by Fisher and Ahmed [6], as the best known serial algorithm for the problem. We execute this algorithm for each multi-mode system using the same hardware resource setting in the grid. The overhead $T_o$ is calculated using the formula $T_o = m T_m − T_s$, where $T_s$ is the execution time of SUBI. Figure 5 shows the overhead versus the number of processors used. Like most parallel algorithms, the parallel overhead of our proposed algorithm increases with q due to the increased communication/synchronization cost. However, the overhead does not increase noticeably beyond a certain limit on m. This limit depends on the number of modes (e.g., for q = 8, the limit is 20).
Figure 6 and Figure 7 show the parallel performance metrics for ParallelSA. The speedup is calculated with respect to the execution time of SUBI. Figure 6 shows the speedup with respect to the number of processors. The speedup is close to the number of processors. Although it is not discernible in Figure 6, the speedup is slightly better for a larger number of modes. In Figure 7, we present the efficiency of ParallelSA with respect to the total number of processors used. The parallel efficiency is calculated from the speedup and the number of processors used. The efficiency varies between 90% and 98% in our experiments. Like the speedup, the efficiency varies with the number of modes and the parameters associated with each mode. We obtain better efficiency for a higher number of modes. One possible explanation is that the amount of workload to share among processors increases with the number of modes.
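The measured metrics above follow directly from the timings. The sketch below shows one way to compute them, taking the minimum over repeated runs as described for the experiments; the function names are hypothetical and the timing values are made up for illustration, not taken from the paper's results.

```python
def best_time(runs):
    """Minimum over repeated runs, to damp interference from other grid jobs."""
    return min(runs)


def parallel_metrics(t_serial, t_parallel, m):
    """Return (overhead, speedup, efficiency) for m processors."""
    overhead = m * t_parallel - t_serial   # T_o = m * T_m - T_s
    speedup = t_serial / t_parallel        # S = T_s / T_m
    efficiency = speedup / m               # E = S / m
    return overhead, speedup, efficiency


if __name__ == "__main__":
    # Made-up timings in seconds (illustration only).
    t_s = 120.0                               # serial algorithm, e.g. SUBI
    runs = [16.4, 16.0, 17.1, 16.2, 16.9]     # five parallel runs with m = 8
    print(parallel_metrics(t_s, best_time(runs), m=8))
```

With these illustrative numbers the overhead is 8 seconds, the speedup is 7.5, and the efficiency is 0.9375.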
In Figures 6 and 7, there are spikes at m = 2 due to higher interference from other running jobs in the grid. Each computer node in the grid has four cores, and the grid job scheduler assigns a single core for each processor requested unless explicitly specified otherwise. For m = 2, only two cores of a computer node are used by ParallelSA, and the remaining two cores may be utilized by other running jobs in the grid. Cores in the same node share memory and cache; therefore, the interference from outside jobs is higher for m = 2 than for m being a multiple of four, where each node is occupied only by our schedulability test during its execution.
VII. CONCLUSION
In this paper, we proposed an algorithm for the parallel schedulability analysis of real-time systems with multiple hardware and software modes. The proposed parallel schedulability test is designed to minimize the overhead associated with parallel execution and thereby obtain better speedup and efficiency. The experimental results substantiate the efficacy of the proposed algorithm for parallel schedulability analysis; therefore, the algorithm can be used as an effective tool for design-space exploration when searching for optimal parameters of a multi-mode real-time system. Currently, we are working on developing algorithms for allocating the minimum resource supply for each mode. Our ultimate goal is to restrict the runtime complexity of capacity determination to a pseudo-polynomial number of candidates. Obtaining the minimum capacity for a multi-modal system opens doors for further research in designing real-time control systems that minimize peak temperature or energy consumption.
