Abstract-Concurrent processing has become the default mode of operation in on-chip systems. Silicon has become cheap enough that hardware facilities supporting very large scale concurrent processing can be placed on chip. As a result, the availability and applicability of power is becoming more of a limiting factor than logic. However, the advantage of parallelism in reducing power consumption will soon become unrealistic because of the limited scope for reducing Vdd beyond the threshold voltage, leaving the reduction of concurrency (through the partial shut-down of system blocks) as a realistic means of reducing power consumption when needed. This paper presents a stochastic modelling approach which integrates the degree of concurrency as a parameter into power and latency analysis. This facilitates a system design and management regime in which the degree of concurrency is used as a means of control to achieve power and performance goals.
I. INTRODUCTION
Concurrent processing has been shown to be a successful way of improving execution speed in on-chip and on-board systems [1]. A high degree of concurrency can distribute processing loads to multiple blocks in a system, improving throughput and reducing latency. Although developments in semiconductor technology have made it possible to integrate multiple blocks on a chip without much increase in size [2], power dissipation has become the main bottleneck in improving the performance of concurrent systems. Increasing concurrency can sometimes be used to reduce power consumption, but only under the assumption that Vdd and/or clock frequency can be reduced at the same time [3]. Current technologies already operate at very low Vdd, which cannot be reduced further at run time. In this case, and in systems with the hardware and software resources to support a very high degree of concurrency, power may become the limiting factor on run-time concurrency. Power applicability is limited by supply availability (battery, scavenged power, etc.) as well as by EMI and overheating issues, and can vary with time. A description of applicable power P_a over time is known as a power profile (Figure 1(a)). Even with a stable power supply, high peak power may cause EMI noise and overheating, which may reduce system performance and lifetime [4]. Deciding on the right degree of concurrency so as to optimize the power and latency performance of a system has become a key problem. This concept of systems being limited by applicable power, and of designing systems according to such limitations (called power-elastic design in this paper), is different from conventional low power design [5].
For systems which have the hardware and software resources to support maximally concurrent processing of potential application tasks, but whose applicable power is variable and nondeterministic, an elastic power manager may be needed to control the permitted degree of concurrency in real time based on power applicability information. Such a control scheme is shown in Figure 1(b). Here the power manager consists of a power profiler, which translates real-time available power, consumption information, and power limiting factors such as thermal behaviour into a current and predicted power profile, and a concurrency manager, which decides on the correct degree of concurrency for the system based on this profile.
Figure 1 (a) Power profile P_a(t) over time t; (b) power manager with power profiler and concurrency manager
A. Related work
A power sensitive system can be described as a service provider (SP) dealing with incoming service requests (SRs). A task is employed by an SP (usually a computing block) to serve a certain type of SR and the SP only consumes power when at least one task becomes active for request-servicing.
Markovian methods have been used to model such systems for decades [7, 8, 9] when multiple tasks are provided by an SP to deal with nondeterministic SR arrivals. Each task is taken as independent in an SP's processing. Normally, λ is used to denote the rate of request arrival, μ stands for the rate of task processing in an SP, and P represents the power consumption when an SP is on. Because this kind of modelling method can relate the average power consumption P_ave and latency L of hardware/software designs to the parameters λ, μ, and P, it can help derive optimal power-latency tradeoffs in design.
Although much research exists on such modelling analysis, including discrete-time [8] and continuous-time [9] Markov processes and fine-grain Markov models [7], only the case of a single SP has been studied. In this paper, we consider the multi-SP case and investigate the optimized or permitted degree of concurrency for a given power-latency consideration.
B. Contributions and organization
The main contribution of this paper is the development of a general modelling approach where the system degree of concurrency is related to power and latency performance. Such a method of modelling supports qualitative and quantitative analysis for designers of power-elastic systems who may want to compare different power management algorithms under different operational situations.
The rest of this paper is organized as follows: Section II describes a general method of modelling non-deterministic systems highlighting the degree of concurrency using Markovian techniques. Section III further develops such models to cover uncertainty in the degree of concurrency to accommodate the use of soft arbitration as a concurrency management technique. Section IV develops the method to include representation for cases where transitions between waiting, execution, and idle states are not considered cost free.
II. MODELLING OF CONCURRENCY, POWER AND LATENCY
We assume that the system being controlled consists of identical SPs which can be independently woken up or shut down. When an SR arrives, one SP is chosen (SP selection is out of the scope of this paper) where the corresponding task is activated to service the SR. At most M (1≤M≤N) SPs can be powered on at the same time (the maximum concurrency degree is M) limited by the applicable power profile.
Suppose there are N independent tasks in the system. Because of task independence, it is possible for multiple tasks to be active (and waiting) at the same time. We use j (0≤j≤N) to indicate the number of idle tasks (tasks whose corresponding SRs have not arrived) in the system.
If fewer than M tasks are active (N-j<M), only N-j SPs are needed for task execution. The other M-N+j SPs can be powered off to save power. After the completion of execution, a task becomes idle again.
If at least M tasks are active (N-j≥M), the system must operate at the maximum degree of concurrency. However, the other N-j-M tasks have to wait in a queue until some SPs have been released on task completion. The waiting queue is assumed to follow a first-come first-served policy.
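The two cases above can be captured in a few lines. The following is a minimal sketch, assuming the bookkeeping described in the text; the function name `sp_allocation` and its argument names are illustrative and do not come from the paper.

```python
from typing import Tuple


def sp_allocation(j: int, n_tasks: int, max_on: int) -> Tuple[int, int, int]:
    """Return (SPs powered on, permitted SPs powered off, tasks waiting).

    j       -- number of idle tasks (0 <= j <= N)
    n_tasks -- N, the number of independent tasks
    max_on  -- M, the maximum concurrency degree (1 <= M <= N)
    """
    if not 0 <= j <= n_tasks:
        raise ValueError("j must satisfy 0 <= j <= N")
    if not 1 <= max_on <= n_tasks:
        raise ValueError("M must satisfy 1 <= M <= N")
    active = n_tasks - j                 # tasks that are waiting or executing
    powered_on = min(active, max_on)     # N-j SPs if N-j < M, otherwise M
    powered_off = max_on - powered_on    # M-N+j SPs can be powered off
    waiting = active - powered_on        # N-j-M tasks queue when N-j >= M
    return powered_on, powered_off, waiting
```

For instance, with N=15 and M=5, a state with 12 idle tasks needs only three SPs, whereas a state with 4 idle tasks keeps all five SPs busy and leaves six tasks in the queue.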
A. Modelling of the degree of concurrency
Figure 3 is the stochastic model for the type of system investigated in this work. In this model, we use the number of idle tasks as the state variable. For example, in the state N, all tasks are idle and all SPs are powered off. Since the system moves from the state N to N-1 when any one of the N tasks is activated, the corresponding transfer rate is Nλ (each idle task leaves the idle state at the same rate λ). A task is executed in an SP with the execution rate μ (a task in execution leaves the execution/active state and becomes idle again at the rate of μ). In order to simplify the model, both the power on and power off mode switches for an SP are taken as cost free in both time and power (with a rate of infinity and delay of zero).
If one of the other N-1 tasks becomes active before the execution of the first active task is completed, the system moves to the state N-2, and another SP is powered on for task processing accordingly. With two tasks being executed, the rate of one of them leaving execution and becoming idle is 2μ. When the system is in the state j=i (N-M<i<N), there are N-i active tasks being executed, and it may move to the state i-1 with the transfer rate i×λ. With N-i SPs on for processing in the state j=i, the execution rate becomes (N-i)μ.
When the system is in the state j=i (i≤N-M), all M SPs are already on for task execution, and the corresponding transfer rate from the state j=i to j=i+1 is constant Mμ.
The probability distribution of all states can be calculated analytically and numerically.
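As a numerical illustration, the chain described above is a birth-death process, so its stationary probabilities can be obtained by balancing the flow between neighbouring states. The sketch below assumes exactly the rates given in the text (state j moves to j-1 at rate jλ and to j+1 at rate min(N-j, M)μ); the function name `stationary_distribution` is illustrative.

```python
from typing import List


def stationary_distribution(n_tasks: int, max_on: int,
                            lam: float, mu: float) -> List[float]:
    """Stationary probabilities of the states j = 0..N (j = idle tasks)."""
    if not 1 <= max_on <= n_tasks:
        raise ValueError("M must satisfy 1 <= M <= N")
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    weights = [1.0]  # unnormalised weight of state 0
    for j in range(n_tasks):
        up_rate = min(n_tasks - j, max_on) * mu  # j -> j+1: a task completes
        down_rate = (j + 1) * lam                # j+1 -> j: an idle task activates
        weights.append(weights[-1] * up_rate / down_rate)
    total = sum(weights)
    return [w / total for w in weights]
```

When M=N no task ever waits, each task alternates independently between idle and executing, and the result reduces to a binomial distribution, which provides a convenient sanity check.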
B. Power, latency and combined analyses
To differentiate power from probability distribution, we use Q_j to stand for the probability that the system is in state j (j≤N), and P for the power consumption of one SP. Many low power technologies can be used to power SPs on and off in a concurrent system. For example, clock gating stops the propagation of clock signals and power gating terminates the power propagation [3]. In this high level model, we simply assume an SP consumes full power P when it is on and zero power when it is off. Therefore, the power dissipation when the system is in the state j=i is (N-i)P when N-M<i<N or MP when i≤N-M. The average power consumption of the system P_ave(M) is presented in Equation 1:
where the probabilities can be expressed in terms of the rates of the system (λ and μ) by first expressing all Q_i in terms of Q_0:
For the measure of latency, we use W, the average time spent by a task in both waiting and execution stages, i.e. between activation and becoming idle again. It can be derived as follows:
First of all, if L=Ave(j) is the average number of idle tasks, it can be calculated using Equation 2:
The average number of active tasks is then N-L. Meanwhile, the arrival rate into active states is given by λL.
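Under the definitions above, the relations referred to so far in this subsection can be written out as follows. This is a reconstruction from the balance conditions of the chain and the stated power levels, not the authors' typeset equations. The product form for Q_i follows from equating the flow iλQ_i from state i towards i-1 with the flow min(N-i+1, M)μQ_{i-1} in the opposite direction.

```latex
\begin{align}
Q_i &= Q_0 \prod_{n=0}^{i-1} \frac{\min(N-n,\,M)\,\mu}{(n+1)\,\lambda},
  \qquad \sum_{i=0}^{N} Q_i = 1, \\
P_{ave}(M) &= \sum_{i=0}^{N-M} M P\, Q_i \;+\; \sum_{i=N-M+1}^{N} (N-i)\, P\, Q_i, \\
L &= \sum_{j=0}^{N} j\, Q_j .
\end{align}
```

Summing the balance conditions over all states also gives λL = μ·P_ave(M)/P, so the arrival rate into active states equals the rate at which tasks complete.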
The average latency W(M) is thus described in Equation 3
where T is the average time for executing a single task:
Figure 3 System Markov chain model
If power is not a limiting constraint, but a factor that can be balanced with performance, an optimum M may be found for any particular power and latency balance. This type of optimization can be described as follows. Given a certain weight C (0<C<1), for any possible concurrency degree M (1≤M≤N), the one which minimizes C·P_ave(M)+(1-C)·W(M) is the optimized M (M_opt). In other words, M_opt satisfies:
This kind of analysis can be done at design time for power-elastic systems so that at run time, power profile permitting, the concurrency degree can be set close to M_opt.
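A design-time sweep of this kind can be sketched as below. Two points are assumptions rather than statements from the paper: the latency is computed with Little's law as W = (N-L)/(λL), which is one natural reading of the derivation above, and the optimum is found by exhaustive search over M. All function names are illustrative.

```python
from typing import List


def stationary_distribution(n_tasks: int, max_on: int,
                            lam: float, mu: float) -> List[float]:
    """Stationary probabilities Q_j of the states j = 0..N (j = idle tasks)."""
    weights = [1.0]
    for j in range(n_tasks):
        # Balance the flow between states j and j+1.
        weights.append(weights[-1] * min(n_tasks - j, max_on) * mu
                       / ((j + 1) * lam))
    total = sum(weights)
    return [w / total for w in weights]


def average_power(q: List[float], max_on: int, p_sp: float) -> float:
    """Average power: min(N-j, M) SPs are on in state j, each drawing P."""
    n_tasks = len(q) - 1
    return p_sp * sum(min(n_tasks - j, max_on) * qj for j, qj in enumerate(q))


def average_latency(q: List[float], lam: float) -> float:
    """Average time between activation and becoming idle again.

    Assumption: Little's law, (average active tasks) / (arrival rate).
    """
    n_tasks = len(q) - 1
    idle = sum(j * qj for j, qj in enumerate(q))  # L = Ave(j)
    return (n_tasks - idle) / (lam * idle)


def optimal_concurrency(n_tasks: int, lam: float, mu: float,
                        p_sp: float, weight: float) -> int:
    """Return the M in 1..N minimising C*P_ave(M) + (1-C)*W(M)."""
    if not 0.0 < weight < 1.0:
        raise ValueError("the weight C must satisfy 0 < C < 1")

    def cost(max_on: int) -> float:
        q = stationary_distribution(n_tasks, max_on, lam, mu)
        return (weight * average_power(q, max_on, p_sp)
                + (1.0 - weight) * average_latency(q, lam))

    return min(range(1, n_tasks + 1), key=cost)
```

A call such as `optimal_concurrency(15, 1.0, 1.0, 1.0, 0.5)` returns the concurrency degree that minimises the equally weighted cost. Because power and latency carry different units, the weight C also acts as a scaling factor, and in practice the two terms may need to be normalized first.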
C. Case study
Here we present a case study with N=15 and normalized P, T and μ of 1 (each SP consumes one unit of power during execution, with a completion time and rate of 1). Figure 4 to Figure 7 describe the power and latency performance for various values of M. In general, with larger M the power consumption is higher and the latency is lower. When balancing these two, it is possible to find an optimal M for a given weighting factor C. When the system is saturated with high λ values, the power consumption is asymptotically MP.
With large M, however, the arbiters can become complex, with performance and cost penalties. Fortunately, unlike hard enumerable resources such as software threads and hardware blocks, power allows a degree of softness in arbitration. In other words, a degree of imprecision in M (occasionally allowing more than M SPs to run at the same time) is tolerable. This permits the use of soft arbiters [6], which are much cheaper to implement and run.
Figure 8 Uncertainty in M
The model in Figure 8 includes representation for cases where, at a probability of 1-α, soft arbitration allows M+1 SPs to execute at the same time. The *-marked branch can have its own rate of execution (μ*) and power cost (P*) because of potential Vdd droop in such cases. The average power consumption can be calculated as:
In Equation 5, we use Q_j and Q_j* to represent the probabilities that the system is in the states j or j*. Latency behaviour can be similarly derived.
IV. WAKEUP AND SHUTDOWN COSTS
So far we have assumed zero power and latency costs for switching an SP from off to on and back. In reality such mode switching always costs both time and power. In this section we further develop our models to cover mode switching costs.
In order to represent these properly, mode switching processes need to be incorporated into system states. Because wakeup is a process that happens between waiting and execution, the previous practice of grouping waiting and execution together into "active" and representing the system state with just the number of idle processes j is no longer applicable. For models in this section, we represent a system state with four state variables: k = the number of SPs in the shutdown process, h = the number of SPs in the wakeup process, c = the number of active tasks waiting in the queue, j = the number of tasks in idle. The system state is thus s = {k, h, c, j}.
A task cycles through idle (one of j), waiting (one of c), and execution. An SP cycles through off, wakeup, execution, and shutdown. The shutdown process of a single SP completes at the rate γ, and the wakeup process of a single SP completes at the rate δ.
For the case where there is no active task waiting in the queue (c=0), a basic tile of the model is shown in Figure 9.
Figure 9 Non-zero wakeup and shutdown costs with empty queue (state label shown: (k+1,h,0,j))
When the system is in the state {k, h, 0, j}, the first of the k SPs in shutdown to complete will do so at the rate of kγ. This will cause the system to change to the state {k-1, h, 0, j+1} (one fewer SP in shutdown and one more task in idle). The first of the h SPs in wakeup to complete will do so at the rate of hδ. This will move the system into the state {k, h-1, 0, j} (one fewer SP in wakeup). The other two state transitions represent the first executing task to complete execution and the first idle task to become active.
Combining tiles in the forms of Figure 9 and Figure 10, a full model with representation of non-zero wakeup and shutdown costs has been constructed. This model, which is too complex to show here, assumes that it is not possible to have true simultaneous completion of the wakeup, shutdown, task becoming ready, and execution processes. This assumption is reasonable because the granularity of this modelling approach treats the start and completion of these processes as atomic, and the processes themselves as non-atomic.
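The empty-queue tile can be sketched as a function that lists the transitions leaving a state {k, h, 0, j}. Only the shutdown and wakeup transitions (targets and rates) are taken from the text. The remaining details are assumptions made for illustration: the number of executing tasks is taken as N-j-k-h, a completed execution is assumed to put its SP into shutdown at the rate μ per executing task, and an activating idle task (rate λ per idle task) is assumed to find a free SP and start its wakeup. The name `empty_queue_tile` is illustrative.

```python
from typing import List, Tuple

State = Tuple[int, int, int, int]  # (k, h, c, j) as defined in the text


def empty_queue_tile(k: int, h: int, j: int, n_tasks: int, lam: float,
                     mu: float, gamma: float,
                     delta: float) -> List[Tuple[State, float]]:
    """List (target state, rate) pairs leaving the state (k, h, 0, j)."""
    # Assumption: every non-idle task is tied to an SP that is in wakeup,
    # in execution, or in shutdown, so the executing count is what remains.
    executing = n_tasks - j - k - h
    if min(k, h, j, executing) < 0:
        raise ValueError("inconsistent state for the given N")
    moves: List[Tuple[State, float]] = []
    if k > 0:
        # From the text: first of k shutdowns completes at rate k*gamma.
        moves.append(((k - 1, h, 0, j + 1), k * gamma))
    if h > 0:
        # From the text: first of h wakeups completes at rate h*delta.
        moves.append(((k, h - 1, 0, j), h * delta))
    if executing > 0:
        # Assumed target and rate: a finished task sends its SP to shutdown.
        moves.append(((k + 1, h, 0, j), executing * mu))
    if j > 0:
        # Assumed target and rate: an idle task activates and an SP wakes up.
        moves.append(((k, h + 1, 0, j - 1), j * lam))
    return moves
```

Enumerating such tiles over all reachable states, together with the corresponding tiles for a non-empty queue, would yield the generator of the full chain, which can then be solved numerically for its stationary distribution.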
Models of this type can be used for power and latency analysis in the same way as those presented in Sections II and III. For example, in Figure 11 we compare the zero wakeup and shutdown cost model with a non-zero cost one for the same example system with M=5. This particular case shows that it is possible for these types of systems to consume more average power than with all M permitted SPs on. With very large λ the two curves converge because all permitted SPs are constantly on and no mode switching takes place. 
Figure 11 Comparing zero and non-zero cost models
With these types of analyses system designers can discover such operating conditions and incorporate allowances for them when designing power control algorithms.
V. DISCUSSIONS AND FUTURE WORK
A modelling approach for the analysis of power and latency performance highlighting the effect of the degree of concurrency for multi-block systems is presented in this paper. This approach and the analyses it supports are essential for designers of power-elastic systems using concurrency management as a way of making maximum use of applicable power. It is also a generally important development where the degree of concurrency is properly modelled in the context of power consumption and latency. Currently the power and latency costs of the power controller itself are not modelled and we plan to extend our models to cover these. In parallel to this work, we are also developing hardware solutions for power-elastic controllers based on concurrency management.
