Abstract: A major trend in modern system-on-chip design is growing system complexity, which sharply increases the communication traffic on on-chip bus architectures. In a real-time embedded system, the task arrival rate, the inter-task arrival time, and the data size to be transferred are not uniform over time. This is due to the partial re-configuration of an embedded system to cope with a dynamic workload. In this context, traditional application-specific bus architectures may fail to meet the real-time constraints. Thus, to incorporate the random behavior of on-chip communication, this work proposes an approach to synthesize an on-chip bus architecture that is robust for a given distribution of random tasks. The randomness of communication tasks is characterized by three main parameters: the average task arrival rate, the average inter-task arrival time, and the data size. For synthesis, the on-chip bus requirement is guided by the worst-case performance need, while the dynamic voltage scaling technique is used to save energy when the workload is low or the timing slack is high. This, in turn, results in an effective utilization of communication resources under variable workload.
synthesis algorithm was proposed, which is, however, mainly based on the bus templates of AMBA [1] with standard bus widths.
All the above techniques synthesize an application-specific bus architecture, which may fail to meet the real-time constraints if the communication behavior of tasks is random. Recently, a technique for dynamic re-configuration of a synthesized bus topology was studied in [19] to cope with variable workload. The approach optimizes a post-synthesis bus architecture for the re-configuration of bus protocols. A bus scheduling approach for aperiodic tasks was proposed in [6], which models random tasks and schedules them for a synthesized bus architecture. This approach also deals with bus optimization rather than with synthesizing a bus architecture. However, the random task modeling technique proposed in our work is similar to their approach.
The idea of scaling the voltage of a task by exploiting its timing slack for energy reduction is not new. The technique presented in [10] proposes dynamic voltage scaling of a microprocessor under variable workloads, while the work of [7] used a voltage scaling technique for both the processor and the communication bus to reduce the energy of IPs and the bus architecture. However, the approach is only used for the power optimization of a post-synthesis bus architecture. Recently, a simultaneous bus synthesis and voltage scaling technique was presented in [14], [15], which finds the optimal bus width and the number of buses. Furthermore, it explores a trade-off between communication resources and power consumption during bus synthesis. The bus synthesis technique in our work is similar to [14], [15]; however, the approach proposed in [14], [15] was limited to tasks with a deterministic arrival time. Thus, the synthesized bus architecture may fail to meet the real-time constraints if the arrival time and rate of tasks are random due to the partial re-configuration of a system.
The main contribution of this work is to synthesize a bus architecture in the presence of random task arrival. As a result, the synthesized bus architecture is robust for a given probability distribution of random tasks. The randomness of tasks is modeled with three parameters, i.e., task arrival rate, inter-task arrival time, and data size, together with their probability distribution functions. The bus synthesis is guided by the worst-case performance need, while the dynamic voltage scaling technique is used to reduce the energy when the timing slack is high or the workload is low. The dynamic voltage scaling technique presented in this work is similar to [7], [14], [15]. In this paper, the bus synthesis problem is formulated as an optimization problem and solved using a convex optimization tool. The experiments carried out on automatically generated tasks and real-life multimedia applications validate the proposed bus synthesis technique under random task arrival and show that the synthesized bus architecture is robust for a given distribution of random tasks.
The remainder of this paper is organized as follows. In Section II, we give preliminaries on the target architecture model and communication tasks. Section III introduces a motivational example for bus synthesis under random task arrival rate, inter-task arrival time, and random data size. Section IV derives a model for random task arrival rate, inter-task arrival time, and data size to synthesize a bus architecture. Section V gives a mathematical formulation and optimization techniques for the bus synthesis and voltage scaling problems. The continuously scaled voltages of each task are transformed into discrete voltages in Section VI using a voltage selection algorithm. In Section VII, we present case studies and results to validate our bus architecture synthesis method under random task arrival, and finally, in Section VIII, we give the conclusion of this work.
II. PRELIMINARIES
We consider an embedded system which is realized as a multiprocessor system-on-a-chip (MPSoC). Such a system consists of several on-chip processing modules like general-purpose processors, application specific integrated processors (ASIPs), application specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). These on-chip modules communicate with each other by transferring data through a shared bus. We assume that a system has been partitioned into HW/SW and mapped efficiently onto the appropriate modules of an SoC as shown in Fig. 1(a). In the figure, a set of tasks τ ∈ T, which is mapped onto a module, is called data processing tasks. These tasks process data, e.g., a fast Fourier transformation (FFT), a discrete cosine transformation (DCT), or any other computation within a module. After processing data, the driver of a module establishes communication between modules and transfers data for further processing. All communications c ∈ C that take place among the on-chip modules using on-chip buses are captured by communication tasks ci, indicated by black boxes in Fig. 1(b). Since a complex system runs a diversity of applications on a single SoC, the workload offered to an embedded system is not uniform over time. This introduces randomness in the size of data to be transferred, the communication task arrival rate, and the inter-task arrival time. These parameters are extracted by profiling a HW/SW system for different scenarios at system level using the following formulae
Eqs. (1) and (2) give the task arrival rate and the inter-task arrival time, respectively, for a given time window [ti, tj]. Based on the profiling of a system, communication tasks and their dependencies are modeled as shown in Fig. 1(a), where communication tasks c1, c2, and c3 with solid lines are the tasks for one scenario, while the additional tasks c4 and c5 with dotted lines are for another scenario. Each communication task in the figure takes a certain time to transfer data. This duration is called a communication lifetime interval (CLTI), which is a function of data size, bus width, and voltage. From the extended task graph GE(T, E), a directed acyclic communication task graph GC(C, Π) is obtained to schedule communication tasks for different bus widths and voltages. In Fig. 1(c), a node c ∈ C is a communication task, while an edge π ∈ Π gives the dependency between the communication tasks. Further, the weight w of an edge between two nodes ci and cj is the data processing time of a task τi, which gives an earliest start time constraint for the successor cj to transfer data using a bus. The data processing time of each task τ can be evaluated as
where NCτ is the number of cycles to execute a task τ and Td is a gate delay, which is a function of voltage and technology dependent parameters [20] as shown in Eq. (4). Similarly, the CLTI of each communication task c is modeled as
where NBc(ζ) (number of bits) is a random size of data to be transferred by a task c with bus width br and supply voltage Vdd. In Eq. (4), the notations κ1, κ2, κ3, and α are the technology dependent parameters. The dynamic energy consumption of a communication task c is modeled as
where Ceff is the effective switched capacitance of the communication bus. The energy overhead for switching from voltage Vi to Vj is ε
where Cr is the capacitance of the power rail. The time overhead for switching from Vi to Vj is given by
where ρ is a constant.
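The following Python sketch illustrates how the quantities of this section relate to each other. The functional forms are assumptions chosen as common first-order models and are not taken from Eqs. (3) to (8): processing time as NCτ · Td, the CLTI as the number of bus beats ⌈NB/br⌉ times Td, dynamic energy as Ceff · Vdd² per beat, and switching overheads of Cr (Vi − Vj)² and ρ |Vi − Vj|. The gate delay Td is passed in as a number rather than derived from Eq. (4), and all function names are hypothetical.

```python
import math


def processing_time(num_cycles: int, t_d: float) -> float:
    """Assumed form: execution time = number of cycles x delay per cycle."""
    return num_cycles * t_d


def bus_beats(num_bits: int, bus_width: int) -> int:
    """Number of bus transfers needed to move num_bits over a bus_width-bit bus."""
    return math.ceil(num_bits / bus_width)


def clti(num_bits: int, bus_width: int, t_d: float) -> float:
    """Assumed form of the communication lifetime interval: beats x delay."""
    return bus_beats(num_bits, bus_width) * t_d


def transfer_energy(num_bits: int, bus_width: int, c_eff: float, v_dd: float) -> float:
    """Assumed dynamic energy: C_eff * Vdd^2 switched once per bus beat."""
    return bus_beats(num_bits, bus_width) * c_eff * v_dd ** 2


def switch_energy(c_r: float, v_i: float, v_j: float) -> float:
    """Assumed energy overhead of moving the power rail from v_i to v_j."""
    return c_r * (v_i - v_j) ** 2


def switch_time(rho: float, v_i: float, v_j: float) -> float:
    """Assumed time overhead, proportional to the voltage step."""
    return rho * abs(v_i - v_j)
```

Under these assumptions, doubling the bus width halves the number of beats and therefore the CLTI, while lowering Vdd reduces the energy quadratically; the delay penalty of a lower voltage enters only through the value supplied for `t_d`.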
III. MOTIVATIONAL EXAMPLE
In this section, we present a motivation for bus synthesis under a random task arrival rate with a random inter-task arrival time and illustrate that the synthesized bus is robust for a given probability distribution of the task arrival rate, the inter-task arrival time, and the data size. The voltage scaling technique is used to reduce the bus energy consumption when the data traffic is low or the timing slack of the communication tasks is high. This, in turn, results in an optimum utilization of the buses under random data traffic. We consider the partitioned and mapped system shown in Fig. 1(a) and schedule the tasks to synthesize a bus architecture. Fig. 2 depicts task scheduling and bus synthesis for three different scenarios, where each scenario is characterized by the average task arrival rate λn, the average inter-task arrival time λτ, and the random data size NB(ζ). In the figure a black rectangle denotes the data transfer delay at Fig. 2(d) does not meet the real-time constraints, as the tasks overlap with each other. Thus, two shared buses are needed to meet the time constraints as shown in Fig. 2(e). Similarly, in Fig. 2(c), five communication tasks with average inter-task arrival time 2.8ms are depicted. After scheduling of the tasks, two shared buses with interconnection of modules are shown in Fig. 2(e).
Among the three different scenarios shown in Fig. 2, the worst-case performance need is λn = 5 and λτ = 2.8ms. Thus, the bus is synthesized considering the worst-case scenario, and the voltage of communication tasks is scaled to reduce the energy consumption when the average number of task arrivals λn is low and the average inter-task arrival time λτ is high, or when the timing slack is high. This effectively utilizes the bus resources even when there is a variation in data traffic. In Fig. 2(c), the slack of the communication tasks c1, c2, and c5 can be used to scale the voltage, while the voltage of the other tasks should be kept at the nominal voltage. Similarly, for the scenario with λn = 3 and λτ = 2.66ms, the slack of communication tasks c1, c2, and c3 can be exploited for energy reduction. Intuitively, the larger the slack of a communication task, the lower its energy consumption after voltage scaling.
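The observation that overlapping communication tasks require separate buses can be sketched as a sweep over transfer intervals. The Python function below is a minimal illustration, not the synthesis method of this paper: it counts the peak number of transfers that are active at the same time, which is a lower bound on the number of shared buses. The function name and the intervals in the usage note are hypothetical and are not taken from Fig. 2.

```python
def peak_overlap(intervals):
    """Return the largest number of transfers active at the same time.

    Each interval is a (start, end) pair treated as half-open, so a transfer
    that ends exactly when another one starts does not overlap with it.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a transfer begins
        events.append((end, -1))    # a transfer finishes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so finished transfers release the bus before new ones claim it.
    events.sort()
    active = 0
    peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak
```

For example, `peak_overlap([(0, 2), (1, 3), (3, 4)])` evaluates to 2, since only the first two transfers overlap, whereas back-to-back transfers such as `[(0, 1), (1, 2)]` can share a single bus.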
IV. MODELING OF RANDOM TASKS
We assume that the arrival rate and the inter-task arrival time of the communication tasks c ∈ C have a Poisson distribution. Eqs. (9) and (10) give the probability density functions of the task arrival rate and the inter-task arrival time, respectively.
where c is the number of communication tasks and λn is the average arrival rate of communication tasks in a given time interval [ti, tj],
where t is the arrival time of a communication task and λτ is the average inter-task arrival time of communication tasks.
The probabilistic constraint of random task arrival rate can be expressed as
where βc is a confidence level such that, in Eq. (11), the number of task arrivals in a time interval [ti, tj] is less than or equal to cl with probability βc. After an algebraic manipulation of Eqs. (11) and (9), Eq. (11) can be expressed as
Similarly, the probabilistic constraint of the random inter-task arrival time can also be modeled as in Eqs. (11) and (12) with confidence level βτ. In Eq. (12), the notation Kc is a constant term. Further, the data size to be transferred by each communication task c ∈ C is modeled as a random variable with a known probability distribution function. Let dlc be the deadline of task c ∈ C; then the relation between the CLTI and the deadline dlc can be written as [14], [15]
$$\forall c \in C,\quad P\left(dl_c - CLTI_{c,r,V_{dd},V_{bs}} - \delta_{\Delta V_{i,j}} \ge 0\right) \ge \eta \qquad (13)$$
Eq. (13) gives a probabilistic delay constraint for each task c ∈ C such that the data transfer delay of each task should be less than or equal to the deadline. The notation η can be considered to be a confidence level. This constraint can be transformed into a deterministic constraint as follows [14], [15]
$$dl_c - \mu_{CLTI}(NB_c, T_d) - \delta_{\Delta V_{i,j}} \ge \phi^{-1}(\eta)\,\sigma_{CLTI}(NB_c, T_d) \qquad (14)$$
where μCLTI(NBc, Td) and σCLTI(NBc, Td) are the mean and standard deviation of the CLTI, respectively. The term φ−1(·) is an inverse of the error function. Intuitively, in Eq. (14) the notation η controls the scaling of voltage during bus synthesis. For different arrival rates, arrival times, and data sizes of tasks, the confidence level η sets the voltage to a certain level so that the standard deviation of the delay σCLTI(NBc, Td) is changed in order to meet the real-time constraint and to utilize the bus resources effectively.
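The two kinds of probabilistic constraints in this section can be illustrated numerically. The Python sketch below rests on stated assumptions rather than on the paper's exact equations: the number of arrivals in a window is taken to follow the standard Poisson probability mass function with mean λn, the CLTI is taken to be Gaussian, and φ−1 is interpreted as the quantile function of the standard normal distribution. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def poisson_cdf(k: int, lam: float) -> float:
    """P(N <= k) for a Poisson random variable N with mean lam."""
    term = math.exp(-lam)          # P(N = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i            # P(N = i) from P(N = i - 1)
        total += term
    return total


def arrival_bound(lam: float, beta: float, max_count: int = 10_000) -> int:
    """Smallest count c_l such that P(N <= c_l) >= beta."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    for k in range(max_count + 1):
        if poisson_cdf(k, lam) >= beta:
            return k
    raise RuntimeError("no bound found up to max_count")


def meets_deadline(deadline: float, mu_clti: float, sigma_clti: float,
                   overhead: float, eta: float) -> bool:
    """Deterministic equivalent of P(deadline - CLTI - overhead >= 0) >= eta,
    assuming a Gaussian CLTI with the given mean and standard deviation."""
    z = NormalDist().inv_cdf(eta)  # standard normal quantile for confidence eta
    return deadline - mu_clti - overhead >= z * sigma_clti
```

With λn = 4 and βc = 0.99, the values used in Section VII, `arrival_bound(4.0, 0.99)` returns 9, i.e., under the Poisson assumption a bus dimensioned for nine arrivals per window covers at least 99% of the windows. Raising η increases the quantile `z`, so the same deadline tolerates less delay variation.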
V. SIMULTANEOUS BUS SYNTHESIS AND VOLTAGE SCALING
For the simultaneous bus synthesis and voltage scaling problem, the data processing tasks τ and communication tasks c are scheduled together, where voltage is scaled to reduce the energy consumption when the workload of a system is low or the timing slack is high. The formulation of an optimization problem is given as follows:
subject to,
Xc,t,r ≤ br : ∀t ∈ Ω, r ∈ R (20)
The objective is to minimize the communication bus cost (bus width and number of buses) as shown in Eq. (15), where ri ∈ R is a bus from a library R of on-chip buses with different bus widths. The cost Cr of each bus ri is expressed in terms of the bus width and is stored in a lookup table; e.g., the cost of a 32-bit wide bus is twice the cost of a 16-bit wide bus. In Eq. (16), the sum of the start time sτ, the execution time wτ,Vdd,Vbs, and the switching overhead δΔVi,j of each task τ must be less than or equal to its deadline dlτ. Further, a task τ can start its execution only after its predecessor (communication task c) completes transferring data, as shown in Eq. (17). A binary decision variable Xc,t,r ∈ {0, 1} indicates the scheduling of a communication task c at time t ∈ {0, ..., λ} with a bus width r, as shown in Eq. (18). The term λ is the maximum possible time to schedule a task c. Eq. (19) gives a dependency between the successor (communication task c) and the predecessor (data processing task τ) such that a task c is scheduled at time t to maximize the sharing of buses and to scale the voltage for energy reduction. Let Ω = ∪c∈C {ASAPc, ..., ALAPc} be a time window such that the tasks that are scheduled within this interval could overlap. If the timing of a task overlaps with another task, then the task is assigned to a separate bus with index b and width r, as shown in Eq. (20) [14], [15]. Since the delay interval CLTIc,r,Vdd,Vbs of a task c is a function of the two random variables data size NBc and gate delay Td (see Eq. (5)), Eq. (21) gives a probabilistic constraint such that the overall delay of each task c must be less than or equal to the deadline dlc with a confidence level η. Its equivalent deterministic constraint is given in Eq. (14).
Algorithm 1: Discrete supply and body bias voltage selection.
Similarly, the probabilistic constraints for the random task arrival rate and inter-task arrival time are given in Eqs. (22) and (23) with their confidence levels βc and βτ, respectively. The supply voltage Vdd and body bias voltage Vbs of both tasks τ and c are scaled continuously, and their constraints are given in Eq. (24). In the above formulation, the objective function is linear in the optimization variable ri, and the probabilistic constraints (Eqs. (21), (12), (22), and (23)) are non-linear in the voltage and the optimization variable ri. Thus, the simultaneous bus synthesis and voltage scaling problem described above belongs to the class of convex quadratic optimization problems [15], for which a globally optimal solution can be found in polynomial time [12].
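To make the interplay between the bus cost and the deadline constraint concrete, the following Python sketch picks, for a single transfer, the narrowest bus from a library that still meets the deadline. It is a simplified, deterministic stand-in and not the convex formulation of this section: the width library, the cost rule (a 16-bit bus costs 1 and the cost grows linearly with width), the fixed time per bus beat, and the function names are all assumptions made for illustration.

```python
import math

# Assumed bus library (widths in bits); the cost rule mirrors the statement
# that a 32-bit wide bus costs twice as much as a 16-bit wide bus.
BUS_LIBRARY = (16, 32, 48, 64)


def bus_cost(width: int) -> float:
    """Relative cost of a bus, normalized to a 16-bit wide bus."""
    return width / 16


def narrowest_width(start: float, num_bits: int, deadline: float,
                    beat_time: float, overhead: float = 0.0,
                    library=BUS_LIBRARY):
    """Return the cheapest width whose transfer finishes by the deadline.

    The finish time is start + (number of bus beats) * beat_time + overhead,
    an assumed simplification of the start-time/delay/overhead constraint.
    Returns None when no width in the library is fast enough.
    """
    for width in sorted(library):
        finish = start + math.ceil(num_bits / width) * beat_time + overhead
        if finish <= deadline:
            return width
    return None
```

For a 256-bit transfer starting at time 0 with one time unit per beat and a deadline of 6, the 16-bit and 32-bit buses need 16 and 8 beats and are too slow, so `narrowest_width(0.0, 256, 6.0, 1.0)` selects the 48-bit bus with `bus_cost(48)` equal to 3.0.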
VI. DISCRETE VOLTAGE SELECTION
Although the continuous voltage scaling technique gives an ideal energy reduction characteristic, it cannot be applied in a digital design due to the limitations of a voltage regulator. Thus, a heuristic is proposed to transform the continuously selected optimal supply and body bias voltages to their discrete upper bounds of Vdd and Vbs, respectively. We could choose the lower bounds of the supply and body bias voltages to get the minimum energy consumption; however, this may violate the given real-time constraints. At line 7 of the algorithm, the delay constraints of each task τ and c are checked against their deadlines dlτ and dlc, respectively, for the near-optimal Vdd and Vbs. If the condition is met, then the heuristic returns those near-optimal voltages at line 14. Otherwise, at each iteration the next supply voltage greater than Vdd is selected from the set of discrete supply voltages Vddz at line 11.
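The rounding step described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions and not a reproduction of Algorithm 1: it handles only the supply voltage, omits the body bias voltage, and relies on a caller-supplied delay model `delay_at` (a hypothetical callable returning the task delay at a given voltage).

```python
from bisect import bisect_left


def select_discrete_vdd(v_opt, levels, delay_at, deadline):
    """Pick the lowest discrete supply voltage that is >= v_opt and meets the deadline.

    v_opt    : continuously optimized supply voltage
    levels   : available discrete supply voltages
    delay_at : hypothetical callable mapping a voltage to the task delay
    deadline : latest admissible completion time of the task
    Returns None if even the highest level violates the deadline.
    """
    levels = sorted(levels)
    idx = bisect_left(levels, v_opt)      # first level not below v_opt (upper bound)
    while idx < len(levels):
        if delay_at(levels[idx]) <= deadline:
            return levels[idx]            # near-optimal discrete voltage found
        idx += 1                          # otherwise try the next higher level
    return None
```

As a usage example with a toy delay model `delay_at = lambda v: 1.0 / v` and levels 0.8, 1.0, 1.2, and 1.4 V, a continuous optimum of 0.93 V is first rounded up to 1.0 V; with a deadline of 0.9 this level is too slow, so the sketch moves on and returns 1.2 V.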
VII. CASE STUDIES
We validate the effectiveness of the proposed technique using a generated benchmark as well as real-life multimedia applications, i.e., an audio decoder [2] and a speech recognition system [3]. The automatically generated benchmark consists of 119 communication tasks [15], and the data size to be transferred by each task is a normally distributed random variable with mean data size (μNB) of 16, 32, 64, 128, 256, and 512 bits and standard deviation 3σNB = 40% of μNB. Each data processing task τ and communication task c can scale its supply voltage from 1.4V to 0.8V and its body bias voltage from 0V to -0.8V. The on-chip communication buses are given as a library of buses with different bus widths, which range from 16 to 128 bits. For the experimental purpose, we consider a bus with [10]. Other technology dependent parameters for the 70nm node were adopted from [4]. The bus synthesis algorithm was implemented in C as a pre-processing model to interface with the convex solver MOSEK [5].
The first set of experiments was carried out on the automatically generated tasks with the aim of synthesizing a robust bus architecture in the presence of random communication tasks with a random arrival time. The average number of task arrivals (λn) is 4 for a time interval of 6 sec. The confidence levels for the task arrival rate βc and the task arrival time βτ are set to 99%, while the confidence level η of the tasks is set to 81%. We performed simultaneous voltage scaling, scheduling, allocation, and binding of communication tasks using the optimization technique presented in Section V. Table I presents the synthesized bus widths and the number of buses for different inter-task arrival times λτ and task arrival rates λn. The results are compared to the results of [15] with deterministic task arrival (i.e., λτ = 1.0, λn = 19, 3σNB = 15%, and η = 89% in the last row). The results show that a bus architecture synthesized for deterministic task arrival does not meet the real-time constraints for tasks with random task arrival. (Note that in [15] only the supply voltage of tasks is scaled; thus, the mean supply voltage in the table for λτ = 1.0 and λn = 19 is high, whereas the other rows benefit from scaling both Vdd and Vbs.) In the column entitled λτ, the average inter-task arrival time of tasks is normalized to the maximum inter-task arrival time λτ(max). In the columns entitled Synthesized Bus (2 and 6), the synthesized bus widths and the number of buses are presented for different λn and λτ. The results show that the bus widths and the number of buses increase with decreasing inter-task arrival time. Intuitively, the smaller the inter-task arrival time, the larger the number of overlaps among the tasks. This, in turn, increases the communication bus cost.
In the columns entitled Bus Cost, the cost was evaluated in terms of bus area so that the costs of 16, 32, 48, and 64-bit wide buses are 1, 2, 3, and 4, respectively. Similarly, column Synthesized Bus (6) shows the synthesized bus widths and the number of buses for an average task arrival rate of 23. In the columns entitled μVdd and μVbs (4, 5, 8, and 9), the mean supply and body bias voltages were evaluated for different task arrival rates and inter-task arrival times.
The second experiment was conducted on real-life multimedia applications, which include an Ogg Vorbis decoder [2] and a speech recognition system [3]. The audio decoder includes four main decoding steps, which are inverse quantization, channel decoupling, curve reconstruction, and IMDCT. After manual partitioning and mapping of the decoder, the IMDCT was mapped to a single hardware module and the rest of the functionality was mapped to a processor. Furthermore, raw audio data was mapped to a compact flash (CF) memory with a CF interface. The extracted audio data was mapped to an audio buffer for streaming. Similarly, the speech recognition system consists of three main components: front end, decoder, and linguist. The front end includes a series of data processing tasks such as pre-emphasis, Hamming window, FFT (fast Fourier transformation), mel frequency filter, IFFT, cepstral mean normalization, and feature extraction to generate the features from the speech. The speech system takes as input a large amount of speech along with its transcription into phonemes to provide the speech models for the phonemes. The recognition is based on the HMM (hidden Markov model) to decode the speech. The American English lexicon consisting of 32 phonemes and a database of 17 different words has been used (spelling out the names of the months, numbers, and digits) [11]. After partitioning of the speech system, the front end, including the FFT and filters, was mapped to dedicated hardware. The training and recognition tasks were mapped to a PowerPC processor. Based on the partitioned and mapped system, the communication task graph and the task arrival rates and times were extracted by profiling the HW/SW system. Figs. 3 and 4 show the synthesized bus widths and the number of buses for a multimedia application. For an average task arrival rate of 13 and an inter-task arrival time of 0.73, three buses with bus widths 24, 32, and 48 are required to meet the real-time constraints.
The mean supply voltage μVdd and body bias voltage μVbs are 1.31V and -0.33V, respectively. However, for another scenario with task arrival rate 17 and inter-task arrival time 0.59, the synthesized buses of Fig. 3 are not robust due to the overlaps among the communication tasks. Thus, an additional 32-bit wide bus is required as shown in Fig. 4, which is robust for a given distribution of task arrival rates, inter-task arrival times, and the variation in data size. For the synthesized bus architecture of Fig. 4, the mean supply voltage μVdd and body bias voltage μVbs are 1.28V and -0.38V, respectively. In order to utilize the bus architecture effectively over time, a dynamic voltage scaling technique is used when the workload is low or the timing slack is high. Further, the memory architecture is synthesized based on the algorithm presented in [13]. The memory synthesis algorithm is based on clique partitioning of a data dependency graph.
Summarizing the experiments, we synthesized bus architectures for automatically generated tasks and real-life multimedia applications incorporating both random task arrival time and random task arrival rate. Compared to the tasks with a deterministic arrival rate and arrival time (i.e., λn = 19 and λτ = 1.0 in the last row), the synthesis results show that the bus width and the number of buses change for different arrival rates and arrival times. Thus, a bus synthesis technique that does not consider those parameters fails to meet the real-time constraints.
VIII. CONCLUSION
In this paper, we proposed a robust on-chip bus architecture synthesis technique in the presence of random on-chip tasks. The term robust means that the synthesized bus architecture meets the real-time constraints for different scenarios. The task arrival rate, the inter-task arrival time, and the data size are modeled as random variables with known probability distribution functions. The bus architecture synthesis technique is formulated as scheduling, allocation, and binding problems. Once correctly formulated, these problems are solved with the help of an optimization tool, which finds the optimal bus widths and the number of buses for robust on-chip communication. The dynamic voltage scaling technique is used to reduce the energy consumption and to utilize the bus resources effectively. The experiments conducted on the automatically generated tasks and the real-life multimedia applications validate the effectiveness of the proposed technique under random on-chip tasks.
As part of future work, we intend to apply dynamic reconfiguration of the communication bus topology so that the bus resources utilization factor can be improved effectively.
