Modern system design is being increasingly driven by applications such as multimedia and wireless sensing and communications, which have intrinsic quality of service (QoS) requirements, such as throughput, error-rate, and resolution. One of the most crucial QoS guarantees that the system has to provide is the timing constraint among the interacting media (synchronization) and within each medium (latency). We have developed the first framework for system design with timing QoS guarantees. In particular, we address how to design a system-on-chip with minimum silicon area to meet both latency and synchronization constraints. The proposed design methodology consists of two phases: hardware configuration selection and on-chip memory/storage minimization. In the first phase, we use silicon area and system performance as criteria to identify all the competitive hardware configurations (i.e., Pareto points) that meet the needs of synchronous applications. In the second phase, we determine the minimum on-chip memory requirement to meet the timing constraints for each Pareto point. An overall system evaluation is conducted to select the best system configuration. We have developed optimal algorithms that schedule a priori specified applications to meet their synchronization requirements with the minimum size of memory. We have also implemented on-line heuristics for real-time applications. The effectiveness of our algorithms has been demonstrated on a set of simulated MPEG streams from popular movies.
INTRODUCTION
Multimedia applications have intrinsic requirements on the deadlines for processing incoming data (latency) and on the coherent playout of different types of data (e.g., synchronization among text, image, audio, and video, or among multiple video/audio streams). The timing relationships among the interacting media and within each medium are referred to as synchronization and latency, respectively (or intermedia synchronization and intramedia synchronization in some contexts). These synchronization requirements must be satisfied at the time when various media (e.g., the display of text and images, the dynamic playout of audio, video and animations) are delivered to the user. For example, the lip-sync of audio and video usually requires 25 or 30 synchronization points per second.
Memory structure plays a vital role in the design of an embedded multimedia system that must provide such synchronization guarantees. A typical application-specific system-on-chip (SoC) consists of a processor core, instruction cache, data cache, background memory, and a set of optional hardware accelerators and control blocks. These components often comprise more than 70% of SoC area. The memory hierarchy, in particular the cache and on-chip memory, is the basis for many synchronization-enhancing techniques. It also significantly affects the embedded system's performance, power dissipation, and overall implementation cost. The synthesis of SoCs for synchronous multimedia applications poses several interesting synthesis and optimization problems. For example, data storage and data processing are equally important in multimedia applications. A lack of either storage or processing power may cause violations of the timing constraints. We can use a powerful processor, which usually takes more silicon area, and a large cache to provide fast processing speed. However, this limits the on-chip memory available for data storage if the total silicon area is fixed. Our goal is to investigate how to balance this trade-off between the processor and the memory such that the synchronization requirements are satisfied with the minimum silicon area.
In this article, we lay out a system design framework that simultaneously optimizes traditional design targets (such as area, cost, power, throughput, testability, and scalability) and provides QoS guarantees. We illustrate the framework with, but do not limit it to, the design of an SoC with minimum silicon area that meets the application's synchronization and latency requirements. We propose a two-phase design methodology: hardware configuration selection and storage minimization. Different processor cores, combined with different configurations of I-cache and D-cache, have different performance. In the hardware configuration selection phase, we exclude a combination if it requires more silicon area but does not produce better performance. For each of the remaining hardware configurations, we determine the minimum storage requirement to satisfy the QoS guarantees by task scheduling. Then we conduct the system performance evaluation and select the best system configuration based on the optimization targets. We develop both offline and online scheduling policies. Our offline algorithm is provably optimal in minimizing the storage under the timing constraints. The earliest deadline first (EDF) algorithm is one of the most widely used online scheduling policies, but it does not give any synchronization guarantees. Experimental results show that our online heuristic is competitive with EDF while providing the timing QoS guarantees.
The highlight of this article is a dynamic programming-based algorithm that finds the minimum storage requirement for a single processor to service a set of applications, together with a feasible schedule that achieves this minimum. All applications' timing constraints (latency and synchronization) are satisfied, and the algorithm has a pseudopolynomial run time. The algorithm assumes a priori knowledge of the data streams. Each data stream consists of a set of tasks that are served by the first come first serve policy. Tasks are allowed to have individual latency and synchronization requirements. The algorithm does not assume that a task's computation requirement is proportional to its storage requirement. Finally, the algorithm is applicable when cache miss and context switch penalties are explicitly specified.
The rest of the article is organized as follows. In the next section, we explain the preliminaries, formulate the problem, and highlight our results. Section 3 shows the detailed storage minimization techniques, where we explain our dynamic programming-based optimal scheduling policy, discuss how it provides QoS guarantees with minimum storage requirement, and propose a simple but effective online heuristic for real-time applications. In Section 4, we present the global SOC design flow for synchronization and other QoS guarantees. We report the experimental results on a simulated real media benchmark in Section 5. We discuss the related work in Section 6, before concluding.
PROBLEM FORMULATION AND KEY RESULTS
In this section we first lay out the application model and QoS metrics that are used in our approach. We formulate three different but related problems: storage minimization with QoS guarantees, task completion maximization under storage constraints, and silicon area minimization with QoS guarantees. We explain the relevance of these problems and list the highlights of our solutions.
Application and Quality of Service Model
We assume that a single processor system receives data streams from multiple reliable end-to-end connections. Each data stream is an application that consists of a sequence of tasks. Associated with each task are its arrival time, execution time, latency, storage requirement, and synchronization specification with tasks in other applications. Formally speaking, the $j$th task $T_{ij}$ of the $i$th application $A_i$ has the following parameters:
- $t_{ij}$: the arrival time of task $T_{ij}$;
- $\tau_{ij}$: the execution time of task $T_{ij}$ with a given hardware configuration;
- $l_{ij}$: latency, that is, the maximum time that task $T_{ij}$ can stay in the system. At time $t_{ij} + l_{ij}$, the system will either complete task $T_{ij}$ or drop it. We define task $T_{ij}$'s deadline to be $d_{ij} = t_{ij} + l_{ij}$;
- $m_{ij}$: the memory required to store task $T_{ij}$ in the system; and
- $(n^k_{ij}, s^k_{ij})$: the synchronization specification of task $T_{ij}$ with respect to the tasks of another application $A_k$.
System Synthesis

The pair $(n^k_{ij}, s^k_{ij})$ gives the QoS requirement on synchronization. Synchronization is only applicable to tasks from different applications, as tasks within the same application are served by the first come first serve policy. To express how a task $T_{ij}$ from application $A_i$ should be synchronized with tasks from other applications, for each application $A_k$ we need to specify which task (if any) needs to be synchronized with $T_{ij}$ and how tight the synchronization is. In our representation, the integer $n^k_{ij}$ is the index of the task from $A_k$ to be synchronized and $s^k_{ij}$ gives the level of synchronization. For example, considering the fifth task $T_{1,5}$ of application $A_1$, if we have $n^2_{1,5} = 6$ and $s^2_{1,5} = 4$, then tasks $T_{1,5}$ and $T_{2,6}$ must be synchronized so that their completion times differ by no more than 4 time units.
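The task parameters and the synchronization pair can be sketched as a small data structure. This is only an illustration: the class name `Task`, the field names, the helper `sync_satisfied`, and all numeric values are hypothetical and do not come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    arrival: float    # t_ij
    exec_time: float  # tau_ij, for one given hardware configuration
    latency: float    # l_ij
    memory: int       # m_ij
    # sync[k] = (n, s): finish within s time units of task n of application k
    sync: dict = field(default_factory=dict)

    @property
    def deadline(self) -> float:
        # d_ij = t_ij + l_ij
        return self.arrival + self.latency


def sync_satisfied(finish_a: float, finish_b: float, level: float) -> bool:
    """True when two completion times differ by at most `level` time units."""
    return abs(finish_a - finish_b) <= level


# Hypothetical numbers for task T_{1,5}, synchronized with T_{2,6} at level 4.
t_1_5 = Task(arrival=5, exec_time=1, latency=3, memory=10, sync={2: (6, 4)})
n, s = t_1_5.sync[2]
print(t_1_5.deadline, sync_satisfied(9, 12, s), sync_satisfied(9, 14, s))
```

Storing the synchronization pairs in a dictionary keyed by application index is one possible choice; the tables or matrices mentioned in the text would serve equally well.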
We mention that there are other alternatives to represent the synchronization requirement among tasks such as tables or matrices. However, the applications' synchronization requirement must be self-consistent, as stated in the following synchronization assumption:
Assumption 2.1.1. (Synchronization assumption). The synchronization requirements of all applications are self-consistent. Suppose that task $T_{ij}$ needs to be synchronized with task $T_{kl}$, where $l = n^k_{ij}$. Then the requirements must be symmetric ($n^i_{kl} = j$ and $s^i_{kl} = s^k_{ij}$) and transitive (if $T_{ij}$ must also be synchronized with a task $T_{pq}$ of a third application $A_p$, where $q = n^p_{ij}$, then $n^k_{pq} = l$ and $s^k_{pq} \le s^k_{ij} + s^p_{ij}$).

The symmetry property enforces that any pair of tasks to be synchronized must have symmetric synchronization requirements. In the preceding example, we have imposed $n^2_{1,5} = 6$ and $s^2_{1,5} = 4$ onto task $T_{1,5}$, which imply that tasks $T_{1,5}$ and $T_{2,6}$ are to be synchronized. Therefore, we need to associate the following symmetry constraint with task $T_{2,6}$: $n^1_{2,6} = 5$ and $s^1_{2,6} = s^2_{1,5} = 4$. The transitivity property enforces the triangle inequality on the level of synchronization among any three (or more) tasks. If we add the synchronization requirement between task $T_{1,5}$ and application $A_3$, $n^3_{1,5} = 7$ and $s^3_{1,5} = 8$, then since both tasks $T_{2,6}$ and $T_{3,7}$ have to be synchronized with the same task $T_{1,5}$, their completion times cannot be arbitrary. The transitivity assumption captures this by requiring $T_{2,6}$ and $T_{3,7}$ to be synchronized and sufficiently tight ($s^2_{3,7} \le s^2_{1,5} + s^3_{1,5} = 4 + 8 = 12$). The preceding synchronization assumption, although stated in a stronger form than necessary for simplicity, captures a necessary condition for the applications to be schedulable. It can be relaxed and still keep the synchronization requirements self-consistent. In the earlier example, it is not mandatory to have $n^2_{3,7} = n^2_{1,5}$. Task $T_{3,7}$ can be synchronized with a task other than $T_{2,6}$ in application $A_2$ as long as the completion times of tasks $T_{1,5}$, $T_{2,6}$, $T_{3,7}$, and $T_{2,\,n^2_{3,7}}$ meet their respective synchronization constraints.
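A self-consistency check in the spirit of this assumption can be sketched as follows. The sketch assumes one particular reading of the two properties, taken from the worked example: symmetry means the partner task points back with the same level, and transitivity means that two partners of the same task are synchronized with each other at a level no larger than the sum of the two levels. The dictionary layout and the function name `is_self_consistent` are hypothetical.

```python
def is_self_consistent(spec):
    """spec[(i, j)][k] = (n, s): task j of application i must finish within
    s time units of task n of application k."""
    for (i, j), partners in spec.items():
        # Symmetry: the partner must point back to (i, j) with the same level.
        for k, (n, s) in partners.items():
            if spec.get((k, n), {}).get(i) != (j, s):
                return False
        # Transitivity (strong form): any two partners of the same task are
        # synchronized with each other, within the sum of the two levels.
        for k, (nk, sk) in partners.items():
            for p, (np_, sp) in partners.items():
                if k == p:
                    continue
                back = spec.get((k, nk), {}).get(p)
                if back is None or back[0] != np_ or back[1] > sk + sp:
                    return False
    return True


# The example from the text: T_{1,5}, T_{2,6}, T_{3,7} with levels 4, 8, 12.
spec = {
    (1, 5): {2: (6, 4), 3: (7, 8)},
    (2, 6): {1: (5, 4), 3: (7, 12)},
    (3, 7): {1: (5, 8), 2: (6, 12)},
}
print(is_self_consistent(spec))
```

Loosening the level between $T_{2,6}$ and $T_{3,7}$ to 13 on both sides would make the check fail, because 13 exceeds 4 + 8.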
We take the following general assumption on the system's service model:
Assumption 2.1.2. (Service assumption). The single processor system can start servicing a task as soon as it arrives and can free the memory occupied by this task once it has been served. Tasks in the same application are served in the first come first serve (FCFS) fashion. Preemption is allowed, and we neglect the overhead for context switches between different applications.

• G. Qu and M. Potkonjak
Problem Formulation
Our goal is to illustrate the design methodology that optimizes traditional design targets and provides QoS guarantees. We choose the size of storage (on-chip memory) and silicon area as design optimization objectives and synchronization and task completion as QoS metrics. In particular, we formulate and study the following three problems:
Problem 2.2.1 (Storage minimization with QoS guarantees). Given N applications, find the minimum storage requirement for a single processor system such that the system can schedule all tasks in the applications and complete them under service assumption 2.1.2 and satisfy all the latency and synchronization constraints that meet the synchronization assumption 2.1.1.

Problem 2.2.2 (Task completion maximization with limited storage). Given N applications and a single processor system with a fixed amount of storage, find an online scheduler to maximize the number of completed tasks under the latency and synchronization constraints.
Problem 2.2.3 (SoC design with minimum silicon area for QoS guarantees). Given N applications, design a system-on-chip (i.e., the type of processor core, configuration of I-cache, D-cache, and on-chip memory) with the minimum silicon area to provide guarantees to the applications' QoS requirements.
According to the knowledge about the incoming data streams (applications), we can classify problem 2.2.1 as (i) a problem with full knowledge, where all the information about the applications is known a priori (for instance, offline applications or periodic data streams with samples of one period); (ii) a problem with partial knowledge, where certain statistical information about the data stream is given; and (iii) a problem with zero knowledge, where nothing is known before the stream actually arrives. It is clear that less storage will be required as more information about the applications is known. However, without complete knowledge of the incoming data streams, it is unavoidable to drop tasks due to missed deadlines, violation of synchronization constraints, or lack of storage. The dropped tasks will not meet their timing constraints, and this implies that solutions may only exist for problem 2.2.1 with full knowledge. We restrict our discussion to this case and aim to find the minimum storage and an (offline) scheduler.
When full knowledge is not available, we relax problem 2.2.1 in the following sense: can we guarantee the completion of all tasks regardless of memory requirements (or assuming an infinite amount of storage is available)? This becomes the classical (online) scheduling problem. Even in this case, Baruah et al. [1994] show that, in general, any online scheduling algorithm can perform arbitrarily worse (in terms of the number of task drops) than an offline scheduler. Hence, no absolute guarantees of the timing constraints (latency, synchronization, etc.) can be expected. This leads us to the introduction of the second QoS metric, the task completion rate. In problem 2.2.2, we consider a given system processing real-time applications and seek an online scheduler that minimizes the number of dropped tasks.
Finally, we formulate the processor vs. memory trade-off in problem 2.2.3. As we discussed earlier, the processor configuration and on-chip memory compete for silicon area. Our goal here is to build the smallest chip to deliver QoS guarantees.
Key Results
We studied the above problems in this paper, and our key results are as follows:
(1) We solve problem 2.2.1 optimally when the full knowledge of the applications is available. Our dynamic programming-based algorithm finds the minimum storage requirement to provide the QoS guarantees as well as a task scheduler to achieve such guarantees with this minimum storage requirement. When the number of applications is fixed, both the runtime and the space complexity of the algorithm are polynomial in the number of tasks in each application.
(2) For problem 2.2.2, we propose a parametric approach, where the distribution of the tasks is known and the online scheduler estimates the parameters of the distribution based on the history and scheduled tasks, assuming that history will repeat itself in the future. Simulation shows that our online scheduler can completely avoid task drops due to overflow with less than 10% extra storage compared to the optimal solution.
(3) We propose a two-phase design methodology for problem 2.2.3: selection of hardware configuration and storage minimization via task scheduling. In the first phase, we investigate the combinations of different processor cores, I-cache, and D-cache setups. We define a dominance relationship among the possible SoC configurations and exclude the dominated combinations. In the second phase, for each nondominated configuration (Pareto point), we use our solutions to problems 2.2.1 and 2.2.2 to calculate the minimum storage requirement. Then the overall system performance is evaluated and the best configuration (with the smallest silicon area) is selected.
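The exclusion of dominated configurations in the first phase can be sketched as a Pareto filter. The sketch assumes that each configuration is summarized by two numbers, silicon area and execution time, both to be minimized, and that one configuration dominates another when it is no worse in both and strictly better in at least one. The configuration names and values are hypothetical.

```python
def pareto_points(configs):
    """configs: list of (name, area, exec_time); smaller is better for both.
    Returns the configurations that no other configuration dominates."""
    def dominates(a, b):
        no_worse = a[1] <= b[1] and a[2] <= b[2]
        strictly_better = a[1] < b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical library of core/cache combinations.
library = [
    ("core_a", 10, 5.0),
    ("core_b", 12, 5.0),  # larger than core_a but not faster: dominated
    ("core_c", 14, 3.0),
    ("core_d", 9, 7.0),
]
survivors = [name for name, _, _ in pareto_points(library)]
print(survivors)
```

The quadratic comparison is adequate for a small component library; only the surviving configurations need to go through the storage minimization phase.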
SCHEDULING TECHNIQUES FOR SYNTHESIS
In this section, we first present an optimal algorithm to solve problem 2.2.1 with full knowledge. The requirement of a priori information of all the applications can be met under several real-life circumstances such as offline applications and online periodic applications. We describe our approach in a simplified case and then discuss how this basic algorithm can be modified to handle the general case. We show the complete algorithm by a small example of two applications. We also present an online heuristic for problem 2.2.2.
The Basic Algorithm for Storage Minimization
We describe our storage minimization algorithm for the simplest case in Figure 1, where we have only two applications $A_1$, $A_2$. Each application has a task at the [...] choices and we want to find one that requires the least amount of storage. The runtime of an exhaustive search will be $O(4^T)$. The dynamic programming-based algorithm we develop has both runtime and space complexity $O(T^2)$. The proposed algorithm has three steps. In step 1, a $T \times T$ instant memory requirement (IMR) table is built whose $(i, j)$ entry contains the storage requirement at time $i + j$ when $i$ and $j$ slots have been assigned to $A_2$ and $A_1$, respectively. In step 2, the $T \times T$ aggregate memory requirement (AMR) table is built based on the IMR table, where the $(i, j)$ entry contains the minimum amount of storage that guarantees a path from entry $(0, 0)$ to $(i, j)$. Note that the value of entry $(T, T)$ is the minimum storage we require. In step 3, a feasible scheduler is found by tracing backward from entry $(T, T)$ to $(0, 0)$. Here a scheduler is defined as a path from the upper left corner to the lower right corner in the IMR/AMR table, where the only legal moves are moving down or moving to the right.

Correctness of the Algorithm

PROOF (of Lemma 3.1.1). At time instant $k = i + j$, the system has received tasks from both $A_1$ and $A_2$ whose arrival time is $\le k$. Since the system uses $i$ slots for $A_2$ and $j$ for $A_1$, the memory occupied by the first $j$ and $i$ tasks from $A_1$ and $A_2$ can be freed according to the service assumption 2.1.2. Therefore, the entry $(i, j)$ of the IMR table should have a value equal to the total size of the tasks that have already arrived but are not yet completed, which is what equation (*) computes.
Note that the instant memory requirement is path-independent. That is, i slots and j slots have been assigned to applications A 2 and A 1 respectively, but it does not matter to which each specific slot has been assigned, that is, which path the scheduler follows to reach entry (i, j ) from (0, 0).
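The IMR table can be sketched under simplifying assumptions that are not taken verbatim from the article: each application releases one task at the start of every slot 0, ..., T-1, every task takes one unit of time, and tasks are served first come first serve. Under these assumptions, entry (i, j) sums the sizes of the tasks that have arrived by time i + j and are not among the first j tasks of $A_1$ or the first i tasks of $A_2$. This direct summation stands in for equation (*), whose exact recursive form is not reproduced here; the function name and sizes are hypothetical.

```python
def imr_table(m1, m2):
    """m1[q], m2[q]: memory of the q-th task of A1 and A2 (released at slot q).
    Entry (i, j): i slots given to A2, j slots given to A1, time k = i + j."""
    T = len(m1)
    table = [[0] * (T + 1) for _ in range(T + 1)]
    for i in range(T + 1):
        for j in range(T + 1):
            k = i + j
            arrived = min(k + 1, T)          # tasks 0..k have arrived
            pending1 = sum(m1[j:arrived])    # A1 tasks arrived, not served
            pending2 = sum(m2[i:arrived])    # A2 tasks arrived, not served
            table[i][j] = pending1 + pending2
    return table


# Hypothetical task sizes for T = 2.
imr = imr_table([3, 1], [2, 4])
for row in imr:
    print(row)
```

The value of an entry depends only on (i, j), not on the order in which the slots were handed out, which is the path independence noted above.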
LEMMA 3.1.2. (the AMR table) Equation (**) finds the minimum memory requirement up to time instant $k = i + j$.
PROOF. The value $AMR_{i,j}$ in the entry $(i, j)$ of the AMR table guarantees that (i) there is no overflow at time $k = i + j$; and (ii) there exists a path from entry $(0, 0)$ to $(i, j)$ without crossing any entry with a value larger than $AMR_{i,j}$.

From (i), $AMR_{i,j}$ has to be large enough to store the unfinished tasks, that is,
$$AMR_{i,j} \ge IMR_{i,j}.$$
Any path from entry $(0, 0)$ to $(i, j)$ has to visit either entry $(i-1, j)$ or entry $(i, j-1)$. To guarantee a feasible path, we must have
$$AMR_{i,j} \ge \min\{AMR_{i-1,j},\ AMR_{i,j-1}\}.$$
Therefore, a lower bound for the value $AMR_{i,j}$ is
$$\max\{IMR_{i,j},\ \min\{AMR_{i-1,j},\ AMR_{i,j-1}\}\}.$$
THEOREM 3.1.3. The algorithm in Figure 1 determines both the minimum storage requirement, which is the value in the lower right corner of the AMR table, and a feasible scheduler.
PROOF. From Lemma 3.1.2, storage in the amount of $AMR_{T,T}$ is necessary. We now show that it is also sufficient by finding a scheduler requiring memory no more than this amount.
In step 3.2 of the algorithm, starting from the lower right corner $(T, T)$, we can move either up or to the left, whichever leads to an entry with value $\le AMR_{T,T}$. This is guaranteed by equation (**). Suppose now we are at entry $(i, j)$; also from equation (**), we have either $AMR_{i-1,j} \le AMR_{i,j}$ or $AMR_{i,j-1} \le AMR_{i,j}$ (or both). Thus, at any time we are able to move upward or left to an entry with value no larger than the current value.
Define $f(i, j) = i + j$ for all $0 \le i, j \le T$. In particular, we have $f(0, 0) = 0$, $f(T, T) = 2T$, and any move in step 3.2 will decrease the value of $f(\cdot, \cdot)$ at the current entry by exactly 1. From the above, we know that a move is always possible from any entry except $(0, 0)$. Hence, after $2T$ moves, we will reach $(0, 0)$, and this gives us a path from entry $(0, 0)$ to $(T, T)$. From the construction of this path, it is clear that no entry on the path has value $> AMR_{T,T}$.

Complexity of the Algorithm

PROOF (of the $O(T^2)$ time and space complexity). Clearly from Lemmas 3.1.1 and 3.1.2, $O(T^2)$ time and $O(T^2)$ space are required to build the IMR/AMR tables. (Actually, one can easily combine equations (*) and (**) to build the AMR table directly. Here we use the intermediate IMR table to explain our approach. However, this does not change the complexity of the algorithm.) Once the AMR table is built, the minimum storage requirement is known as $AMR_{T,T}$, and finding a path needs $2T$ steps as specified in Theorem 3.1.3.
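Steps 2 and 3 can be sketched as follows, assuming that equation (**) sets each AMR entry to the larger of its own IMR value and the smaller of the AMR values of its upper and left neighbors, which is the form suggested by the proof of Theorem 3.1.3. The IMR values are hypothetical and the function names are illustrative only.

```python
def amr_table(imr):
    """Minimax recurrence: AMR[i][j] = max(IMR[i][j], min(up, left))."""
    n = len(imr)
    amr = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            preds = []
            if i > 0:
                preds.append(amr[i - 1][j])
            if j > 0:
                preds.append(amr[i][j - 1])
            amr[i][j] = max(imr[i][j], min(preds)) if preds else imr[i][j]
    return amr


def backtrack(amr):
    """Walk from the lower right corner to (0, 0), always stepping to a
    neighbor whose AMR value does not exceed the current one."""
    i = j = len(amr) - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and amr[i - 1][j] <= amr[i][j]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    return path[::-1]


# Hypothetical IMR table for T = 2.
imr = [[5, 7, 6], [8, 5, 4], [4, 1, 0]]
amr = amr_table(imr)
path = backtrack(amr)
print(amr[-1][-1], path)
```

Both loops touch each table entry a constant number of times, and the backward walk takes exactly 2T steps, which matches the quadratic bound discussed in the text.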
Modifications for QoS Guarantees
In this section we briefly discuss how to modify the above algorithm to meet the QoS guarantees (e.g., latency, synchronization) for general applications (e.g., individual arrival time, latency, execution time) when there is a charge for context switching.
latency:
Adding an individual latency constraint for each task decreases the amount of computation for building the IMR and AMR tables. From the arrival time and latency we have defined the deadline for a task (as the sum of arrival time and latency). When we build the IMR and AMR tables, whenever we detect that an entry may violate the deadline constraint of any task, there is no need to compute the value for this entry and we simply put a special mark on it. For example, if the first task of application $A_1$ has to be finished by time 4, then we mark the entries $(i, 0)$ for all $i \ge 4$, since the deadline is missed if we reach these entries.
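The marking of deadline-violating entries can be sketched for the simplified case of unit-time tasks. The rule used here is an assumption: an entry is marked when some task that is not yet served at that entry already has a deadline at or before the current time i + j. The function name and deadline values are hypothetical.

```python
def deadline_marks(T, d1, d2):
    """d1[q], d2[q]: deadlines of the q-th task of A1 and A2.
    marks[i][j] is True when entry (i, j) cannot meet all deadlines."""
    marks = [[False] * (T + 1) for _ in range(T + 1)]
    for i in range(T + 1):          # slots given to A2
        for j in range(T + 1):      # slots given to A1
            now = i + j
            late1 = any(d1[q] <= now for q in range(j, T))  # unserved A1 tasks
            late2 = any(d2[q] <= now for q in range(i, T))  # unserved A2 tasks
            marks[i][j] = late1 or late2
    return marks


# Hypothetical deadlines: the first task of A1 must finish by time 4.
marks = deadline_marks(5, [4, 99, 99, 99, 99], [99] * 5)
print([i for i in range(6) if marks[i][0]])
```

Marked entries are simply skipped when the tables are filled, so a path is never routed through them.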
synchronization:
Let $f_{1,i}$ and $f_{2,i}$ be the finish times of the $i$th tasks in applications $A_1$ and $A_2$, as in Section 3.1. We say that $A_1$ and $A_2$ are $k$-synchronized if
$$|f_{1,i} - f_{2,i}| \le k$$
holds for all $i$. Like latency, synchronization constraints reduce the number of entries to be filled in both the IMR and AMR tables. For example, if we want a 1-synchronized solution for the problem in Section 3.1, it will be sufficient to fill the entries $(i, i)$, $(i-1, i)$, and $(i+1, i)$, since any of the other entries corresponds to a state where synchronization is violated.
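The synchronization level achieved by a given schedule can be computed directly from the finish times. The sketch assumes unit-time tasks and represents a schedule as a string of slot assignments, which is a hypothetical encoding chosen for illustration.

```python
def finish_times(slots):
    """slots: string over {'A', 'B'}, one unit-time slot per character.
    The q-th task of an application finishes at the end of the q-th slot
    given to that application."""
    fa, fb = [], []
    for end, who in enumerate(slots, start=1):
        (fa if who == "A" else fb).append(end)
    return fa, fb


def sync_level(slots):
    """Smallest k for which the schedule is k-synchronized."""
    fa, fb = finish_times(slots)
    return max(abs(x - y) for x, y in zip(fa, fb))


print(sync_level("ABAB"), sync_level("AABB"))
```

Alternating the two applications keeps the corresponding tasks one slot apart, while serving one application in a block pushes the level up to the block length.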
execution time:
Recall that in the IMR table, entry $(i, j)$ is the memory requirement to store the tasks that have arrived but have not yet been completed. So when tasks need different execution times, we only free the storage for the tasks in $A_1$ that can be finished in $j$ time units and those in $A_2$ that can be finished in $i$ time units. If preemption is not allowed, the index of the tables can be changed to the finish time of a task and may not be consecutive. In this case, the size of the table (and hence the complexity of the algorithm) will be $O(N_1 \cdot N_2)$, where $N_i$ is the number of tasks in application $A_i$.
arrival time:
This case is similar to the case when tasks have individual execution time.
context switch:
There will be a charge for context switching when we make a turn on the path from the upper left corner to the lower right corner. A path with the minimum number of turns can similarly be found by dynamic programming.
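One way to search for a path with few turns is a dynamic program over (entry, last move). The sketch below is an assumption about how such a search could look, not the article's procedure: it restricts the path to entries whose value stays within a given storage bound and counts a turn whenever the move direction changes. The table values and the function name are hypothetical.

```python
def min_turns(table, bound):
    """Fewest direction changes on a down/right path from the upper left to
    the lower right corner that only visits entries with value <= bound.
    Returns float('inf') when no such path exists."""
    n = len(table)
    INF = float("inf")

    def ok(i, j):
        return table[i][j] is not None and table[i][j] <= bound

    if not ok(0, 0):
        return INF
    # best[i][j][d]: fewest turns when (i, j) is entered by move d
    # (0 = down, 1 = right); the start has no incoming move.
    best = [[[INF, INF] for _ in range(n)] for _ in range(n)]
    best[0][0] = [0, 0]
    for i in range(n):
        for j in range(n):
            if (i, j) == (0, 0) or not ok(i, j):
                continue
            if i > 0:
                up = best[i - 1][j]
                best[i][j][0] = min(up[0], up[1] + 1)
            if j > 0:
                left = best[i][j - 1]
                best[i][j][1] = min(left[1], left[0] + 1)
    return min(best[n - 1][n - 1])


# Hypothetical IMR values; None would denote an entry marked as infeasible.
imr = [[5, 7, 6], [8, 5, 4], [4, 1, 0]]
print(min_turns(imr, 7), min_turns(imr, 4))
```

The state space is twice the size of the table, so the search keeps the same asymptotic cost as building the table itself.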
From the above discussion, we immediately have:
THEOREM 3.1.4. The algorithm in Figure 1 can be modified to solve the problem when each individual task has its own arrival time, execution time, and latency and synchronization requirements. Also, a scheduler with a minimum number of context switches can be found. Moreover, the complexity of the algorithm will not increase.
N applications:
If there are N applications instead of only 2, we have to build N-dimensional tables to find the optimal solution. For example, if we have three applications, we can extend equation (**) as follows and build the three-dimensional AMR table:
$$AMR_{i,j,l} = \max\{IMR_{i,j,l},\ \min\{AMR_{i-1,j,l},\ AMR_{i,j-1,l},\ AMR_{i,j,l-1}\}\}.$$
This, of course, will not change the correctness of the algorithm but will increase the complexity. In particular, if the $i$th application has $n_i$ tasks and requires $k_i$ units of time, the complexity will be $O(\prod_{i=1}^{N} k_i)$ in the preemption case and $O(\prod_{i=1}^{N} n_i)$ when preemption is not allowed.
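The N-dimensional table can be sketched with index tuples. The recurrence below is an assumed generalization of the two-application case: each entry takes the larger of its own instant requirement and the smallest aggregate value among the entries that precede it by one step along a single dimension. The table contents and the function name are hypothetical.

```python
from itertools import product


def amr_nd(imr, shape):
    """imr: dict mapping index tuples to instant memory requirements.
    shape: table size along each application's dimension."""
    amr = {}
    # product() yields tuples in lexicographic order, so every predecessor
    # (one coordinate smaller) is computed before the entry that needs it.
    for idx in product(*(range(s) for s in shape)):
        preds = [
            amr[idx[:d] + (idx[d] - 1,) + idx[d + 1:]]
            for d in range(len(shape))
            if idx[d] > 0
        ]
        amr[idx] = max(imr[idx], min(preds)) if preds else imr[idx]
    return amr


# Hypothetical 2 x 2 x 2 table for three applications.
imr3 = {
    (0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 5, (0, 1, 1): 4,
    (1, 0, 0): 9, (1, 0, 1): 3, (1, 1, 0): 1, (1, 1, 1): 0,
}
amr3 = amr_nd(imr3, (2, 2, 2))
print(amr3[(1, 1, 1)])
```

The number of entries is the product of the dimension sizes, which is where the product in the complexity bound comes from.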
A 2-Application Example
We use a small example to illustrate the complete algorithm and compare it to the widely used earliest deadline first (EDF) scheduling policy. Suppose that there are two applications, A and B, to be processed on a single processor. Each application consists of a sequence of tasks with memory storage, CPU time, and latency requirements as shown in Table I. For simplicity, we assume that each task takes exactly 1 unit of CPU time for execution and that tasks $A_i$ and $B_i$ need to be synchronized as well as possible (please refer to the paragraph on synchronization in Section 3.2 for the definition of k-synchronized).
We first construct the instant memory requirement (IMR) table (Figure 2(a)), where the entry $(i, j)$ indicates the total storage requirement at the end of time $i + j$ when $i$ CPU units are assigned to B and $j$ CPU units to A. The table is filled row by row and left to right by the recursive equation (*). For example, if we give one of the first 3 CPU units to B and the rest to A, entry $(1, 2)$ will be filled by (aggregate memory request for both A and B by time 3) − (memory assigned to the [...] Table I, we know that tasks $A_0$, $A_1$ and $B_0$ have to be completed by this time; therefore, any scheduler that reaches entries $(0, 4)$, $(3, 1)$, or $(4, 0)$ will fail to satisfy all the latency constraints. Then the aggregate memory requirement (AMR) table is built from equation (**), and the lower right corner indicates the minimum storage requirement to satisfy all these timing constraints (Figure 2(b)).
A schedule is a path from the upper left corner $(0, 0)$ to the lower right corner. At any entry, the schedule moves either one step to the right or one step down, and assigns the next CPU time to either A or B, respectively. Figure 2(b) shows a schedule that meets the deadlines of all tasks and is 3-synchronized. There are only two context switches: at the start of slot 4, switching from application A to B, and then switching back when B is completed at slot 10. Figure 2(c) shows that a 2-synchronized schedule is also possible at the cost of more storage and context switches. The earliest deadline first (EDF) policy [Liu and Layland 1973] always selects the task with the earliest deadline. We schedule A and B by EDF with different tie-break strategies. (A tie occurs when there are two or more tasks that have the same deadline.) In EDF$_1$, a tie is broken to minimize the number of context switches; in EDF$_2$, whenever there is a tie, we choose the task that occupies more memory. In this example, both EDF$_1$ and EDF$_2$ serve the two applications with a minimum storage requirement of 93 and are 3-synchronized, as shown in Figure 2(d). A comparison of the previous four schedulers is given in Table II. Our offline optimal scheduler (Figure 2(b)) results in a path requiring only 74 memory units and 2 context switches, yet achieves the same level of synchronization as both EDF schedulers. A 2-synchronized schedule is also possible, as shown in Figure 2(c), at the cost of 10 more memory units than the 3-synchronized solution. However, it is still better than both EDF schedulers. One can easily see from Table II that scheduling policies can affect the QoS and that better synchronization can be achieved at the expense of extra storage and context switches.
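The two EDF variants can be sketched for unit-time tasks as follows. This is a simplified illustration, not the setup of Table I: the task list is hypothetical, storage is measured at the start of each slot, and missed deadlines are not handled. The `tie_break` parameter name and its two values are assumptions made for the sketch.

```python
def edf(tasks, tie_break="memory"):
    """tasks: list of dicts with keys name, app, arrival, deadline, memory;
    every task takes one slot. Returns (service order, peak storage)."""
    pending = sorted(tasks, key=lambda t: t["arrival"])
    ready, order, peak, now, last_app = [], [], 0, 0, None
    while pending or ready:
        while pending and pending[0]["arrival"] <= now:
            ready.append(pending.pop(0))
        peak = max(peak, sum(t["memory"] for t in ready))
        if ready:
            def key(t):
                if tie_break == "memory":
                    # among equal deadlines, prefer the larger task
                    return (t["deadline"], -t["memory"])
                # among equal deadlines, prefer staying on the same application
                return (t["deadline"], t["app"] != last_app)
            job = min(ready, key=key)
            ready.remove(job)
            order.append(job["name"])
            last_app = job["app"]
        now += 1
    return order, peak


# Hypothetical tasks, not those of Table I.
tasks = [
    {"name": "A0", "app": "A", "arrival": 0, "deadline": 2, "memory": 5},
    {"name": "B0", "app": "B", "arrival": 0, "deadline": 2, "memory": 8},
    {"name": "A1", "app": "A", "arrival": 1, "deadline": 3, "memory": 4},
]
print(edf(tasks, "memory"))
print(edf(tasks, "switches"))
```

Neither variant looks at the synchronization requirements, which is why EDF by itself gives no synchronization guarantee.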
On-Line Heuristics for Real-Time Applications
The previous dynamic programming-based algorithm requires a priori knowledge of all the applications. The computation also becomes expensive as the number of applications increases. Therefore, developing online schedulers becomes a necessity, particularly for real-time applications. However, because of the uncertainty of the arriving tasks and the scheduler's online nature, it becomes theoretically impossible to provide any kind of timing guarantee. For example, Baruah et al. [1994] show that in the general case, even without considering the storage, any online scheduling algorithm can perform arbitrarily worse (in terms of the number of task drops) than an offline scheduler.
Our goal here is to develop online scheduling policies such that the expected task drops will be acceptable with reasonably-sized on-chip storage. In the proposed heuristic, the processor collects information (frame size and execution time) from each application, predicts their statistical behavior, and schedules current tasks on the assumption that history will repeat itself. Clearly, the key challenge is how to estimate the data stream's statistical behavior accurately, which, to a great extent, determines the effectiveness of our online scheduler. Fortunately, there have already been many discussions on the characteristics of MPEG video streams [Heyman et al. 1994; Bavier et al. 1998; Jabbari et al. 1993; Krunz et al. 1995; Lazar et al. 1993; Reininger et al. 1994; Krunz and Tripathi 1997]. We use the models proposed by Krunz and Tripathi [1997] and by Bavier et al. [1998] to predict the frame size and execution time for the variable-bit-rate MPEG video streams.
Empirical studies by Krunz and Tripathi [1997] indicate that (i) the scene length distribution can be appropriately fitted by an exponential (or geometric) distribution; (ii) the size of an I frame can be modeled by a sum of two random components: a scene-related component and an AR(2) component that accounts for the fluctuations within a scene; and (iii) the sizes of P and B frames can be characterized by two lognormal distributions with different parameters. Bavier et al. [1998] suggest predicting the execution time of MPEG frames based on frame size and frame type. They find that it is possible to construct a linear model of MPEG decode time with $R^2$ values of 0.97 or higher. The corresponding prediction is accurate to within 25% of the actual decode time, despite the great variability of MPEG decoding time.
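A linear model of decode time against frame size, together with its $R^2$ value, can be fitted by ordinary least squares. The sketch below fits a single line for one frame type; the function name and the sample data are hypothetical, and the exact form of the model used by Bavier et al. is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical (frame size, decode time) samples for one frame type.
sizes = [10, 20, 30, 40, 50]
times = [2.2, 3.9, 6.1, 8.0, 9.8]
slope, intercept, r2 = fit_line(sizes, times)
predicted = slope * 35 + intercept  # predicted decode time for a new frame
print(round(r2, 3))
```

An online scheduler could refit or update such a model as new frames arrive, using the prediction wherever the actual decode time is not yet known.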
Another important component of the online scheduler is the drop policy, that is, which task(s) the system will drop if there is not sufficient memory for all the tasks (overflow occurs). For each MPEG frame $T$, we assign a weight $w(T)$ upon its arrival by formula (***), where $k$ is the number of frames that $T$ needs to synchronize with, $m$ and $e$ are $T$'s frame size and execution (decode) time, $M$ and $E$ are the size and decode time of an average-sized frame of the same type as $T$, and $\alpha_I > \alpha_P > 0$ are two constants that are added to $w(T)$ if and only if frame $T$ is an I frame or a P frame, respectively. This comes from the following observations:
(i) If a task is dropped, all tasks that have to be synchronized with it will be forced to drop as well.
(ii) The I frame is the most important frame in an MPEG video stream. An I frame drop will make the decoding of its trailing P/B frames incorrect until the next I frame. A P frame drop will also affect the decoding of its neighboring B frames. $\alpha_I$ and $\alpha_P$ are two parameters that measure how many frames will be affected by the drop of an I/P frame.
(iii) In case of overflow, the best way to free memory and to keep the task drop rate low is to drop tasks that occupy large amounts of memory or require long execution times.

Figure 3 outlines the proposed online scheduler for a system with limited storage. It will be executed on the completion of a task or on the arrival of a new task, until all the tasks are scheduled. In step 2, we drop obsolete tasks and those that have no chance to be completed (those that need a longer execution time than the remaining time between the current time and their deadlines). In step 3, we assign weights to the new arrivals based on formula (***). This weight will not be changed during the task's entire lifetime. Whenever overflow occurs (or is predicted), tasks with the smallest weight will be dropped first in step 4. We apply a parametric method to predict the size and execution time of future frames using the distributions given by Krunz and Tripathi [1997] and Bavier et al. [1998]. On each new arrival, we get its size and execution time. We then verify and update the parameters for the corresponding model as shown in step 5. Finally, the scheduler selects one task to execute in step 6. The decision is made from the following information: (i) the timing constraints of all the unscheduled tasks; (ii) the occupied storage; and (iii) the prediction that the upcoming tasks will have the average size and require the average execution time according to the most recently updated statistical models.
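The overflow handling of step 4 can be sketched as follows, taking the weights as given inputs rather than computing them, since the weight formula is not reproduced in this sketch. The tuple layout, the function name, and the numbers are hypothetical, and the forced drop of frames synchronized with a dropped frame is not modeled.

```python
def drop_on_overflow(stored, capacity):
    """stored: list of (weight, memory, name) for the frames held in memory.
    Drops the smallest-weight frames until the rest fits in `capacity`.
    Returns (kept, dropped)."""
    kept = sorted(stored, key=lambda t: t[0], reverse=True)
    dropped = []
    while kept and sum(t[1] for t in kept) > capacity:
        dropped.append(kept.pop())  # the smallest weight sits at the end
    return kept, dropped


# Hypothetical frames: I and P frames carry larger weights than B frames.
stored = [(5.0, 40, "I1"), (1.2, 30, "B1"), (2.5, 20, "P1"), (0.8, 35, "B2")]
kept, dropped = drop_on_overflow(stored, capacity=100)
print([t[2] for t in kept], [t[2] for t in dropped])
```

Because the weight is assigned once on arrival and never changes, keeping the stored frames in a priority queue ordered by weight would avoid the repeated sorting in a longer-running scheduler.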
SYNTHESIS FOR QOS GUARANTEED SOC DESIGN
In this section we describe the global flow of the proposed synthesis system and explain the function of each subtask and how they are combined into a synthesis system. Figure 4 depicts the global flow of the proposed synthesis approach. For a set of given applications with their QoS requirements, our goal is to select a processor core, configure the I/D cache, and determine the size of on-chip memory to provide QoS guarantees. To accurately predict the system's performance for target applications, we employ the approach which integrates the optimization, simulation, modeling, and profiling tools. The synthesis technique considers each nondominated microprocessor core and competitive cache configuration and selects the hardware setup that requires minimum silicon area and meets all the QoS requirements of the applications.
Starting with a library of processor cores, I-cache, and D-cache configurations, we identify all the nondominated hardware configurations (Pareto points of performance and other design goals such as power and silicon area) based on the characteristics of the given applications. Then, for each such system setup, coupled with the detailed information of the applications, we determine the minimum storage requirement and a task scheduler to fulfill the QoS demand. The last step is to conduct an overall system performance estimation and pick the configuration that optimizes our design goal. For example, we choose the one that uses the smallest silicon area if chip size is the primary design optimization target.
A typical modern application-specific system-on-chip consists of microprocessor core(s), instruction cache, data cache, hardware accelerators, control blocks, on-chip memory, etc. Several factors combine to influence the system performance: processor performance, I/D cache miss rates and miss penalty, and clock speed. In particular, we compute the system performance by the following formula for cycles per instruction (CPI):
where f is the system clock frequency and MIPS is the processor's performance in millions of instructions per second. Caches found in current embedded multimedia systems range from 4 KB to 32 KB. Although larger caches correspond to higher hit rates, they occupy a larger silicon area as well, resulting in a design trade-off, particularly when chip size is one of the primary design concerns. We consider only direct-mapped caches because higher cache associativity results in significantly higher access time. We experimented with two-way set associative caches, but they did not dominate in a single case. Cache line size is a variable in our experimentation. Its variation corresponds to the following trade-off: larger line size results in less hardware and area, together with a higher cache miss penalty. We used CACTI [Wilton and Jouppi 1996] as a cache delay estimation tool with respect to the main cache parameters: size, associativity, and line size. A sample of the cache model data is given in Table III.
Data on microprocessor cores was extracted from the manufacturers' datasheets and the CPU Info Center website (http://infopad.eecs.berkeley.edu/CIC/). A sample of the collected data is presented in Table IV. The table presents the embedded microprocessor core operating frequency, MIPS performance, technology, and area.
Given a fixed choice of processor core and caches, we can calculate the execution time for a given task. The execution time impacts the amount of memory needed to service the applications. A long execution time implies a large on-chip memory to store the tasks that have arrived but have not yet been executed. The application-driven search for a core-based system requires usage of trace-driven cache simulation for each promising point considered in the design space. We attack this problem by carefully scanning the design space using search algorithms with sharp bounds and by providing powerful algorithmic performance techniques.
We use the system performance and simulation platform based on SHADE, DINEROIII, and a custom analyzer [Kirovski et al. 1997]. We conduct an exhaustive search over all the processor cores, I-cache sizes (from 512 bytes to 32 KB), D-cache sizes (from 4 KB to 32 KB), and cache line sizes (from 8 bytes to 512 bytes). For each combination, we estimate the system performance and area. One configuration dominates another if it uses less area and results in the same or better system performance (in terms of CPI). The nondominated system configurations (Pareto points of performance and area in this case) are kept, and task scheduling is performed on these configurations to identify the most area-efficient design. This approach is similar to that of Hong et al. [1999], who searched for the power-performance Pareto points.
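The filtering step described above can be sketched as a small Pareto filter in Python. This is an illustrative sketch, not the authors' tool: the `Config` record and the sample numbers are hypothetical, and the dominance test uses the common convention "no worse in both area and CPI, strictly better in at least one", so that two identical configurations do not eliminate each other.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    area: float   # silicon area of core plus I/D cache (hypothetical mm^2)
    cpi: float    # cycles per instruction, lower is better


def dominates(a, b):
    """True if a is no worse than b in area and CPI and strictly better in one."""
    no_worse = a.area <= b.area and a.cpi <= b.cpi
    strictly_better = a.area < b.area or a.cpi < b.cpi
    return no_worse and strictly_better


def pareto_points(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


if __name__ == "__main__":
    candidates = [
        Config("A", 10.0, 2.0),
        Config("B", 12.0, 1.5),
        Config("C", 12.0, 2.2),   # dominated by A
        Config("D", 15.0, 1.5),   # dominated by B
        Config("E", 8.0, 3.0),
    ]
    print([c.name for c in pareto_points(candidates)])
```

The quadratic scan is adequate for an exhaustive search over a modest library of cores and cache setups; only the surviving points need to be passed on to the storage-minimization phase.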
We measure the chip size by the total silicon area occupied by the processor core, I/D cache, and on-chip memory. In general, a high-performance system has a fast processing speed and thus requires less storage than a low-performance system to provide the same QoS guarantees. The silicon area for storage is proportional to the size of the on-chip memory; therefore a dominated system will always consume more silicon area than those that dominate it. Consequently, we only need to consider the nondominated system configurations. In the hexagon of Figure 4, we apply the task-scheduling techniques that we discussed in Section 3 to determine the minimum on-chip memory size for each Pareto point to meet the application's QoS requirement. This has to be done for each nondominated hardware configuration because task execution time varies with the hardware configuration. Once we have determined the storage requirement, the best design can be found by an overall system performance estimation, in particular, via the estimation of total silicon area.
SIMULATION RESULTS
We use simulated MPEG video streams as the target multimedia application and the microprocessor cores reported in Table IV as the pool for our hardware selection. We first explain the method to simulate the frame information of MPEG video streams. Then, for a reference system, we report the memory requirement by our scheduling technique to provide synchronization guarantees. A comparison with the EDF policy shows the effectiveness of our approach. Next, we briefly discuss the selection of hardware configurations. Finally, we present the results on the proposed online heuristics for real-time video streams.
Simulation of MPEG Streams
We test the proposed algorithms on MPEG video streams. Table V presents the sizes of the compressed frames of four MPEG-encoded video movies, where "Frames" is the total number of frames in the movie, and "I-to-I" and "I-to-P" are the I-to-I and I-to-P frame distances, respectively. Standard MPEG encoders generate three types of compressed frames: I frames (intra-pictures), P frames (predicted pictures), and B frames (bidirectional predicted pictures). On average, I frames are the largest in size (because they are self-contained), followed by P frames and B frames. Krunz and Tripathi [1997] present a comprehensive model for MPEG video streams. This model captures the bit-rate variations at multiple time scales. Long-term variations are captured by incorporating scene changes, which are noticeable in the fluctuations of I frames. In particular, the frame sizes of different types of frames are simulated by three different submodels that are intermixed according to the group-of-pictures pattern. Statistically, the generated MPEG streams fit the empirical video and are sufficiently accurate in predicting the queuing performance for real video streams. From the parameters given in Krunz and Tripathi [1997] we simulate the above four video movies, and the information on the generated frames is reported in Table VI. (The frame size of I frames has a relatively large standard deviation because it is modeled by the sum of two random components.)
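How the three submodels are intermixed can be illustrated by generating the frame-type sequence from the two distances reported in Table V. The Python sketch below covers only the group-of-pictures pattern; it assumes that the I-to-P distance is also the spacing between consecutive P frames, which is the usual MPEG convention but is not stated explicitly here. The function names are hypothetical, and the per-type size distributions of Krunz and Tripathi [1997] are not reproduced.

```python
def gop_pattern(i_to_i, i_to_p):
    """Frame types of one group of pictures, e.g. (12, 3) gives IBBPBBPBBPBB."""
    if i_to_i <= 0 or i_to_p <= 0:
        raise ValueError("distances must be positive")
    types = []
    for pos in range(i_to_i):
        if pos == 0:
            types.append("I")            # every group starts with an I frame
        elif pos % i_to_p == 0:
            types.append("P")            # assumed: P frames every i_to_p frames
        else:
            types.append("B")
    return "".join(types)


def frame_types(total_frames, i_to_i, i_to_p):
    """Repeat the pattern to label every frame of a stream with its type."""
    gop = gop_pattern(i_to_i, i_to_p)
    return [gop[n % i_to_i] for n in range(total_frames)]


if __name__ == "__main__":
    print(gop_pattern(12, 3))
    print("".join(frame_types(14, 12, 3)))
```

A stream generator would then draw each frame's size from the submodel that matches its type label.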
Offline Optimal Scheduler
To demonstrate the proposed scheduling technique's advantages in saving memory and providing synchronization guarantees, we conduct the following experiment for each of the above MPEG video movies. First, we apply our offline optimal scheduler to find the storage requirements under a virtual reference system configuration to achieve various levels of synchronization. This is repeated four times, for no synchronization, 2-sync, 4-sync, and 8-sync, respectively. Then, to compare the storage requirements, we implement the earliest-deadline-first (EDF) scheduling policy under the same reference system. We experiment with two EDF policies, in which a tie is broken by the largest memory task first and by the least (remaining) execution time task first, respectively. However, the two policies do not give significantly different solutions in terms of storage requirement. The offline optimal storage requirements are normalized with respect to those for the EDF policy, as shown in Figure 5. The EDF policy picks the most urgent task first, and a tie is normally broken randomly. Therefore, it cannot guarantee any level of synchronization unless the synchronization requirement coincides with the task's deadline. In our simulation, the solution by EDF is 12- to 15-synchronized. Our algorithm, which is based on Figure 1 with a simple modification for QoS constraints as discussed in Section 3.2, finds the minimum storage requirement for a given hardware configuration. Although the memory saving is not significant (about 6% when no synchronization requirement is specified and less than 4% on average to deliver synchronization), the key contribution of this algorithm is the guaranteed QoS. For each of the above MPEG video movies, our scheduler finds a way to provide services that are 8-, 4-, and 2-synchronized (except that in the last two movies, 2-sync schedules do not exist due to the tight timing constraint). In addition, one can see a clear trend: better synchronization needs more storage.
Finally, in all cases and regardless of the synchronization guarantees, our scheduler requires less memory than EDF. This is a coincidence and one can easily find examples where EDF is the most memory-efficient scheduling policy.
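The EDF baseline and its two tie-breaking rules can be sketched as a small Python simulation that reports the peak storage in use. This is a simplified illustration under assumptions not taken from the paper: a single processor, non-preemptive execution, a frame occupying its storage from arrival until its decoding completes, and deadline misses ignored. The `Job` record and the sample values are hypothetical.

```python
from typing import NamedTuple


class Job(NamedTuple):
    arrival: float
    size: int          # storage held from arrival until decoding completes
    exec_time: float
    deadline: float


def edf_peak_memory(jobs, tie_break="largest_memory"):
    """Peak storage under non-preemptive EDF with the chosen tie-breaking rule."""
    if tie_break == "largest_memory":
        key = lambda j: (j.deadline, -j.size)
    elif tie_break == "least_execution":
        key = lambda j: (j.deadline, j.exec_time)
    else:
        raise ValueError("unknown tie-breaking rule")

    pending = sorted(jobs, key=lambda j: j.arrival)
    ready, now, used, peak, i = [], 0.0, 0, 0, 0
    while i < len(pending) or ready:
        if not ready:                          # idle until the next arrival
            now = max(now, pending[i].arrival)
        while i < len(pending) and pending[i].arrival <= now:
            ready.append(pending[i])
            used += pending[i].size
            i += 1
        peak = max(peak, used)
        job = min(ready, key=key)              # earliest deadline, then tie rule
        ready.remove(job)
        finish = now + job.exec_time
        # Frames arriving during execution are stored before `job` is released.
        while i < len(pending) and pending[i].arrival < finish:
            ready.append(pending[i])
            used += pending[i].size
            peak = max(peak, used)
            i += 1
        now = finish
        used -= job.size
    return peak


if __name__ == "__main__":
    stream = [Job(0, 4, 1, 10), Job(0, 6, 3, 10), Job(2, 5, 1, 20)]
    print(edf_peak_memory(stream, "largest_memory"),
          edf_peak_memory(stream, "least_execution"))
```

On this tiny made-up stream the two tie-breaking rules give different peaks (15 versus 11 units), whereas the experiments above found no significant difference between them on the simulated MPEG streams.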
Selection of Hardware Configuration
We explain how we determine the nondominated hardware configurations, that is, the Pareto points of silicon area and system performance. We measure the silicon area by the size of the processor core and the area needed by the I/D cache. The system's performance is measured by cycles per instruction (CPI), as we discussed in Section 4. We say one hardware configuration (core type and I/D cache setup) dominates another if it achieves at least the same performance with less or the same silicon area. It is sufficient to consider only the nondominated system configurations for the most area-efficient design to deliver synchronization guarantees. We consider 10 different processor cores that are popular for embedded systems. Table IV gives their technology, area, clock frequency, and MIPS. For each processor core, we investigate various I/D cache configurations. In our simulation, the size of I-cache and D-cache varies from 4 KB to 32 KB and cache line size from 8 bytes to 512 bytes. We estimate the cache performance, in particular, the area and miss rate, by the online cache design tool (http://arith.stanford.edu/tools/cachetools.html). For the sample MPEG frames, Table VII lists all the nondominated system configurations when we fix the size of the I-cache and D-cache at 4 KB each, but allow the cache line size to be changed between 8 bytes and 64 bytes. Figure 6 reports details with five Pareto points connected by the thick line.
Online Heuristics
We discussed the design of online heuristics in general terms in Section 3.4. We implement a simple online scheduler based on the pseudo-code in Figure 3. The performance of this online algorithm is evaluated by simulation, and we report the results in terms of memory size and task drop rate.
We assume that certain statistical information about the frames, such as the I-to-I distance, I-to-P distance, average size, and standard deviation for each type of frame discussed in Tables V and VI, is known a priori. Then, based on such information, we generate a sample MPEG stream and run the offline optimal scheduler to determine the (expected) size of storage needed to satisfy the synchronization requirements. Using the same amount of storage, we apply the online heuristics to process the benchmark MPEG video stream in real time. Note that a frame will be dropped by the online scheduler if it is not possible to process the frame before its deadline (adjusted according to its synchronization constraints). Another scenario in which a frame is dropped is when there is not sufficient memory left to store the newly arrived data stream. However, it is not always necessary to drop the newly arrived frame. The drop policy considers each frame's size, execution time, relevance to other frames, and a rough prediction of the next arrival to decide which frame is to be dropped. We then increase the size of the on-chip memory and repeat the same simulation. The number of dropped frames and the reasons for such drops are traced during the simulation. Table VIII reports these results from the simulation on samples of 261 frames for each of the above four MPEG movies.
In Table VIII, we choose the on-chip memory to be the same as that determined by the offline optimal scheduler (which guarantees the completion of all frames), 5% larger, 10% larger, and 20% larger than that size, respectively. For each case, we simulate the proposed online heuristic and compare its performance, in terms of the frame drop ratio, to an offline EDF policy. When we use the same size, the EDF policy sees 7% of the frames being dropped due to the burstiness of the MPEG data stream. The online heuristic suffers an average frame drop rate of 9%. However, the frame drop rate can be (greatly) reduced by increasing the size of the storage. As the simulation results indicate, with 10% extra storage the online heuristic completely eliminates the frame drops due to the lack of storage and becomes competitive with the offline EDF scheduler. If we continue to increase the size of the storage, as shown in the last column, the scheduler cannot reduce the frame drop further. This is because at this point, all the frame drops are caused by missing the application's timing constraints (either latency or synchronization). Using more storage alone will not help complete more frames unless we increase the computing power at the same time. Finally, we mention that our online heuristic considers both latency and synchronization constraints, whereas the offline EDF considers deadline/latency alone and does not give any synchronization guarantees.
RELATED WORK
In this section we survey the previous efforts in the following relevant areas: delivery of synchronization guarantees in multimedia applications, QoS modeling and measurement, system design for QoS, and storage minimization by task scheduling.
The problem of how to deliver such synchronization guarantees has received a lot of attention from the communication and multimedia communities. It has been studied as an operating system delivery problem, a physical disk modeling problem, a physical data organization problem, a conceptual database problem, and a real-time CPU scheduling problem [Anderson and Homsy 1991; Baqai et al. 1996; Chen and Little 1996; Herman et al. 1998; Liu and Layland 1973; Panzieri and Roccetti 1997]. In particular for MPEG streams, Cen et al. [1995] provide lip synchronization in an MPEG player by simultaneously displaying audio and video frames with the same sequence number. Qiao and Nahrstedt [1997] design a fine-grain lip-sync algorithm that first estimates the audio playback and the video decoding times and then adopts a selective dropping policy for each type of frame (I, P, or B).
Synchronization is not the only metric for the quality of service (QoS) of multimedia applications. How to measure the QoS has been a fundamental and challenging problem by itself. The quality of the complex real-time, distributed multimedia services should be application-specific and user-dependent; thus, it is hard to find an explicit one-fits-all definition for QoS. Little and Ghafoor [1990] define the QoS for multimedia communication as a combination of speed ratio, utilization, average delay, maximum jitter, maximum bit error rate, and maximum packet error rate. Lawrence [1997] discusses the metrics based on the QoS attributes of timeliness, precision, and accuracy that can be used for system specification, instrumentation, and evaluation. Kornegay et al. [1999] define QoS of the implementation of an application as a function of the properties of the application and its implementation as observed by the user and/or the environment. Cruz [1995] and Sariowan et al. [1995] introduced the arrival curve and service curve in the context of packet-switched networks. From these curves, one can view QoS in terms of backlog, transmission delay and throughput. The problem of satisfying service guarantees becomes a scheduling problem to meet the backlog and latency constraints. Rajkumar et al. [1997] present an analytical approach to satisfying multiple QoS dimensions in a resource-constrained environment. However, the real-time nature of the multimedia applications with human beings as the end users makes synchronization one of the most important QoS metrics and a primary design concern of multimedia systems. We see that synchronization has been discussed in both of the recently proposed multimedia standards: the MHEG (Coded Representation of Multimedia and Hypermedia Information) and the HyTime (Hypermedia Interchange Standard) [ISO/IEC 13522-1 standard; Markey 1992] .
Systems design traditionally focuses on the optimization of objectives such as the minimization of power consumption, area, and cost, and the maximization of throughput, testability, and scalability. How to provide such application-specific QoS guarantees has not received the attention that it deserves in the system design community. Kornegay et al. [1999] illustrate the interaction between QoS and synthesis and compilation tasks and discuss the synthesis issues related to the design of QoS-sensitive systems. Qu and Potkonjak [2000] consider the issue of QoS system design with a focus on how to minimize energy consumption with the QoS guarantees. Key to their techniques is dynamic voltage scaling. A similar idea has been applied to finding the minimum buffer size that maximizes the energy saving for multimedia applications [Im et al. 2001] and to minimizing energy consumption under a limited-size buffer to deliver timing QoS guarantees on battery-operated systems [Manzak and Chakrabarti 2001].
Several task-scheduling approaches have been proposed to take the memory issues into account. Research in the context of real-time scheduling suggests that a proper scheduler with certain knowledge of the upcoming applications requires less storage [Chen and Little 1996; Liu and Layland 1973] . Ade et al. [1994] give upper bounds on the minimum buffer memory requirement for certain synchronous applications. However, their upper bounds are quite loose, since they target the minimum buffer memory for all valid schedules, and those that minimize memory may require much less buffer memory than this upper bound. Murthy and Bhattacharyya [1999] develop a buffer-merging technique to reduce data-buffering requirements by overlaying buffers in the synchronous dataflow graph. They report a 60% reduction in buffering memory consumption. Most recently, Maestre et al. [2001] presented a general framework for reconfigurable computing, in which task scheduling and context allocation problems have been studied to prune the system design space and to minimize memory fragmentation.
CONCLUSION
In this article we address the problem of how to design a system-on-chip with minimum silicon area that meets the QoS requirements for real-time multimedia applications. We selected the timing constraints (synchronization and latency) as the measure for QoS and proposed an algorithm to determine the minimum storage and a feasible schedule for a given hardware configuration to provide QoS guarantees for given applications. We proposed a two-phase design methodology consisting of hardware configuration selection and storage minimization. For a fixed hardware configuration, our storage minimization algorithm provides the optimal solution to meet all the QoS requirements. We show that better synchronization can be achieved at the cost of more storage. Experiments on simulated MPEG movies demonstrate that our offline scheduler saves storage over EDF and provides synchronization, and that the online heuristic is effective and efficient in reducing (or completely avoiding) frame drops due to lack of storage.
