Abstract
Introduction
Multimedia applications have intrinsic requirements on deadlines to process the incoming data (latency), and on coherent playout of different types of data (e.g. synchronization among text, image, audio, and video, or among multiple video/audio streams). The timing relationship among the interacting media (synchronization) and within each medium (latency) is one of the most important metrics for the quality of service (QoS) provided by the system that supports such applications, and must be satisfied at presentation time. For example, the lip-sync of audio and video usually requires 25 or 30 synchronization points per second. Cen et al. [1] provide lip synchronization in an MPEG player by simultaneously displaying audio and video frames with the same sequence number. Qiao and Nahrstedt [7] design a fine-grain lip-sync algorithm that first estimates the audio playback and the video decoding times and then adopts a selective dropping policy for each type of frame (I, P, or B). Synchronization has been discussed in both of the recently proposed multimedia standards [3, 6].
Systems design traditionally focuses on the optimization of objectives such as power, cost, area, and performance. As embedded CPU cores become increasingly popular in VLSI systems and multiple embedded cores are integrated on a single piece of silicon, system designers have to implement systems using real-time design techniques to meet the design constraints. Memory hierarchies, in particular caches and on-chip memory, play a very important role in achieving high performance in modern RISC embedded cores. One may use a fast CPU and large caches to improve the performance. However, this requires a large silicon area and restricts the on-chip memory available on a fixed silicon area. Research in the context of real-time scheduling suggests that a proper scheduler with certain knowledge of the upcoming applications requires less storage [2, 5]. How to provide QoS guarantees has not received the attention that it deserves in the system design community. In this paper, we address the problem of systems design with these traditional optimization targets and QoS guarantees, in particular, how to take into account synchronization and latency requirements during system-on-chip (SoC) design.
We propose a two-phase design methodology: (i) selection of hardware configuration and (ii) storage minimization via task scheduling. Different processor cores, combined with different sizes of I-cache and D-cache, have different performance. In the phase of hardware configuration selection, we exclude a combination if it does not produce better performance but occupies more area than another one. For each of the remaining hardware configurations, we determine the minimal storage requirement to satisfy the QoS guarantees by finding the optimal schedule. Then the systems are evaluated and the one with the best performance is chosen based on the optimization targets. We develop an off-line pseudo-polynomial scheduling policy, which is provably optimal in minimizing the storage under the timing constraints.
A Motivational Example
We use a small example to illustrate the importance of the scheduling discipline. Suppose there are two applications, A and B, to be processed on a single processor. Each application consists of a sequence of tasks that request a certain amount of memory storage and CPU time and carry latency constraints, as shown in Table 1. For simplicity, we assume each task takes exactly 1 unit of CPU time for execution. The processor can start executing a task on its arrival and frees the memory occupied by this task as soon as it is finished. Tasks in the same application follow the first come first serve (FCFS) strategy. Let $t_{A_i}$ and $t_{B_i}$ be the finishing times of the $i$-th tasks of A and B. We say A and B are $k$-synchronized if $|t_{A_i} - t_{B_i}| \le k$ for all $i$. We want to schedule the tasks such that no deadline is missed, a pre-defined level of synchronization is achieved, and the memory requirement is minimized.
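The k-synchronization condition above can be checked directly from the finishing times. The following is a minimal sketch; the function names and the sample finishing times are hypothetical and are not the values from Table 1.

```python
from typing import Sequence


def sync_level(finish_a: Sequence[float], finish_b: Sequence[float]) -> float:
    """Return the smallest k for which A and B are k-synchronized.

    finish_a[i] and finish_b[i] are the finishing times of the i-th tasks.
    """
    if len(finish_a) != len(finish_b):
        raise ValueError("both applications must have the same number of tasks")
    # Largest finishing-time gap over all task pairs with the same index.
    return max((abs(a - b) for a, b in zip(finish_a, finish_b)), default=0)


def is_k_synchronized(finish_a: Sequence[float],
                      finish_b: Sequence[float],
                      k: float) -> bool:
    """True if |t_Ai - t_Bi| <= k holds for every i."""
    return sync_level(finish_a, finish_b) <= k


# Hypothetical finishing times, chosen only for illustration.
finish_a = [1, 3, 6, 8]
finish_b = [2, 5, 7, 10]
print(sync_level(finish_a, finish_b))            # 2
print(is_k_synchronized(finish_a, finish_b, 1))  # False
```

A schedule that is k-synchronized is also k'-synchronized for every k' larger than k, so only the smallest level needs to be computed.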
In order to solve the problem, we first construct a storage requirement table (Figure 1), where the entry $(i, j)$ indicates the total storage requirement at the end of time $i + j$ when $i$ CPU units are assigned to B and $j$ CPU units to A. An entry marked by "X" indicates a situation in which at least one of the deadlines is missed. For example, from Table 1 we know that by time 4 tasks $A_0$, $A_1$ and $B_0$ have to be finished; therefore, any scheduler that reaches entries (0,4), (3,1), or (4,0) will fail to satisfy all the latency constraints.
A scheduler is a path from the upper left corner (0,0) to the lower right corner. At any entry, the schedule moves either one step to the right or one step down, and assigns the next CPU time unit to either A or B respectively. The earliest deadline first (EDF) policy [5] always selects the task with the earliest deadline. In $EDF_1$, a tie is broken to minimize the number of context switches; in $EDF_2$, whenever there is a tie, we choose the task that occupies more memory. In this example, both $EDF_1$ and $EDF_2$ serve the two applications with a minimal storage requirement of 93 and achieve 3-synchronization, as shown by the "solid arrow path" and the "dotted arrow path" in Figure 1. Our off-line optimal algorithm uses dynamic programming to find the minimal storage requirement at any time instant and then finds one scheduler. In this case, an optimal scheduler is the path consisting of the entries in bold italic font; it uses only 74 memory units and achieves the same synchronization. 2-synchronization is also possible, as represented by the circled entries. A comparison of the above four schedulers is also provided.
Background and Problem Formulation

Architecture and Hardware Model
Figure 2 shows a typical application-specific system-on-chip, which consists of microprocessor core(s), instruction cache, data cache, hardware accelerators, control blocks, on-chip memory, etc. Several factors combine to influence the system performance: processor performance, I-cache and D-cache miss rates and miss penalty, and clock speed. In particular, the system performance is computed using the following formula for cycles per instruction (CPI): $\mathrm{CPI} = \frac{f}{\mathrm{MIPS}} + (\mathrm{MissRate}_{\text{I-cache}} + \mathrm{MissRate}_{\text{D-cache}}) \times \mathrm{MissPenalty}$, where $f$ is the system clock frequency and MIPS is million instructions per second.
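As an illustration of the CPI formula above, the following minimal sketch evaluates it for one configuration. The units are assumptions: the clock frequency is taken in MHz so that its ratio to the MIPS rating is in cycles per instruction, miss rates are taken as misses per instruction, and the miss penalty is taken in cycles. The function name and the sample numbers are hypothetical.

```python
def cycles_per_instruction(clock_mhz: float,
                           core_mips: float,
                           miss_rate_icache: float,
                           miss_rate_dcache: float,
                           miss_penalty_cycles: float) -> float:
    """CPI = f / MIPS + (MissRate_I + MissRate_D) * MissPenalty.

    Assumed units: clock in MHz, core rating in MIPS, miss rates per
    instruction, miss penalty in clock cycles.
    """
    # Cycles per instruction of the core with an ideal memory system.
    base_cpi = clock_mhz / core_mips
    # Extra stall cycles per instruction caused by cache misses.
    stall_cpi = (miss_rate_icache + miss_rate_dcache) * miss_penalty_cycles
    return base_cpi + stall_cpi


# Hypothetical configuration, chosen only for illustration.
cpi = cycles_per_instruction(clock_mhz=100.0, core_mips=80.0,
                             miss_rate_icache=0.02, miss_rate_dcache=0.03,
                             miss_penalty_cycles=10.0)
print(round(cpi, 3))  # 1.75
```

With these numbers the core contributes 1.25 cycles per instruction and the caches add 0.5, which shows how a smaller cache with a higher miss rate can offset the benefit of a faster core.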
Caches typically found in current embedded multimedia systems range from 4KB to 32KB. Although larger caches correspond to higher hit rates, they occupy a larger silicon area. Since higher cache associativity results in significantly higher access time, we consider only direct-mapped caches. We experimented with 2-way set associative caches, but they did not dominate in any single case. Cache line size was a variable in our experimentation. Its variation corresponded to the following trade-off: a larger line size results in less hardware and area, but a higher cache miss penalty. We use CACTI [9] as a cache delay estimation tool with respect to the main cache parameters: size, associativity, and line size. A sample of the cache model data is given in tabular form. Table 4 gives the performance and area data for sample processor cores.
Application and Quality of Service Model
We assume that we receive applications from a reliable end-to-end connection. Each application consists of a set of tasks; each task has its arrival time, latency, execution time (for a given hardware configuration), storage requirement, and synchronization specification with the tasks in other applications. Formally, the $j$-th task $A_{ij}$ of the $i$-th application $A_i$ has the following parameters: $t_{ij}$, the arrival time; the execution time with a given hardware configuration; $l_{ij}$, the latest time to finish $A_{ij}$ after its arrival; $m_{ij}$, the memory requirement; and $n^k_{ij}$, $s^k_{ij}$, the synchronization of $A_{ij}$ with a task in the $k$-th application, i.e., the finish times of $A_{ij}$ and $A_{k n^k_{ij}}$ cannot differ by more than $s^k_{ij}$ time units.
On the service side, we assume that tasks within the same application are processed in first come first serve (FCFS) fashion, and that there is a charge for a context switch between different applications. The memory occupied by a task can be freed as soon as this task has been executed. The execution time of a task depends on the hardware configuration; for example, a fast processor core and a large cache with a low miss rate provide a short execution time.
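The task parameters and the two QoS constraints described above can be captured in a small data structure. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, the latency is interpreted as relative to the arrival time (so the absolute deadline is arrival plus latency), and context-switch cost is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Task:
    app: int          # application index i
    index: int        # task index j within the application
    arrival: float    # t_ij
    exec_time: float  # execution time on the chosen hardware configuration
    latency: float    # l_ij, assumed relative to the arrival time
    memory: int       # m_ij
    # Maps application k to (n, s): this task must finish within s time
    # units of task n of application k.
    sync: Dict[int, Tuple[int, float]] = field(default_factory=dict)

    @property
    def deadline(self) -> float:
        """Absolute deadline, assuming latency is measured from arrival."""
        return self.arrival + self.latency


def satisfies_qos(task: Task,
                  finish_times: Dict[Tuple[int, int], float]) -> bool:
    """Check the latency and synchronization constraints of one task.

    finish_times maps (application, task index) to a finishing time.
    """
    finish = finish_times[(task.app, task.index)]
    if finish > task.deadline:
        return False
    for k, (n, s) in task.sync.items():
        if abs(finish - finish_times[(k, n)]) > s:
            return False
    return True


# Hypothetical task: must finish by time 3 and within 2 units of task (2, 0).
task = Task(app=1, index=0, arrival=0, exec_time=1, latency=3, memory=20,
            sync={2: (0, 2)})
print(satisfies_qos(task, {(1, 0): 2, (2, 0): 4}))  # True
print(satisfies_qos(task, {(1, 0): 2, (2, 0): 5}))  # False
```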
Problem Formulation and Key Results
We formulate the problem as follows: Given a set of applications with their computation, storage, latency and synchronization requirements, determine a system-on-chip (i.e., the type of processor core, sizes of I-cache, D-cache and on-chip memory) with the minimal silicon area such that all the application requirements are satisfied.
We developed a dynamic programming-based algorithm that finds, in pseudo-polynomial time, the minimal on-chip storage requirement and a feasible scheduler to service the applications within their timing constraints (latency and synchronization). The algorithm assumes a priori knowledge of the data streams and that tasks within the same application are scheduled following the FCFS policy. However, every task can have its individual latency and synchronization requests, and we do not assume that the computation load is proportional to the data size. Finally, the algorithm is also applicable when a context switch penalty is explicitly specified.
We define a dominance relationship among the possible SoC configurations and select the one that requires minimal silicon area from all the non-dominated configurations. This methodology is valuable in making early design decisions on silicon area allocation among processor, cache, memory, and other components.
Global Synthesis Flow for QoS Guarantees
In this section, we describe the global flow of the proposed synthesis system and explain the function of each subtask and how the subtasks are combined into a synthesis system. The goal is to choose the configuration of processor, I-cache, and D-cache, and to determine a task schedule with minimal storage once the hardware configuration is fixed. To accurately predict the system's performance for target applications, we employ an approach that integrates optimization, simulation, modeling, and profiling tools. The synthesis technique considers each non-dominated microprocessor core and competitive cache configuration, and selects the hardware setup which requires minimal silicon area and meets all the QoS requirements of the applications. Figure 3 depicts the global flow of the proposed synthesis approach. Starting from a pool of processor cores, I-cache, and D-cache configurations, we identify all the non-dominated hardware configurations based on the characteristics of the given applications. Then for each such system setup, coupled with the detailed information of the applications, we determine the minimal storage requirement and a task schedule to fulfil the QoS demand. Finally, we conduct the system performance estimation and select the configuration that optimizes our design goal.
Synthesis Techniques
Resource Allocation
The objective in this phase is to find an area-efficient system configuration since area is our primary optimization target.
We conduct an exhaustive search over all the processor cores, I-cache sizes (ranging from 512B to 32KB), D-cache sizes (ranging from 4KB to 32KB), and cache line sizes (from 8B to 512B). For each combination, we estimate the system performance and area. One configuration dominates another if it uses less area and results in the same or better system performance. The non-dominated system configurations are kept, and task scheduling is performed on these configurations to identify the most area-efficient design.
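The dominance test described above amounts to a simple filter over the candidate configurations. The following is a minimal sketch; the `Config` type, the field names, and the sample area and performance numbers are hypothetical, and performance is assumed to be a single score where higher is better.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    area: float         # silicon area (any consistent unit)
    performance: float  # estimated system performance, higher is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if a uses less area and performs the same or better."""
    return a.area < b.area and a.performance >= b.performance


def non_dominated(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Hypothetical candidates, chosen only for illustration.
candidates = [
    Config("core-A/4KB", area=2.0, performance=50),
    Config("core-A/8KB", area=3.0, performance=55),
    Config("core-B/4KB", area=3.5, performance=52),   # dominated by core-A/8KB
    Config("core-B/16KB", area=5.0, performance=70),
]
print([c.name for c in non_dominated(candidates)])
# ['core-A/4KB', 'core-A/8KB', 'core-B/16KB']
```

The quadratic scan is adequate for a pool of this size; only the surviving configurations need to go through the more expensive scheduling step.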
For each competitive hardware configuration, since the silicon area for storage is proportional to the size of the on-chip memory, our goal is to find the minimal amount of storage that meets the latency and synchronization constraints for a given set of applications. Once the storage requirement is determined, we can estimate the system performance and, in particular, calculate the total silicon area.
Finally, a task scheduler is required to schedule the tasks such that neither a deadline miss nor a storage overflow occurs. We argue that this cannot be done unless the hardware configuration is fixed, because the execution time of a task varies with the hardware configuration.
The Basic Storage Minimization Algorithm
We describe our area minimization algorithm for the simplest case in Figure 4, where we have only two applications. At entry $(i, j)$, $i$ slots and $j$ slots have been assigned to applications $A_2$ and $A_1$ respectively, but it does not matter to whom each specific slot has been assigned. Equation (**) finds the minimal memory requirement up to time instant $k = i + j$. It has to be large enough to store the unfinished tasks ($IMR_{ij}$), and it guarantees a feasible path to entry $(i, j)$ from either the left ($AMR_{i-1,j}$) or above ($AMR_{i,j-1}$).
In step 3, any marked entry has its left entry, the entry above, or both with value $AMR_{TT}$, which is the minimal storage requirement. This is guaranteed by equation (**). Once the AMR table is built, the minimal memory requirement is given as $AMR_{TT}$, and a feasible scheduler corresponds to a path from (0,0) to (T,T) in the AMR table.
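The table construction described above can be sketched as a bottleneck-path dynamic program. This is an illustration under stated assumptions, not a transcription of Figure 4: the recurrence $AMR_{ij} = \max(IMR_{ij}, \min(AMR_{i-1,j}, AMR_{i,j-1}))$ is inferred from the prose, infeasible entries (a missed deadline or a violated synchronization bound) are assumed to be precomputed and encoded as `None`, and the function names and sample table are hypothetical.

```python
from typing import List, Optional, Tuple

Table = List[List[Optional[int]]]


def build_amr(imr: Table) -> Table:
    """Accumulated minimal storage needed to reach each feasible entry."""
    rows, cols = len(imr), len(imr[0])
    amr: Table = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if imr[i][j] is None:
                continue  # infeasible entry stays unreachable
            if i == 0 and j == 0:
                amr[i][j] = imr[i][j]
                continue
            preds = []
            if i > 0 and amr[i - 1][j] is not None:
                preds.append(amr[i - 1][j])
            if j > 0 and amr[i][j - 1] is not None:
                preds.append(amr[i][j - 1])
            if preds:
                # Storage must cover this entry and the cheapest way to reach it.
                amr[i][j] = max(imr[i][j], min(preds))
    return amr


def trace_schedule(amr: Table) -> List[Tuple[int, int]]:
    """Walk back from the last entry, always taking the cheaper predecessor."""
    i, j = len(amr) - 1, len(amr[0]) - 1
    if amr[i][j] is None:
        raise ValueError("no feasible schedule")
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and amr[i - 1][j] is not None:
            candidates.append((amr[i - 1][j], i - 1, j))
        if j > 0 and amr[i][j - 1] is not None:
            candidates.append((amr[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


# Hypothetical 3x3 storage table; None marks an infeasible ("X") entry.
imr = [
    [0, 30, None],
    [40, 50, 45],
    [None, 35, 0],
]
amr = build_amr(imr)
path = trace_schedule(amr)
print(amr[-1][-1])  # 50
print(path)         # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```

The table has one entry per pair of slot counts, so the running time grows with the product of the two schedule lengths rather than with the number of tasks, which is consistent with the pseudo-polynomial bound stated earlier.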
Modifications for QoS Guarantees
In this section, we briefly discuss how to modify the above algorithm to meet the QoS guarantees (e.g. latency, synchronization) for general applications (e.g. individual arrival time, latency, execution time) when there is a charge for context switching. 
Experimental Results
We test the proposed algorithms on MPEG video streams. Standard MPEG encoders generate three types of compressed frames: I frames (intra-pictures), P frames (predicted pictures) and B frames (bi-directional predicted pictures). On average, I frames are the largest in size (since they are self-contained), followed by P frames and B frames. Krunz and Tripathi [4] present a comprehensive model for MPEG video streams. In particular, the frame sizes of different types of frames are simulated by three different sub-models which are intermixed according to the group-of-pictures pattern. Statistically, the generated MPEG streams fit the empirical video and are sufficiently accurate in predicting the queueing performance for real video streams. We simulate four video streams using the parameters provided in [4], and the information on the generated frames is reported in Table 5. (The frame size of I frames has a relatively large standard deviation because it is modelled as the sum of two random components.)
The algorithm in Figure 4 finds the minimal storage requirement for a set of a priori applications. For each of the above MPEG video movies, we find the storage requirements for both the EDF policy and our off-line optimal scheduler. In EDF, a tie is broken randomly. Our off-line algorithm has been applied four times: with no synchronization, 2-sync, 4-sync, and 8-sync. The off-line optimal storage requirements are normalized with respect to that of the EDF policy, as shown in Figure 5. The key feature of these solutions is that they have synchronization guarantees, and the trend is clear: better synchronization needs more storage.
In all cases, the off-line EDF policy, which achieves 12- to 15-synchronization, requires more storage.
Different processor cores use different amounts of silicon area and deliver different performance (see Table 4). We investigate each processor core with I-cache and D-cache setups (sizes from 4KB to 32KB) and cache line setups (sizes from 8B to 512B). For the sample MPEG frames, we conclude that the LSI TR4101 core with a 4KB I-cache and a 4KB D-cache has the best performance in terms of silicon area. Details are reported in Figure 6.
Conclusion
In this paper, we address the problem of how to design a system-on-chip with minimal silicon area that meets the QoS requirements of real-time applications. We select the timing constraints (synchronization and latency) as the measure of QoS and propose an algorithm to determine the minimal storage and a feasible schedule for a given hardware configuration to provide QoS guarantees for given applications. We propose a two-phase design methodology of hardware configuration selection and storage minimization. For
