We developed a pipelined scheduling technique for functional hardware and software modules in platform-based system-on-a-chip (SoC) designs. It is based on a modified list scheduling algorithm. We used the technique for a performance analysis of an MPEG-4 video encoder application and then applied it to architecture exploration to achieve better performance. In our experiments, the modified SoC platform with 6 pipelines for the 32-bit dual-layer architecture shows a 118% improvement in performance compared to the given basic SoC platform with 4 pipelines for the 16-bit single-layer architecture.
I. Introduction
A system-on-a-chip (SoC) can be defined as a complex IC that integrates the major functional elements of a complete end product into a single chip or chipset. In general, an SoC design incorporates at least one programmable processor, on-chip memory, and accelerating functional modules implemented in hardware. It also interfaces with peripheral devices and/or the real world, and encompasses both hardware and software components [1].
The short life cycle and diversification of consumer electronics have placed a premium on getting products to market as quickly as possible. Therefore, it is now more important to design a system that meets the target specifications on time than to design a solution with better performance at a cost of delaying the introduction of a product to the marketplace.
Platform-based design (PBD) is the best-validated industrial approach for achieving high reuse in SoC design and the lowest risk in derivative design. Beyond the reuse of individual IP blocks, PBD reuses complex architectures of hardware and software components organized for a specific application [2]. PBD can decrease the overall time-to-market of the first products and considerably expand the opportunities for early delivery of derivative products.
PBD is a hierarchical design methodology that starts at the system level. It achieves its high productivity through extensive, planned design reuse. Productivity is increased by using predictable, pre-verified blocks that have standardized interfaces. The better the design reuse is planned, the fewer changes are made to the functional blocks [3], [4].
Several platform types have emerged as a result of the evolution of platform-based design. Table 1 summarizes four types of platforms [5]. Note, however, that the boundaries between these types can blur as providers expand their reach. In this paper, we focus on the processor-centric and communication-centric platforms, which require adding specific hardware elements to model each of the applications using them.
Figure 1 gives a general platform architecture for processor-centric and communication-centric platforms. It has two master modules, a processor and a direct memory access controller (DMAC), and three slave modules (shared memory and hardware modules) connected via the communication network. The processor performs software functions, initiates hardware modules (HW setup), and controls the DMAC (DMA setup) for data transfer between the shared memory and hardware modules. The communication network can be a single-layer or multi-layer on-chip bus, or a packet- or circuit-switched network.
Implementations of transformative applications such as JPEG image and MPEG video compression-decompression algorithms must be cost-effective, high-performance, and flexible in order to succeed in the market. As a result, most of them are implemented on an SoC platform that combines an off-the-shelf software (SW) processor core with custom hardware (HW) coprocessors. The SW processor reduces the cost of the system and provides flexibility. The custom HW coprocessors implement the computation-intensive components of the application and enhance the performance of the system [6], [7].
HW-SW co-design techniques can be used for designing such SoCs. In HW-SW co-design, the application specification is transformed into communicating HW and SW components, which comprise a platform that exhibits the desired behavior and satisfies the performance constraints. HW-SW co-design consists of two basic design stages: partitioning the application specification into HW and SW components, and scheduling the execution order of these components.
Figure 2 shows a block diagram of the functional modules of an MPEG-4 video encoder [8], [9]. The encoder has two-step motion estimation (MEC for coarse and MEF for fine), motion compensation (MC), motion vector to motion vector difference (MVMVD) calculation, DCT and quantization (DCTQ), inverse quantization and inverse DCT (IQIDCT), reconstruction (REC), header/texture variable length coding (HVLC/TVLC), and stream production (SP) modules. It encodes video frames coming from the "current frame" and outputs the encoded stream through SP. The "reconstructed frame" is generated to exploit temporal redundancy between frames. The encoding procedure operates on macro blocks of 16 × 16 pixels.
Table 2 shows the execution cycles for the major functional HW and SW modules. We used a register-transfer level (RTL) simulator for HW cycles and an ARMulator with an ARM7TDMI model for SW cycles. In this table, the 'cycles' column indicates the maximum number of cycles required to process a macro block during the simulation of 300 frames of the CIF-size (352 × 288 pixels) foreman stream.
To encode fifteen CIF-size frames (22 × 18 macro blocks each) per second at 27 MHz, the encoder must process a macro block (MB) in about 4,500 cycles. However, based on Table 2, the longest data path requires about 8,300 cycles for execution, without counting the data transfer cycles between functional modules. To implement this application on a platform such as the one shown in Fig. 1 while satisfying the performance specification, we have to implement it in a pipelined architecture. Although a lot of work has been done on fine-grained synchronous pipeline design, little has been done on coarse-grained asynchronous pipeline design. More detailed descriptions of previous works on coarse-grained and fine-grained pipeline designs can be found in [6]. For efficient implementation of the pipelined architecture and for architecture exploration, we developed a pipelined scheduling technique.
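The cycle budget quoted above follows from simple arithmetic, sketched here in Python. The function name and the idealized lower bound on pipeline stages are illustrative assumptions and not part of the original design flow; the bound ignores data transfers and resource conflicts, so a real schedule needs more stages.

```python
import math

def cycles_per_macroblock(clock_hz, frames_per_second, mb_cols, mb_rows):
    """Cycles available per macro block, i.e. the pipeline initiation interval."""
    macroblocks_per_second = frames_per_second * mb_cols * mb_rows
    return clock_hz / macroblocks_per_second

# Figures taken from the text: 27 MHz clock, 15 frames/s, CIF = 22 x 18 macro blocks.
budget = cycles_per_macroblock(27_000_000, 15, 22, 18)

# Idealized lower bound on pipeline stages if the longest path (about 8,300
# cycles according to the text) could be split perfectly. Assumption: this
# ignores DMA transfers, setup tasks, and resource conflicts.
min_stages = math.ceil(8300 / budget)

print(round(budget), min_stages)
```

The budget of roughly 4,545 cycles is what the text rounds to 4,500.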
In this paper, we present a pipelined scheduling technique for hardware and software modules in platform-based SoC design. We then apply it to an MPEG-4 video encoder application for performance evaluation and architecture exploration.
II. Pipelined Hardware and Software Scheduling
Transformative applications are dominated by dataflow operations with few control-flow operations. They can also be easily broken down into distinct functional tasks at a coarse level of granularity. Each task is computation-intensive and strongly interconnected internally, with sparse external communication. Therefore, transformative applications can be specified in a data-dependency-based task-graph format. Note that these applications are iterative in nature and execute repeatedly over different sets of input data. Hence, they are good candidates for pipelined designs.
Platform Architecture
We implement the application on an SoC platform that consists of a single SW processor, one shared memory, one DMAC, and several dedicated HW modules, as shown in Fig. 3. The SW processor is a uniprocessing system and has a local memory for SW execution. Each HW module has its own buffer memory for efficient pipelined operation. HW modules support the concurrent execution of multiple HW tasks. The DMAC is controlled by the SW processor and controls the data transfer between the shared memory and the HW buffer memories. The shared memory and the SW local memory are single-port memories. HW and SW tasks communicate with each other through the shared bus. We consider single-layer and multi-layer shared buses as the communication network in this paper.
Modeling Task Graphs and Resource-Conflict Graphs
A given application can be specified as a directed acyclic graph G(V, E), where V is the set of tasks, each annotated with its execution cycles, and E is the set of dependency arcs. The major tasks are functional HW and SW tasks. For bus-based platforms, data transfers controlled by the DMAC (DMA transfer), HW setup, and DMA setup can also be modeled as tasks. This gives the scheduler further flexibility to improve the performance of the scheduling result.
Execution cycles of tasks can be estimated by simulation, although simulation cannot cover all possible input data. For SW tasks, computation cycles can also be estimated from a complexity analysis of the algorithm. Because HW setup and DMA setup tasks performed by the SW processor consist of calculating register values and setting the registers, their cycles can be computed from the number of registers and the bus characteristics. DMA transfer cycles can be estimated from the amount of data to be transferred and the specifications of the DMAC and memories.
Table 3 summarizes the task types according to their usage of platform resources. Since HW modules support concurrent operations, HW tasks can be performed at any time once all their registers have been set by the HW setup. Task types that use a common resource, that is, task types with check marks in the same column of the table, cannot be performed at the same time. For example, SW tasks and DMA setup tasks cannot overlap even when they are assigned to different pipelines.
These relations between task types can be represented as a resource-conflict graph C(T, R), where T is a set of vertices representing task types and R is a set of edges representing resource conflicts. Figure 3 shows the resource-conflict graph derived from Table 3. In this graph, task types connected by an edge cannot be scheduled in overlapping time.
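As an illustration of how a resource-conflict graph C(T, R) can be derived from a resource-usage table, the following Python sketch connects two task types whenever they share a resource. The usage table below is a hypothetical stand-in for Table 3: the resource names and the individual entries are assumptions chosen to be consistent with the text (for example, SW tasks and DMA setup both occupy the processor) and are not the paper's actual table.

```python
from itertools import combinations_with_replacement

# Hypothetical resource usage per task type (assumed, not the real Table 3).
USAGE = {
    "SW": {"processor"},
    "HW_setup": {"processor", "bus"},
    "DMA_setup": {"processor", "bus"},
    "DMA_transfer": {"dmac", "bus", "shared_memory"},
    "HW": set(),  # assumption: HW modules run concurrently from their own buffers
}

def conflict_graph(usage):
    """Return R as a set of sorted (type_a, type_b) pairs that share a resource.

    A pair (t, t) means two tasks of the same type cannot overlap either.
    """
    edges = set()
    for a, b in combinations_with_replacement(sorted(usage), 2):
        if usage[a] & usage[b]:
            edges.add((a, b))
    return edges

R = conflict_graph(USAGE)
for edge in sorted(R):
    print(edge)
```

Under these assumed entries, SW and DMA setup are connected by an edge, while SW and DMA transfer are not, so a DMA transfer may overlap with SW execution.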
Problem Definition
Given an application specified as a task graph G(V, E) and resource-conflict graph C(T, R) with a pipeline initiation interval as the performance constraint, find a feasible pipelined schedule and the minimum number of pipelines for executing the task graph.
The pipeline initiation interval is the time difference between the starts of two successive iterations in the steady state of the pipeline. Usually, this value is calculated from the specification of the application.
Pipelined Scheduling Algorithm
Since resource-constrained scheduling is an NP-complete problem (NP stands for nondeterministic polynomial time), pipelined scheduling is also NP-complete [10]. To obtain good solutions to the pipelined scheduling problem in polynomial time, we developed a pipelined scheduling algorithm, shown in Fig. 4, by modifying a list scheduling algorithm. In this figure, 'head' and 'tail' are virtual start and end modules with 0 execution cycles. Add_Candidates(queue, m) adds candidate vertices to the queue. Candidate vertices are vertices whose predecessor vertices have all been scheduled. When it adds a candidate, it sorts the candidate vertices in descending order of priority. The priority is calculated from a combination of slack, task type, and user-defined priority.
Pop(queue) returns the first vertex from the queue, which has the highest priority among the candidates in the queue.
Find_Schedule(m) finds a start cycle of m such that no resource-conflict violation occurs. Each m has three types of information for its scheduling:
1. start cycle: absolute start cycle of scheduling
2. pipeline cycle = (start cycle) % (initiation interval)
3. pipeline number = (start cycle) / (initiation interval)
Set_Schedule(m) marks the scheduled information of its resource type using its "pipeline cycle" and "execution cycle" so that scheduling other modules may not generate resource conflicts.
This scheduling technique is flexible in that the scheduling results can be controlled by giving a user-defined priority of tasks and pre-scheduling of some tasks with Set_Schedule(m).
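The following Python sketch illustrates one way the steps described above (Add_Candidates, Pop, Find_Schedule, Set_Schedule) could fit together. It is a minimal sketch under stated assumptions, not a reproduction of the algorithm in Fig. 4: the priority here is simply the longest remaining path (the paper combines slack, task type, and user-defined priority), resource occupation is tracked per task type in a table indexed by pipeline cycle, and all function and variable names are hypothetical.

```python
def pipelined_list_schedule(tasks, preds, conflicts, ii):
    """tasks: {name: (task_type, cycles)}; preds: {name: [predecessors]};
    conflicts: set of sorted (type_a, type_b) pairs; ii: initiation interval.
    Returns {name: (start_cycle, pipeline_cycle, pipeline_number)}."""
    succs = {n: [] for n in tasks}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)

    rank = {}
    def longest_path(n):  # assumed priority: longest remaining path to the end
        if n not in rank:
            rank[n] = tasks[n][1] + max((longest_path(s) for s in succs[n]), default=0)
        return rank[n]

    busy = {}  # task type -> set of occupied pipeline cycles (modulo ii)
    def conflicts_at(ttype, cycle):
        return any(cycle % ii in slots and tuple(sorted((ttype, other))) in conflicts
                   for other, slots in busy.items())

    schedule, finish = {}, {}
    remaining = {n: len(preds.get(n, [])) for n in tasks}
    queue = sorted((n for n in tasks if remaining[n] == 0), key=longest_path, reverse=True)
    while queue:
        m = queue.pop(0)  # Pop: highest-priority candidate
        ttype, cycles = tasks[m]
        if cycles > ii and (ttype, ttype) in conflicts:
            raise ValueError(f"{m} is longer than the initiation interval")
        earliest = max((finish[p] for p in preds.get(m, [])), default=0)
        for start in range(earliest, earliest + ii):  # Find_Schedule
            if not any(conflicts_at(ttype, c) for c in range(start, start + cycles)):
                break
        else:
            raise ValueError(f"no conflict-free slot for {m} with interval {ii}")
        # Set_Schedule: mark the occupied pipeline cycles for this task type.
        busy.setdefault(ttype, set()).update(c % ii for c in range(start, start + cycles))
        schedule[m] = (start, start % ii, start // ii)
        finish[m] = start + cycles
        for s in succs[m]:  # Add_Candidates: successors whose predecessors are done
            remaining[s] -= 1
            if remaining[s] == 0:
                queue.append(s)
        queue.sort(key=longest_path, reverse=True)
    return schedule

# Toy example (hypothetical tasks): two SW tasks around one HW task.
tasks = {"A": ("SW", 3), "B": ("HW", 4), "C": ("SW", 3)}
preds = {"B": ["A"], "C": ["B"]}
result = pipelined_list_schedule(tasks, preds, {("SW", "SW")}, ii=6)
pipelines = max(p for _, _, p in result.values()) + 1
print(result, pipelines)
```

In the toy example, task C cannot start at cycle 7 because pipeline cycles 0 to 2 are already occupied by task A of the next iteration, so it is pushed to cycle 9, which falls in the second pipeline.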
III. Experimental Results
We applied the pipelined hardware and software scheduling technique to the application given in Fig. 2. First, we scheduled the application for a single-layer 16-bit bus-based platform as shown in Fig. 5. In this case, the resource-conflict graph in Fig. 3 can be used.
Figure 6 shows a scheduling result for the single-layer bus-based architecture in Fig. 5. The scheduling result includes hardware modules (HW), software modules (SW), DMA transfers (DMA), and HW/DMA setup (FW). The SW modules are header variable length coding (HVLC), intra refresh (IR) decision, rate control operations (PreRC and PostRC), pre-calculations for DMA transfers, and post-processing for HW modules. FW modules are named after the corresponding HW module or DMA transfer, followed by "Init." MEC has two buffers named SWC0 and SWC1. MEF/MC also has two buffers, named SWF0 and SWF1, and SWF1 has three regions for the luminance (Y) and chrominance (U/V) components. In this case, the bus usage is about 75%.
Then, we explored the platform architecture to improve the performance by using the developed scheduling technique. Because the bus usage is very high, we tried two variations of the architecture: bus-width expansion and bus partitioning.
Bus-width expansion can reduce FW (HW setup and DMA setup) cycles and DMA transfer cycles. FW cycles are reduced in proportion to the bus-width expansion. However, DMA transfer cycles depend on the SDRAM features and DMAC characteristics. By analyzing these characteristics, we obtained the reduction factor of the DMA transfer cycles; in our case, it is 0.67 for doubling the bus width.
By analyzing the data transfers within the bus system, we partitioned the bus into two buses: one controls the HW modules and the DMAC, and the other transfers data between the HW modules and SDRAM. Figure 7 shows a dual-layer bus-based platform, which is implemented by partitioning the shared bus given in Fig. 5. In this case, the resource-conflict graph must be slightly modified because the DMA transfer and HW setup can be performed concurrently.
Table 4 summarizes the scheduling results for the various bus architectures. With four pipelines, the 32-bit dual-layer architecture improves the frame rate by 45% compared to the 16-bit single-layer architecture. We achieved the best performance with seven pipelines for the 32-bit single-layer architecture and six pipelines for the 32-bit dual-layer architecture. The 32-bit dual-layer architecture with six pipelines performs 118% better than the 16-bit single-layer architecture with four pipelines and can process over 30 frames per second. Note that as the number of pipelines increases, more buffers are required at the pipeline boundaries, which increases the area. As a rule of thumb, a 6-pipeline architecture may require 50% more buffers than a 4-pipeline architecture. Also, note that if the pipeline cycle is less than the HW module cycles, those modules must be modified to support multi-pipeline processing.
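The effect of bus-width expansion on task cycles can be sketched as a simple rescaling step applied before rescheduling. This Python sketch assumes that the reduction factor of 0.67 multiplies the DMA transfer cycles and that FW cycles scale inversely with the bus width; the task names and cycle counts are hypothetical and are not taken from Table 2 or Table 4.

```python
import math

FW_TYPES = {"HW_setup", "DMA_setup"}

def scale_for_wider_bus(tasks, width_ratio=2, dma_factor=0.67):
    """Rescale task cycles for a bus widened by width_ratio.

    Assumptions: FW (setup) cycles shrink in proportion to the bus width,
    DMA transfer cycles are multiplied by dma_factor and rounded to the
    nearest cycle, and all other task types are unaffected.
    """
    scaled = {}
    for name, (ttype, cycles) in tasks.items():
        if ttype in FW_TYPES:
            cycles = math.ceil(cycles / width_ratio)
        elif ttype == "DMA_transfer":
            cycles = round(cycles * dma_factor)
        scaled[name] = (ttype, cycles)
    return scaled

# Hypothetical task set for illustration only.
example = {
    "MEC_Init": ("HW_setup", 40),
    "LoadMB": ("DMA_transfer", 300),
    "DCTQ": ("HW", 900),
}
scaled = scale_for_wider_bus(example)
print(scaled)
```

The rescaled task set can then be passed to the scheduler again to estimate the achievable initiation interval for the wider bus.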
IV. Conclusions
In this paper, we described a pipelined scheduling technique for hardware and software modules in platform-based SoC designs. We applied it to performance analysis and to the architecture exploration of platforms. By exploring various architectures, we achieved a 118% improvement in the frame rate. The techniques used in this paper can be applied to a decoder, a codec, or other multimedia processing applications such as JPEG or H.264 codecs. The scheduling results can also be used for firmware coding of embedded processors.
