Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application-specific architectures (e.g. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system.
Introduction
The miniaturization of transistors in processors has led to a greater increase in chip density and functionality. This results in smaller die size and lower power consumption, making it possible for more portable devices to be developed and manufactured. Chip manufacturers have seamlessly increased the capability and performance of such processor systems by taking advantage of these additional transistors. Conventional approaches include superscalar [26], SIMD and coprocessor [13] configurations.
Multiprocessor System-On-Chip (MPSoC) continues to gather momentum, primarily driven by the marketing of multi-core chips from Intel, IBM and AMD. The motivation for multi-core processors comes from the increasing difficulty of improving application performance by solely improving clock speed of systems, and the ease of verification. Thus, performance improvement could now be achieved by exploiting the parallelism within the algorithm.
MPSoCs can be categorized into two domains: homogeneous and heterogeneous multiprocessor systems. Homogeneous systems consist of processors which are identical, as used in Symmetric Multiprocessing (SMP) systems. Heterogeneous processor systems utilize various types of processing entities to maximize performance while minimizing area and power consumption. Such systems may consist of a network of DSP, coprocessor and ASIC components fabricated onto the same silicon die. Each component is mapped and assigned to specific functions, thus executing multi-threaded applications. Such a system typically exhibits coarse-grained parallelism.
In [21, 22], the authors give examples of such systems which provide a platform where multiple processing entities perform computation on different parts of the system concurrently. These systems can be classified as either single-ISA systems or multi-ISA systems. Multi-ISA systems consist of totally different processors which can range from various DSP to CPU implementations [22, 27, 23]. Single-ISA heterogeneous systems [11] allow any application stage to be assigned and mapped to any core in the system with little reconfiguration and modification.
Existing approaches to heterogeneous processor architectures typically map critical regions of software into hardware (e.g. DSP, ASIC). Each hardware component is optimized and suited to its particular mapped region to maximize performance. To increase efficiency and performance of critical systems, Application Specific Instruction-set Processors (ASIPs) [1, 2, 3, 5] have been introduced into such processor architectures. An ASIP's instruction set and its underlying architecture can be configured to a specific application in order to improve efficiency. ASIPs provide a good trade-off between efficiency and flexibility, as the same design can be re-used between different product variants and updated with little additional cost.
Multiprocessors utilizing extensible processors will make a significant contribution to the embedded system domain. In this work, we propose a system configured in a pipelined manner suited for data streaming applications. However, design space exploration of the possible pipeline configurations still remains an art form. While recent research has targeted mapping and selection for heterogeneous processors consisting of custom components and DSPs, we propose a methodology to configure a pipelined system which is properly balanced. A careful trade-off between the number of processors in the pipeline and the extensible options of the processors has to be made to maximize the overall performance. We would like to ensure that the increase in area due to a multiprocessor system is offset by an even higher increase in performance. Our work allows the reuse of existing ASIP design exploration for each individual processor. We explore the design space to approach a near-optimal configuration for a heterogeneous processor system configured as a pipeline.
The rest of this paper is organized as follows: Section 2 gives a broad overview of the multiprocessor research thus far and Section 3 specifies the benchmark applications and platform in this work. Section 4 gives a holistic view of the design flow for a heterogeneous ASIP system implementation. The problem is formalized in Section 5 and an efficient heuristic that achieves near-optimal results is presented in Section 6. Section 7 reports the experimental methodology used in this work. In Section 8, we present our experimental results and analyze the performance of the multiprocessor architecture. Finally, in Section 9 the conclusions are summarized.
Related Work
Various heterogeneous multiprocessor systems have been implemented, primarily in the automotive real-time systems [7] and video/image encoding domains. The authors in [27] explored the use of a heterogeneous system in a real-time video and graphics streams management system, while in [32], the authors applied an adaptive job assignment scheme to perform data partitioning for a multiprocessor implementation of MPEG2 video encoding. A heterogeneous multiprocessor (five cores) for HDTV systems was developed in [10].
Gopalakrishnan et al. [15] used heterogeneous systems in a different manner. Their work generalizes the approach started by Baruah [9] which replicates recurring tasks on multiple processing units to ensure a degree of fault tolerance. Maintaining replicas of a task at different processors ensures that single processor failures will be tolerated well.
A multicore system requires various communication schemes to provide the necessary link between each core in the system. Kim et al. [19] developed a new CDMA-based on-chip interconnection network using a Star NoC topology. To enable quick design of a multicore processor system and the evaluation of its interconnect system, Wieferink et al. [31] developed a methodology for retargetable MPSoC integration at the system level based on LISA [30] processor models and the SystemC [4] framework.
Single core applications utilize instruction level parallelism, enabled by pipelined processors. Hardware-software implementations can be further enhanced using pipeline scheduling [12]. Extending this scheme, multiprocessors are able to exploit task level parallelism by executing different tasks on separate cores simultaneously.
Several pipelining methods have been explored. Jeon et al. [17] partitioned loops into several pipeline stages. The iterative algorithm proposed increased parallelism and reduced the hardware cost of the designed system. Kodaka et al. [20] combined both coarse grain and fine grain parallelism (which includes loop pipelining) using a single OSCAR chip multiprocessor. The work exploits coarse grain task parallelism, loop parallelism and instruction level parallelism using the OSCAR compiler. The OSCAR chip is comprised of several processor elements (PEs) connected to local memory and shared memory, facilitating data transfer among processors.
A declustering technique for scheduling processes onto a multiprocessor system was proposed in [25]. This technique exposes parallelism instances in a Synchronous Data Flow (SDF) graph in order of importance and attains a cluster granularity that fits the characteristics of the architecture depending on the number of processors intended. However, the work mainly targets shared-memory multiprocessors and does not take into account a pipelined heterogeneous multiprocessor architecture, which is used in our work.
Banarjee et al. [8] incorporated heterogeneous digital signal processors with macro pipelining based scheduling. The technique utilized a signal flow graph (SFG) as a basis for partitioning. The work shows that heterogeneous multi-cores are able to improve the throughput rate to several times that of the conventional homogeneous multiprocessor scheduling algorithms.
ASIPs in multiprocessor systems were first explored in [29] . The work proposed a methodology to simultaneously select custom instructions, assign and schedule application tasks on extensible processors. Our work complements this as we explore a higher level of abstraction to select different customized processors which would be suitable in a multiprocessor pipeline architecture.
Givargis et al. [14] proposed a technique for efficiently exploring the power/performance design space of a parameterized system-on-chip (SoC) architecture to find all Pareto-optimal configurations. A directed graph was used to capture the interdependencies, together with algorithms that search the configuration space and incrementally prune inferior configurations. In contrast to [14], we explore the design space of pipelined multiprocessor SoC configurations and the area-performance trade-off behavior.
In [24], a case study was performed which evaluates the performance of such pipeline multiprocessor systems against a distributed systems architecture. The work utilized ASIPs and it was shown that selective optimization of the individual cores provides the necessary performance improvement to balance the overall latency of the pipeline stages in the system. However, there was no formal approach to explore this design space.
We make the following contributions in this paper:
1. We formulate the problem of mapping processor configurations in the context of heterogeneous multiprocessor pipeline architectures. To the best of our knowledge, our work is the first to address this problem using ASIPs.
2. We propose a heuristic to rapidly produce a near-optimal configuration for given benchmark applications which are partitioned into stages. We conduct a design exploration of pipeline multiprocessor designs for MP3 and JPEG encoders.
3. We demonstrate the proposed techniques by enhancing a commercial design flow (using Tensilica's Xtensa LX platform) and applying them to real embedded streaming applications (e.g. JPEG encoder, MP3 encoder).
Background
Our work is based on optimizing sequential applications, which have the following characteristics:
1) Each streaming application contains a kernel which is partitioned into several pipeline stages. This kernel is run multiple times (in JPEG for example, this will be run every frame). Minor loops which have such characteristics are considered atomic and are not partitioned further.
2) The application exhibits a dataflow software architecture. Input data is processed sequentially and deterministically, and results are output in the same order and manner.
An application which has the above characteristics can be partitioned to represent different stages in a pipeline flow. The partitioned application is derived from a standard sequential program written in C. Figure 1 shows possible designs that can be implemented as a pipeline multiprocessor implementation. These designs have been manually created by the designer as examples for the design exploration. Each design has a set of cores, where each pipeline stage is executed by at least one processor. A particular stage takes inputs from the previous stage, to which it is connected via FIFO queues. Thus, these connected systems allow each core to run independently of the others, provided that each stage has the necessary input to begin data processing.
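The FIFO-connected stage structure described above can be illustrated in software. The following is a minimal sketch, assuming Python threads stand in for the cores and `queue.Queue` stands in for the hardware FIFOs; the names `feed`, `stage`, `run_pipeline` and the `END` marker are hypothetical and are not part of the Xtensa toolset.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not part of the original design


def feed(items, outbox):
    """Push every input item into the first FIFO, then signal end of stream."""
    for item in items:
        outbox.put(item)
    outbox.put(END)


def stage(work, inbox, outbox):
    """One pipeline stage: pop from the input FIFO, process, push downstream."""
    while True:
        item = inbox.get()        # blocks while the input FIFO is empty (POP stall)
        if item is END:
            outbox.put(END)       # forward the marker so later stages also stop
            return
        outbox.put(work(item))    # blocks while the output FIFO is full (PUSH stall)


def run_pipeline(stage_functions, items, fifo_depth=4):
    """Stream items through the stages and return the outputs in arrival order."""
    fifos = [queue.Queue(maxsize=fifo_depth) for _ in range(len(stage_functions) + 1)]
    threads = [threading.Thread(target=feed, args=(items, fifos[0]))]
    threads += [
        threading.Thread(target=stage, args=(fn, fifos[i], fifos[i + 1]))
        for i, fn in enumerate(stage_functions)
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = fifos[-1].get()
        if item is END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each stage only touches its own input and output queues, any stage can be replaced or reconfigured without changing its neighbours, which is the property the pipeline designs rely on.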
Each multiprocessor configuration from Figure 1 has a large design space. Each processor in the pipeline system can be configured and mapped to special purpose hardware. A particular configuration which is generated for a design has to be optimal (or near optimal). This near-optimal configuration can be achieved by changing the hardware parameters of each processor to achieve the required performance at the lowest possible cost. Finally, the system is optimized such that the area increase incurred by pipelining the system is more than offset by the increase in performance.
Benchmark Applications
Readily partitioned benchmark programs are not freely available to the research community. We created our own set of benchmark applications based on single processor benchmarks. Two freeware compression algorithms, MP3 and JPEG encoding algorithms, were chosen and ported to the Tensilica Xtensa LX [5] platform architecture.
The data flow graphs are obtained from these benchmark applications by analyzing the data stream throughout the benchmark applications. These benchmarks are partitioned manually into various pipeline / data flow stages, adhering to the respective standards (JPEG & MP3). The partitions are then mapped to stages in a pipeline system (refer to Figure 1). These stages are then written as standalone programs for Xtensa LX processors.
We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. Figure 1 shows the set of designs of the two benchmark applications. The figure also shows the connectivity of each stage in the pipeline; each arrow denotes a FIFO connection to the next stage in the pipeline. Due to space restrictions, we do not show the mapping of tasks to the various stages in the pipeline implementations of Figure 1.
System Architecture
The Xtensa LX [5] is part of the Tensilica line of cores which are configurable, extensible and supported by automatic hardware and software generation tools. The core is synthesizable and allows designers to configure each implementation to match the target application requirements. It supports extended instructions including fusion instructions [28], SIMD/vector instructions and FLIX [6]. The key feature which is used in this work is the queue interface (introduced in Xtensa LX; refer to Figure 2). This feature supports external communications at a much wider bandwidth than existing interconnects. Queue interfaces (TIE instructions) are used to pop an entry from an input queue for incoming data or push data to an outgoing queue. The Xtensa Toolset automatically generates the logic to stall the processor when it reads an empty input queue or writes to a full output queue.
We have profiled the single processor implementations of the MP3 and JPEG encoders on several architectural configurations and have Table 1 ). The table also shows the extra configurable options that are implemented in the LX1 core and not the LX2, due to the higher computation demand of the MP3 benchmark application. The system is currently simulated in a cycle-accurate Xtensa Modeling Protocol (XTMP) environment. The overall core size for each core includes the instruction and data cache area and the extended instruction size. This is explained in Section 8.
The System
Figure 3: Design flow (original program, data flow graph, program partitioning, simulation of configurations).
The design flow for obtaining the best configuration for a given application is summarized in Figure 3. The input to the system consists of: an application written in C/C++, a library of pre-configured processors and a set of cache and XPRES configurations (with the respective area utilization information). The process starts with a program which is then compiled and profiled to detect the hotspots in the algorithm (Figure 3-a). The designer then derives a data flow graph which describes the flow of data through the program. This information can be used to manually (or automatically) partition (Figure 3-b) the program into multiple modules (Figure 3-c).
The designer may produce a set of possible architectures as shown in Figure 1. Each design may have a different number of pipeline stages and different parallel pipeline flows. Each design consists of individual standalone programs, each capable of running independently on a microprocessor core. A heuristic (refer to Section 6) is used to rapidly explore the design space to find the best architectural configuration (Figure 3-d).
The heuristic produces a near-optimal configuration for each design. It is run on all candidate architectures provided by the designer, and the best resulting configuration is selected (Figure 3-e). The design flow eventually produces a set of partitioned programs, including their core configurations (i.e. cache and XPRES configurations).
Problem Definition
The selection and mapping of different regions of software to a multitude of hardware configurations has been widely explored. Previous approaches to hardware-software codesign using ASIPs in a multiprocessor configuration utilize various mappings to NP-hard problems, notably the 0-1 knapsack problem and its derivatives.
Assumptions
We make the following assumptions:
• Each design terminates with only one output processor.
• Each stage in the pipeline refers to a physical processor with the assigned task of the stage.
• Runtime is calculated assuming the processors are not stalled due to an empty input queue (POP stall) or a full output queue (PUSH stall). This assumption is valid since the overall computation time for a pipeline is dominated by the stage with the longest execution time (the critical stage in the pipeline). Thus, for the purpose of estimating the pipeline design performance, these stalls can be ignored (within 2% accuracy in our experiments; refer to Figure 5).
• Whenever a stage has parallel pipelines, we assume that all parallel parts are identical in terms of the number of pipeline stages and processors, such as in stages 2 & 3 of Figure 1(d). This assumption simplifies the design process; asymmetrical pipeline stages are a possible extension that is beyond the scope of this paper.
Formulation
A pipeline program can be represented as a Data Flow Graph (DFG) where vertices represent tasks and edges represent FIFO connections between processors. The designer begins with a list of N possible multiprocessor designs. These different designs would already be partitioned prior to this stage either manually or automatically.
A program can be represented as a process graph, G. The partitioned processes which represent the stages of the pipeline are given in the set
\[ V = \{v_1, v_2, \ldots, v_J\} \]
where J is the maximum number of nodes in the design. Each $v_j$ represents a processor which is mapped to partitioned code segments (refer to Section 3). FIFO connections for the design are represented by the edges of G.
The execution time and area cost of the various process stages have been profiled and can be obtained via the functions $R_k()$ and $C_k()$.
The CFG set contains the processor configurations on which the different program stages will be running. The different processor configurations refer to the various combinations of cache configurations and the enabling of extended instructions generated via the Tensilica XPRES Tool [5]. The functions $R_k()$ and $C_k()$ are defined for each configuration $k \in \{1, \ldots, K\}$, where K = |CFG| is the total number of configurations available. A set of configurations defines the configuration for each core, $v_j$. The runtime, R, of each set of configurations is obtained by using an instruction set simulation (refer to Section 7). The total cost of the design is the sum of the costs of the individual processor implementations in the pipeline design,
\[ C = \sum_{j=1}^{J} C_{k_j}(v_j) \]
where $k_j$ is the corresponding configuration number of $v_j$, and J is the total number of processors in the design. Finally, the cost function which needs to be minimized is defined as
\[ \Theta = R \times C \]
where R and C are all integer values. Given the definitions above, we now try to solve the Θ minimization problem using the algorithm in Figure 4.
' C_j is the set of area costs of the K implementations of pipeline stage j
Find all possible design configuration combinations: K*
For each design k* ∈ K*
    ' Find total area cost
    Area Cost = Σ_{j=1..J} c_j, where c_j = C_{k_j}(v_j) and k_j ∈ K*
    ' Find runtime for this particular implementation
    Runtime = simulation output with configuration k*
    Θ = Runtime × Area Cost
End for
Configuration k* with smallest Θ would be selected
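The exhaustive search in the listing above can be sketched in a few lines. This is only an illustration under stated assumptions: the `simulate` callback is a hypothetical stand-in for the full instruction set simulation, the table layout `area[j][k]` is an assumed encoding of the per-stage area costs, and the numbers are invented toy values.

```python
from itertools import product


def exhaustive_search(area, simulate):
    """Return (theta, assignment) with the smallest theta = runtime x area cost.

    area[j][k]   : area cost of stage j under configuration k (assumed layout)
    simulate(a)  : runtime of the whole pipeline for assignment a, a tuple
                   holding one configuration index per stage (hypothetical hook)
    """
    choices = [range(len(stage_costs)) for stage_costs in area]
    best = None
    for assignment in product(*choices):           # every combination k*
        area_cost = sum(area[j][k] for j, k in enumerate(assignment))
        theta = simulate(assignment) * area_cost   # cost function to minimize
        if best is None or theta < best[0]:
            best = (theta, assignment)
    return best


# Toy data (hypothetical): three stages, two configurations each.
toy_area = [[10, 14], [10, 18], [10, 12]]
toy_process = [[5, 4], [9, 6], [4, 3]]


def toy_simulate(assignment):
    # Stand-in for the simulator: the slowest stage sets the runtime.
    return max(toy_process[j][k] for j, k in enumerate(assignment))


best_theta, best_assignment = exhaustive_search(toy_area, toy_simulate)
```

The loop body runs once per element of the Cartesian product, so the number of simulator calls grows multiplicatively with every stage added to the design.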
Heuristic
The algorithm formulated above is exhaustive and will not converge to a solution quickly, due to the large design space. The permutations of the different implementations result in an exponential complexity of order $O(n^p)$, where n is the maximum number of possible processor configurations and p is the number of processors in the multiprocessor configuration. Table 2 shows the overwhelming computation time for such an exhaustive search. To explore the design space more effectively, we developed a heuristic that closely matches the optimal configuration given by the algorithm in Section 5 without simulating all possible configurations.
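To give a feel for the growth, the snippet below compares the number of full-system simulations needed by an exhaustive search with the number of evaluations needed when each stage is profiled separately under each configuration. The values of n and p are hypothetical and are not taken from the experiments.

```python
def exhaustive_simulations(n, p):
    """Full-system simulations when each of p stages can take any of n configurations."""
    return n ** p


def per_stage_evaluations(n, p):
    """Evaluations when each of the p stages is profiled once per configuration."""
    return n * p


for n, p in [(8, 4), (8, 5), (8, 9)]:   # hypothetical sizes, for illustration only
    print(n, p, exhaustive_simulations(n, p), per_stage_evaluations(n, p))
```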
The different implementations of the pipeline stages result in different execution times within the pipeline. As faster stages will stall for slower ones, we assume that, on average, the latency of each pipeline stage is equivalent to the latency of the critical stage. We now redefine the runtime, R, of a configuration to include the runtimes of each individual core. The runtime of a particular configuration is expressed in terms of the number of iterations, I, and the mapping functions $R^{init}$, $R^{process}$ and $R^{final}$, which give the initialization, core iteration and finalization execution times respectively. In pipeline systems, increasing the workload in the pipeline brings the system closer to the theoretical performance improvement [16]. Similarly, as the number of iterations, I, increases to a significantly large number, the sum of the latencies of the pipeline can be ignored. The runtime can then be simplified to
\[ R \approx R^{init}(v_{first}) + I \times R^{process}(v_{crit}) + R^{final}(v_{last}) \]
where $v_{first}$, $v_{crit}$ and $v_{last}$ denote the initial, critical-stage and final processors. This permits us to calculate the execution time of the pipeline system by using only the runtimes of the initial processor, the critical stage processor and the final processor. Figure 5 shows the distribution of errors in the runtime information when the above equation is used. These errors were obtained by comparing full simulation against the calculated times of the five and nine processor JPEG systems. It is seen from Figure 5 that the estimated runtime errors are within 2.5% of the actual value.
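The effect of the simplification can be checked on an idealized model. The sketch below assumes constant per-iteration stage times and unbounded FIFOs, which is a simplification of the simulated system; `simulate_pipeline` and `estimate_runtime` are hypothetical helper names, and the timing values are invented.

```python
def simulate_pipeline(init, process, final, iterations):
    """Exact finish time of an ideal pipeline with unbounded FIFOs.

    process[j] is the (assumed constant) per-iteration time of stage j.
    """
    done = [init] * len(process)      # done[j]: when stage j finished its latest item
    for _ in range(iterations):
        upstream = init               # input is available once initialization ends
        for j, p in enumerate(process):
            done[j] = max(done[j], upstream) + p   # wait for own previous item and for input
            upstream = done[j]
    return done[-1] + final


def estimate_runtime(init, process, final, iterations):
    """Estimate using only the initial, critical-stage and final contributions."""
    return init + iterations * max(process) + final


def relative_error(init, process, final, iterations):
    exact = simulate_pipeline(init, process, final, iterations)
    return (exact - estimate_runtime(init, process, final, iterations)) / exact
```

In this model the estimate omits only the time needed to fill the pipeline, so the relative error shrinks as the number of iterations grows, which matches the argument made in the text.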
Based on Equation 9, we develop the heuristic shown in Figure 6. The heuristic relies on the fact that our simulation effort is of the order $O(n \times p)$, compared to the complexity of $O(n^p)$ in Section 5, where n and p are the maximum number of possible processor configurations and the number of pipeline stages respectively.
Get minimum core iteration runtime of each processor:
Find critical node:
    ' Critical node is the processor with the worst minimum core iteration runtime
    Critical node is v_crit where ∀k ∈ K, R^process_k
Start with critical node:
    For each configuration k in set K:
Evaluate all other nodes:
    For every other node, v_j ∈ V:
        Filter all configurations k where R^process_k
        If this is the first node, calculate Cost = R^init
        Configuration k with smallest cost would be selected
    Next node
Output: The set of k's obtained for each processor
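One possible reading of the heuristic is sketched below. Several details are assumptions rather than the authors' definitions: the critical stage's configuration is chosen by minimizing core iteration runtime times area, the filter keeps configurations that are no slower than the chosen critical stage, every other stage then takes its smallest-area surviving configuration, and the special handling of the first node's initialization time is omitted. The function name and table layout are hypothetical.

```python
def heuristic_search(process, area):
    """Pick one configuration index per stage.

    process[j][k], area[j][k]: core iteration runtime and area cost of
    stage j under configuration k (assumed layout).
    """
    stages = range(len(process))

    # Minimum core iteration runtime of each stage.
    fastest = [min(times) for times in process]

    # The critical stage has the worst (largest) minimum runtime.
    crit = max(stages, key=lambda j: fastest[j])

    # Assumed cost for the critical stage: runtime x area.
    k_crit = min(range(len(process[crit])),
                 key=lambda k: process[crit][k] * area[crit][k])
    bound = process[crit][k_crit]

    chosen = {crit: k_crit}
    for j in stages:
        if j == crit:
            continue
        # Keep configurations that do not make stage j slower than the critical stage.
        candidates = [k for k, t in enumerate(process[j]) if t <= bound]
        # Among those, take the cheapest in area (assumed cost for non-critical stages).
        chosen[j] = min(candidates, key=lambda k: area[j][k])
    return [chosen[j] for j in stages]
```

The candidate list is never empty, because the critical stage's chosen runtime is at least as large as every other stage's fastest runtime. Only the per-stage tables are needed as input, so the number of profiling runs stays proportional to the number of stages times the number of configurations.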
Experimental methodology
We used Tensilica's Xtensa RA2006.4 Toolset for the Xtensa LX family of processors. The toolset provides a set of compilation tools to compile C/C++ code for the architecture described in Table 1. The Tensilica Instruction Set Simulator (ISS) and Xtensa Modeling Protocol (XTMP) environment were used to run the multi-core systems. For each system, multiple Xtensa cores were instantiated and XTMP was used to connect the cores together, including memory models and peripherals. The ISS directly models the Xtensa pipeline and operates as a system-simulation component using the XTMP environment. With XTMP, different multiprocessor configurations could be set up and simulated rapidly.
The simulator allows for communication between the cores and peripherals using a cycle-accurate, split-transaction simulation model without using a clock. The ISS was used to generate profiling data for all cores in the system, which were then analyzed using Tensilica's gprof profiler. The profiles can include the cycles for all functions executed by the cores. The ISS can also print a summary of the total cycle count and global stalls of each core.
Each individual core is connected via the queue interface provided by the Xtensa LX core using the XTMP environment. Queue models have been created and used in the XTMP environment as libraries. In our work, we simulate all queues with a very large amount of queue buffers, so that no PUSH stalls will occur. If a fixed queue size is required, our architecture design can always be easily mapped to a Kahn Process Network [18] , and using the available tools for KPNs, the optimal queue size can be calculated.
We created our benchmark programs by identifying the various stages of the MP3 and JPEG encoders and mapping them to individual processors. We partition and allocate these stages based on the open standards of the respective encoders. We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. An XTMP simulation program, specially customized to generate profiling data, the runtime of each stage in the pipeline and other relevant benchmark information, is created for each of these multiprocessor systems.
The toolset also includes the XPRES (Xtensa PRocessor Extension Synthesis) compiler which creates tailored processor descriptions for the Xtensa processors from native C/C++ code. We are able to reuse the existing ASIP design flow to create custom RTLs for each core in the system. Using the designer-defined input of C programs to be analyzed, XPRES extends the base processor with new instructions, operations and register files using TIE extensions. It does so by automatically generating a new TIE file which can be included when recompiling the source code. Half the set of the configuration options defined in Equations 4, 5 and 6 are XPRES enabled. However, it should be noted that the XPRES tool configuration was not run in the MP3 design space due to the long simulation time for MP3 encoding. Nevertheless, our heuristic still provides a configuration close to the optimum configuration in the MP3 design space.
Area cost includes the base processor, instruction & data caches and the TIE instructions. A raw image of size 227 by 149 pixels is used as the raw input stream to the JPEG encoder systems, whereas a 6 second PCM encoded music clip is used as the input stream to the MP3 systems.
Figure 7 shows the design space exploration for both the JPEG and MP3 multiprocessor systems. Figure 7(a) shows the design space of the JPEG algorithm implementation. The subfigures on the left show the runtime performance of the benchmarks vs area. In all four figures, the group of data points on the left corner of each graph corresponds to the single processor implementation. The square markers on the graphs are the points obtained via our heuristic algorithm.
Table 3 shows the runtimes and cost functions obtained via the heuristic which we developed in Section 6. The first column shows the number of processors in the design. The second column denotes the number of pipeline stages in the system. If there are more processors than pipeline stages, a parallel pipeline stage exists. MP denotes multi-pipeline while SP denotes single pipeline. Columns four and seven show the runtime and cost obtained via our heuristic. The deviation of the heuristic results is shown as a percentage of the best possible values obtained from an exhaustive search. Note that our algorithm emphasizes reducing the cost function rather than maximizing the performance of the application. Nevertheless, our heuristic still produces runtime values close to the best possible runtime.
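The deviation figure described above is a relative difference with respect to the exhaustive-search optimum. As a small illustration, assuming the deviation is measured on values where smaller is better (runtime or cost), it can be computed as follows; the function name is hypothetical.

```python
def deviation_percent(heuristic_value, best_value):
    """Deviation of a heuristic result from the best value, as a percentage of the best."""
    return 100.0 * (heuristic_value - best_value) / best_value
```

A deviation of 0 means the heuristic found the same value as the exhaustive search.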
Results & Analysis
From our heuristic, we are able to determine a near-optimal configuration for each benchmark. These are the design with five processors (four stages) for JPEG and the design with four processors (three stages) for MP3. The designs for these systems are shown in Figures 1(b) and 1(f) respectively.
For JPEG encoding, we obtained a result within around 5.47% of the optimum value, while for MP3 encoding we came within 5.74%. The heuristic analysis provided us with a configuration which is close to optimal for the particular MP3 and JPEG encoders.
In Figures 7(a) and 7(c), we show that our heuristic provides points close to the optimum runtime while minimizing the cost function. With the five-processor (four-stage) parallel pipeline JPEG implementation, we obtain at least a 4.11X speedup over the single processor implementation, and with the four-processor (three-stage) parallel pipeline MP3 encoder, a 3.36X speedup.
Conclusion
In conclusion, we have formalized the problem of mapping processor configurations in the context of ASIP multiprocessor systems which are implemented in a pipelined manner. We have also presented a heuristic to obtain a near-optimal configuration (smallest cost) given a partitioned benchmark program. This is complemented with a full methodology that uses this heuristic to rapidly explore the architectures provided by the designer and select the architecture which provides the best performance per area. This framework utilizes Tensilica's Xtensa LX [5] configurable cores, which provide the queue interface that is used to connect each processor in the system in a pipelined configuration. We have explored the design space of such an architecture by using the existing ASIP design flow to rapidly select the best cache configurations and extended instructions to provide the necessary speedup while minimizing area, thus providing a good performance to area ratio.
