Abstract-This paper introduces an application mapping methodology and case study for multi-processor on-chip architectures. Starting from the description of an application in standard sequential code (e.g. in C), first the application is profiled, parallelized when possible, then its components are moved to hardware implementation when necessary to satisfy performance and power constraints. The key contribution of this work is a methodology for high-level hardware/software partitioning that allows the designer to use the same code for both hardware and software models for simulation, providing nevertheless preliminary estimations for timing and power consumption. The methodology has been applied to the coexploration of an industrial case study: an MPEG4 VGA realtime encoder.
I. INTRODUCTION
Technological advances have made multiprocessor implementations of embedded systems a viable alternative to traditional single-processor and pure-hardware designs. Such multiprocessor designs offer high levels of performance and flexibility and at the same time promise low-cost and power-efficient implementations. Nowadays, one of the most promising approaches to design of such systems is a so-called Multiprocessor System-on-a-Chip (MPSoC) paradigm. A typical MPSoC system consists of a number of processing elements (PEs), which can be programmable processors or fixed applicationspecific coprocessors, and storage elements (SEs) connected to PEs via an on-chip communication architecture. As a result, MPSoC architectures represent heterogeneous systems that offer flexible parallel processing resources for implementation of bandwidth-demanding multimedia applications.
However, MPSoC platforms introduce several design challenges associated with their parallel and heterogeneous architecture. Platform-based design [1] faces the problem of the mapping of an application defining a configurable microarchitecture platform, onto which an application can be mapped through a well defined parallel programming model, specified via an application-program interface (API) platform. The mapping of an application to a microarchitecture platform starts from a complex system specification and goes through the vast design space exploration. In this context, the reuse of large base of existing software to perform the exploration of different possible implementations constitutes an important concern. The existing software commonly written in C/C++ language with a single-processor architecture in mind cannot Pierre Paulin, Essaid Bensoudane STMicroelectronics Ottawa, ON, Canada {pierre.paulin,essaid.bensoudane}@st.com be directly reused in a multiprocessor environment, especially in that consisting of a heterogeneous mix of different software and hardware components. The existing software needs to be adapted to the parallel capabilities of the architecture. Furthermore, to enable fast and flexible exploration of the possible application-to-architecture mappings, it is necessary to automate the hardware-software partitioning of the application. Therefore, there is a need for a disciplined approach based on a unified parallel modelling paradigm that would enable a smooth translation of existing sequentially-coded software algorithms into their parallel models suitable for the design space exploration of MPSoC platforms.
In this work, we show how it is possible to map an application to the MultiFlex [2] platform, exploiting the features offered by the combined use of the DSOC and the SMP programming model to provide a novel way of performing initial partitioning and exploration. The use of the DSOC programming model together with transaction-level modeling (TLM) [3] infrastructure allows easy moving of a component from hardware to software and vice-versa.
The rest of this paper is structured as follows: Section II describes current approaches to the parallel mapping problem; Section III outlines the proposed mapping flow; Section IV describes an industrial case study to which the proposed methodology was applied; and finally, Section V draws some concluding remarks.
II. RELATED WORK
The architectural changes introduced in the emerging MPSoCs have a direct consequence in how software engineers program. This fact has already been acknowledged by several researchers, who have proposed preliminary solutions. Most of them agree on the importance of new high-level programmer views of SoC; [4] makes the analogy between a programmers' view to a heterogeneous multi-processor SoC and an instruction set architecture to a single processing element.
A number of programming models focused on multiprocessor SoCs have been presented, such as the MESCAL approach [1] , which has served as base for further different programming models. Nevertheless, most of them are application or domain specific, as the one proposed in [5] , which only addresses communication modeling.
A more general approach composed of two SoC parallel programming models has been introduced in [6] . The Distributed System Object Component (DSOC) model and the Symmetric Multi-Processing (SMP) model are inspired by leading-edge approaches for large system development, but adapted and constrained for the SoC domain.
Various actor-oriented frameworks are proposed to capture arbitrary Models of Computation (MoC) for the purpose of system level modeling and tool supported paths to exploration, implementation and/or verification [7] , [8] . The modeling strategy presented in this paper can be implemented on top of any of these MoC generic frameworks. We selected SystemC mainly because of the broad user acceptance and commercial tool support.
Complementary to our top-down refinement flow, the Component Based Design paradigm [9] advocates the bottomup platform composition from a parameterizable IP library, containing off-the-shelf processing elements, communication fabrics and hardware dependent software layers. This approach is clearly advantageous for the rapid exploration and implementation of the general purpose portion of the application [10] , whereas our approach is focused on application specific architectures executing the data-processing part.
The highest possible abstraction level for design space exploration and application mapping is static performance analysis [1 1], [12] . Other approaches are closer related to simulation frameworks for top-down exploration and refinement like ARTEMIS [13] , MESH [14] , StepNP [6] .
The ARTEMIS project [13] is focused on an automated refinement of coarse-grain Kahn Process Network algorithm models to fine-grain architecture models for the purpose of design space exploration and synthesis. The Modeling Environment for Software and Hardware (MESH) [14] project is concerned with modeling of heterogeneous MP-SoC platforms above the cycle-level accuracy. Here, schedulers are considered as the central modeling element to capture the dynamic and data-dependent nature of MP-SoC platform mapping.
Some recent works present simulation frameworks for mapping applications based on SystemC [15] , and mapping and scheduling of applications on parallel architectures [16] , [17] .
The SystemC based StepNP [6] simulation framework, on which this work is based, also advocates the joint consideration of communication architecture and application specific processing elements. StepNP is based on the MultiFlex approach [2] , that provides the two programming models that are used throughout this work to map application onto the StepNP platform: DSOC and SMP.
III. PROPOSED MAPPING FLOW
The proposed mapping flow allows the designer to coexplore the application design space including both the architecture model and the application source code, as shown in Figure 1 . This flow is targeted to highly parallel applications, like, for instance, multimedia ones. This flow starts from an executable specification of an application, in a language . Static Profiling: the source code of the application is profiled natively on a workstation and the total computation effort needed by the application is estimated, given performance constraints. It is worth noting that this provides the lower bound to the number of processors, in case of a fully software solution. . Parallelism Extraction: since the target architecture is an MPSoC, the intrinsic parallelism must be extracted from the application to exploit the available system resources. The general problem of automatic parallelization of the code is out of the scope of this paper. We consider code that has been already parallelized for the MultiFlex programming models. This step identifies data dependencies at very coarse-grain, e.g. image macroblocks for an MPEG4 encoding algorithm. . Validation: any modification to the reference source code is validated against the reference data, to avoid losing the original program behavior due to some neglected dependencies. . Architecture Modeling: static profiling provides the lower bound in terms of computational resources to run the application and to meet the constraints, hence we need to model the architecture components considering these basic requirements. As an example, the static profiling states the minimum number of processors for a fully software implementation. . Co-simulation: the architecture model and the software are co-simulated, extracting both performance and power measures.
. Analysis: simulation results are collected and analyzed to modify both the application (e.g. to increase the parallelism) and the architecture model, by modifying the initial hardware and software partitioning, for example by moving some parts of the application in hardware or by varying some platform architectural parameters. The mapping continues to cycle through these steps until the constraints for the application are fully met. These steps are detailed in the following.
A. Static Profiling
Statically profiling an application consists of determining its computational requirements. The idea is to define a lower bound to the resources needed by the application. In this way, it is possible to configure the micro-architecture platform so that it satisfies the lower bound. Static profiling can be done natively on any machine that supports a profiling tool-set, like gprof [18] or iprof. iprof identifies the number of instructions needed to execute the native code, while gprof provides analysis on the call graph. Merging the two outputs allows the designer to identify the computational needs of the application and how those need are distributed in the code. As an example, an MPEG4 encoder developed at STMicroelectronics (described in detail in Section IV), when profiled on a x86 architecture, shows a requirement of 4.081 GIPS (Giga Instructions Per Second). It is worth noting that this requirement is strictly valid only on the same Instruction Set Architecture (ISA) on which the application was profiled. More complex (or simpler) ISAs may run the code using less (or more) instructions.
Considering a micro-architecture platform where the target processor is an ARM9 core running at 200MHz, the bound for a fully software implementation is 21 ARM CPUs. In fact, ARM9 has at most a CPI (cycles-per-instruction) equal to 1:
with c and f the number of processors and their frequency respectively. Even supposing that the micro-architecture platform has enough computational power, a sequential application cannot run on all the processing elements and therefore the designer needs to modify it to exploit all possible inherent parallelism.
B. Parallelization Parallelization requires the identification of sections of code working on independent sections of the application's input data. This is not trivial even for applications that exhibit an "embarrassing" level of parallelism like multimedia audio and video encoding. The kind of parallelization needed for distributing tasks to the processing elements of an MPSoC is coarse-to medium-grained. Although there is a large amount of research devoted to the automatic parallelization of code [19] , [20] , coarse-grain parallelizing compilers are still not widespread in the industry. It is more common to directly apply programming models to sequential code, leaving the designer to identify data-dependencies in the code. As an example, a routing application like an IPv4 forwarder, can be parallelized in such a way that every packet is manipulated by a separate thread. The use of the MultiFlex SMP approach reduces the effort [2] because it provides a wellestablished parallel API, that is entirely similar to multithreaded programming when using a traditional UNIX-like operating system, in particular POSIX threading. In addition, the Concurrency Engine (a hardware SMP accelerator that is part of the MultiFlex approach) provides load balancing functionalities, leaving the designer to the only effort of determining the data dependencies of the target application. The proposed flow involves the application of the MultiFlex SMP programming model, identifying data dependencies manually. As an example, considering STM's MPEG4 encoder, independent data blocks in the motion estimation algorithm are represented by a "knight's move" on the chessboard formed by dividing a frame to be encoded in macroblocks (squares of 2x2 blocks of 8x8 pixels), as typical in JPEG and MPEG compression [21] . The parallelized application starts a thread for each independent macroblock to be encoded, as shown in Figure 2 .
C. Validation
After some parts of the application have been parallelized, the resulting code has to be verified again in order to prove that the functional requirements of the application still hold. Applying regression tests to simulated code can be excessively time consuming and may seriously slow down the design process. To overcome such problem, we exploit the similarities of the MultiFlex programming model to Pthreads. It is conceivable to compile natively the same code that would run on the microarchitecture platform, given that the API is the same. Using Pthreads it is possible to explore the parallelization of the application natively, obtaining the exact same behavior that would take place on the simulated platform. The only attention to be given is to use only those Pthread's concurrency control structures that have been implemented in the Concurrency Engine (such as semaphores, monitors, conditions). Validating the parallel design implies enforcing proper synchronization through the use of those structures avoiding any sort of deadlock or race condition. The time saving introduced by this solution is conspicuous. As an example, a simulation run over 30 frames of the MPEG4 encoder takes about Is natively, while it requires minutes (or even hours, depending on the accuracy level) when it is executed, on the same workstation, with an instruction-set simulator (ISS).
D. Simulation
Once the first run of mapping has been applied, the architecture model and the software can be co-simulated. Cosimulation is performed using a transaction-level simulator (in our case, StepNP [6] ), first using timed functional models. The first run of simulation provides data concerning the actual performance of the system, giving bounds to delay and energy consumption. The results of the simulation provide data to proceed in the exploration of the platform configuration. Examples of this results might be low processor utilization, which mean that channel latency is excessive and has to be accounted for, as an example using hardware multithreading [22] . Whenever the application meets the required constraints, simulation can be performed at a lower abstraction level, going into progressive refinements that lead to the actual implementation of the system.
It is unlikely for a complex, high speed, application, that a full software solution gives acceptable performance. In this case, some parts of the system may be implemented in hardware, raising the performance but, at the same time, raising the platform's design cost and lowering its flexibility.
E. Hardware-software partitioning
One of the main advantages of the proposed flow is the ease of HW/SW partitioning through the use of DSOC. In fact DSOC does not make a distinction, from the point of view of the user, between hardware and software components. The DSOC ORB routes requests and the MP engine takes care of the marshaling and unmarshaling activities. In fact, it is possible to create transaction-level models of the application functions using the exact same code that is used for software models, as shown in Figure 3 .
During the static profiling phase, the critical kernels of the applications have been identified. These are implemented as DSOC objects, defining their function signatures as appropriate interfaces using SIDL. The SIDL compiler generates skeletons and stubs needed for communication between DSOC clients and servers. Using a transaction-level DSOC object adapter, it is possible to connect any DSOC object to a communication channel. Since the code used by DSOC objects can be either compiled for the simulated processors or natively, it is kept into a separate library compiled in both forms. Being StepNP (and MP4Free) based on SystemC, which produces native executable simulators, the natively-compiled library can be linked directly to the simulator, and accessed through the stubs. This allows the creation of untimed functional models of each component modeled as DSOC. Models can be turned into timed functional ones adding appropriate wait () statements in the SystemC wrapper. It is worth noting how this approach is not specific to a type of function or hardware component, but it is completely general: it is sufficient to define a SIDL interface to connect a new object.
The proposed methodology, however, suffers from a minor limitation. Since the simulated application and SystemC have different address spaces, it is not possible to use pointers when passing parameters to hardware components. This means that Fig. 3 . Exploiting DSOC to re-use code for both hardware and software object models is is not possible for a hardware component to access memory via DMA, unless the SystemC wrapper is sophisticated enough to support address space conversion. Concerning power consumption, the timed functional models associated with each DSOC object have no power model. To overcome this limitation, we roughly estimated the energy consumption per access to the components using SPARK [23] . SPARK is a behavioral C synthesizer, that produces RT VHDL code. The VHDL code is synthesized as well in Synopsys Design Analyzer and estimated with Power Compiler. The result gives a first estimation of the energy cost per access to the component, as shown in Figure 3 .
Exploiting DSOC, SystemC and TLM, the designer can perform the HW/SW partitioning of the system as a matter of turning some switches and running simulations, greatly simplifying high-level design space exploration. The methodology has been applied to a set of critical kernels, as outlined in the following.
IV. CASE STUDY: AN MPEG4 ENCODER
This section describes how the methodology was applied to map an MPEG4 onto the MultiFlex platform. This platform is constituted of a variable number of ARM processors with a variable number of hardware threads, a two-level cache structure and the STBus interconnection network [16] .
The application, specified in C, has been initially profiled statically, using gprof and iprof (GNU open source tools) on a Linux machine. Profiling defined a lower bound on the number of processors needed for execution: the computing power needed by the applications amounts to 4.08 GIPS. A full software solution would require a minimum of 21 ARM CPUs running at 200MHz (each one providing at most 200MIPS). Table 2 shows the results for the 9 functions that take most of the execution time during the encoding of a frame. These functions represent only a very small portion of the application code (approximately 6% of 8086 lines of code) but they cover 82.81% of all computational resources needed for execution. As the first design choice, these blocks where selected for a has the effect of raising the performance. This means that the DCT is one of the major bottlenecks of the system, and therefore justifying the hardware design choice. The frame rate increases as more components are turned to hardware, but this is not sufficient to reach the 30 frames per second constraint, and more optimizations have to be done in terms of parallelism exploitation.
To exploit the MPSoC architecture, the MPEG4 application has then been split into parallel sections working on independent data. This phase has been optimized manually for the target MPEG4 application. The inner loops were parallelized using fork-join constructs. Data dependencies were carefully analyzed and verified after parallelization.
The architecture has been explored with the proposed flow, and the graph of Figure 5 summarizes the overall performance results, expressed in frames per second (fps) achieved, for a range of architecture parameters. These include the number of processors (2 to 5) and the number of threads per processor (2 to 8). The upper curve represents the theoretical upper bound for a perfect parallelization (i.e. results for a single processor accessing local memory, and then simply multiplied by the number of processors). This theoretical result does not include any inter-processor communication code and assumes zero bus latency. The best result makes use of 5 processors and 8 hardware threads per processor. In this case, 28.5 fps is achieved, or 86% of the theoretical best result of 33 fps. processors, after parallelization, showing a result close to the theoretical upper bound when using a no-wait-state channel and a result very close to the latter when using STBus.
V. CONCLUDING REMARKS
This paper presented a mapping methodology for applications on the MultiFlex platform. In addition, this work presents a fast partitioning exploration scheme that takes advantage of the DSOC programming model. These methodologies have been applied to a multimedia case study: an industrial MPEG4 encoder, showing the validity of the approach. Future works include the integration of the methodology with automatic exploration algorithms and automatic parallelization of the application code.
