Abstract-There is a clear trend of future embedded systems [...]

I. INTRODUCTION

Nowadays the semiconductor industry is facing several technological challenges in building media-rich mobile wireless terminals in a profitable way. The applications realized by these devices require enormous computational performance at sufficiently low power consumption. Industry strongly believes that multiprocessor platforms (heterogeneous or homogeneous) are a promising way to meet the aforementioned challenges [1]. To limit power consumption, platforms for mobile systems usually consist of multiple processors. Employing multiprocessor systems makes it easier for embedded devices to support more services and to cope with the increased complexity and dynamism of modern applications.

However, there is the need to exploit the dynamic and multi-threaded character of the targeted application domain. One of the most critical bottlenecks is the very dynamic and concurrent behavior of many current multimedia applications. In order to deal with these new dynamic applications, where tasks and complex data types are created and deleted at run-time based on non-deterministic events, a new design flow is needed.

Additionally, among the different multiprocessor platforms available, the designer can find different memory hierarchy organizations. These hierarchies consist of combinations of caches, scratchpad memories and SDRAMs. Thus, the designer not only has to map the application efficiently onto the platform, but also has to exploit the underlying memory hierarchy efficiently. Efficient use of memory can avoid conflicts and page misses. The number of page misses heavily affects the dynamic energy cost and the execution time. In this situation, task scheduling can be used in order to expand data assignment freedom.

In this paper, the goal is to provide a complete framework for mapping dynamic multimedia applications onto embedded multi-processor platforms. We present the first attempt to unify all the design stages, creating a design flow that handles the dynamic data of multimedia applications. Initially, the methodologies were developed separately from each other [3]-[5]. Now that they have reached a quite mature level, they need to work in a unified design flow, which requires assessing how they cooperate with each other and eliminating possible conflicts among them. With the development of adequate automation tools, the proposed design flow will demand minimum intervention by the designer, leading to an increase in design productivity.

The remainder of the paper is organized as follows. In Section II, we provide an overview of the related work. In Section III, we present the set of methodologies that make up our design flow. In Section IV, we describe the targeted application, the execution platform and the experimental results. Finally, in Section V we draw our conclusions.

II. RELATED WORK

Traditionally, embedded systems designers have been very reluctant to use dynamic memory allocation in their applications. The reason is that the management of the data structures needed to keep track of dynamic memory was too expensive. Relying on the dynamic memory subsystem can ease the design of the rest of the system, but can also in[...]

The design flow (depicted in Fig. 1) consists of three stages, with each stage trying to handle the dynamic data of applications effectively from its own perspective. In order to maximize the results of these stages, they should be efficiently combined in a unified flow.

[...] that may be executed concurrently. The goal is to study the relationships between tasks and the way the data are accessed, so as to make an optimal scheduling of their memory accesses for the underlying memory hierarchy of the targeted platform.

As a first step, the designer specifies the embedded application as a Multi-Task Graph (MTG) model combined with high-level features of a Control-Data Flow Graph (CDFG) model [16]. That allows the designer to identify the memory conflicts that occur very often between the concurrent tasks.

Once the tasks have been formulated, and in order to eliminate those conflicts, the T-DTSE methodology offers the means for dynamic memory pool assignment. The methodology provides a framework that makes it feasible to optimize the clustering of blocks of dynamic data, so that they can later be placed in physical memories. To achieve that, the designer uses the DM allocators built in the previous stage with the addition of a few extra features (e.g. bank awareness, scheduling of data). Thus, the DM allocators take into account the conflicts between the different dynamic data (i.e. coherency of data accessed at the same time). The application is then executed using the new DM allocators and different data assignments in the memories. Next, using the profiling information extracted from the execution, the optimal data assignment is defined.

TCM is the methodology to find the energy-efficient way to map dynamic real-time applications with concurrent tasks/subtasks onto multiprocessor platforms.

The methodology takes as input a high-level description of the application. That application already incorporates the optimized memory allocators produced by the two previous steps. The purpose of the methodology is to determine a cost-optimal (e.g. energy consumption, deadline miss rate), constraint-driven (e.g. throughput or latency) scheduling of the various tasks and subtasks on a set of homogeneous or heterogeneous processors. The output of the methodology is a set of Pareto curves, with each Pareto point indicating a schedule. Then, the designer must choose the schedule that meets, in the best available way, the design constraints.

The application that was used in order to apply the design flow was the Visual Texture Coding (VTC), which is used in the MPEG-4 standard [15] to compress the texture information in photo-realistic 3D models. As the texture in a 3D model is similar to a still picture, the application can also be used for compression of still images. It is based on the discrete wavelet transform (DWT), scalar quantization, zero-tree coding and arithmetic coding. Its software realization requires around 5K lines of C++ code.

The embedded platform (depicted in Fig. 2) used in our experiments consisted of two Texas Instruments (TI) C6202 [17] processors running at 250 MHz each and a two-bank SDRAM memory of 32 MB in total. The platform was running the VIRTUOSO real-time operating system. Virtuoso (now known as VSPWorks from Windriver) is a commercial RTOS, which features a high-performance kernel design with small memory footprint, and an advanced virtual single-processor (VSP) architecture for the development of embedded multiprocessor and distributed applications.

Fig. 2. The targeted multiprocessor platform.

[...] block sizes (i.e. 1024, 2048, 512, 256 and 128 Bytes in order). Then, exact fit was chosen for the fit algorithm due to the fact that the block sizes were very different [3]. As for the coalescing and splitting, only the memory pool in abstract [...]

In total, 40 configurations of different DM managers needed to be evaluated to achieve the Pareto curve of Figure 3. The creation, implementation and evaluation of all the DM allocators took 13.3 hours in total. This is a memory footprint versus memory accesses Pareto-optimal curve, which shows an available reduction of up to 4.88% for memory footprint and up to 4% for memory accesses, within the available Pareto configurations. Furthermore, the general-purpose DM allocator for Windows-XP based systems [18] was tested to compare it with our custom DM allocators, showing that our Pareto solutions reduce the energy consumption by up to 82.9% and the execution time by up to 3.8%.

Initially a model of the application is built featuring the different tasks and subtasks. From the model and application profiling we observe that the decoding of the image blocks takes up to 65% of the execution time. So, this is the function we focused on. We considered as a task the decoding of a block and as subtasks the decoding of each color coefficient (Y, U and V).

The next step was to convert manually from C++ to C (due to compiler limitations) the DM manager derived from the DMMR stage. Furthermore, the functionality of the manager was enhanced, making it capable of allocating dynamic data in different banks of the SDRAM memory (bank-aware). In each function of the allocator (e.g. malloc/free/calloc), an ID denotes in which bank of the memory the function should act upon. With the use of the bank-aware memory allocator we manage to achieve additional gains of 6.3% in execution [...]

[...] is an additional tuning of T-DTSE and TCM methodologies towards dynamic data.
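The bank-aware allocation interface described in the experiments, where every allocator call carries an ID selecting the SDRAM bank, can be illustrated with a small model. The following is a minimal Python sketch under stated assumptions, not the authors' C implementation: the names `Pool`, `BankAwareAllocator`, `bank_bases` and `blocks_per_pool`, the base addresses, and the layout of one fixed-size pool per block size in each bank are all hypothetical. Only the block sizes, the exact-fit policy and the per-call bank ID come from the text.

```python
from dataclasses import dataclass, field

# Block sizes quoted in the text, in the order they are listed there.
BLOCK_SIZES = (1024, 2048, 512, 256, 128)


@dataclass
class Pool:
    """Fixed-size pool handing out blocks of exactly `block_size` bytes."""
    base: int                      # first address of the pool (hypothetical)
    block_size: int
    capacity: int                  # number of blocks the pool can hold
    free_list: list = field(default_factory=list)
    next_unused: int = 0

    def alloc(self):
        if self.free_list:         # reuse a previously freed block first
            return self.free_list.pop()
        if self.next_unused == self.capacity:
            return None            # pool exhausted
        addr = self.base + self.next_unused * self.block_size
        self.next_unused += 1
        return addr

    def owns(self, addr):
        return self.base <= addr < self.base + self.capacity * self.block_size


class BankAwareAllocator:
    """Model of an allocator whose calls take a bank ID (assumed interface)."""

    def __init__(self, bank_bases, blocks_per_pool=4):
        self.banks = {}
        for bank_id, base in enumerate(bank_bases):
            pools, cursor = {}, base
            for size in BLOCK_SIZES:   # lay the pools out back to back
                pools[size] = Pool(cursor, size, blocks_per_pool)
                cursor += size * blocks_per_pool
            self.banks[bank_id] = pools

    def malloc(self, bank_id, size):
        # Exact fit: only a pool whose block size equals the request is used.
        pool = self.banks[bank_id].get(size)
        return None if pool is None else pool.alloc()

    def free(self, bank_id, addr):
        for pool in self.banks[bank_id].values():
            if pool.owns(addr):
                pool.free_list.append(addr)
                return
        raise ValueError("address does not belong to this bank")
```

Calling `malloc(0, 512)` and `malloc(1, 512)` on an allocator built with two bank base addresses returns blocks in different banks, which is the property that allows data accessed at the same time to be placed apart. A request whose size matches no pool returns `None` in this model; a real allocator would need a fallback policy that the text does not describe.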
[Figure 3 residue: axis label "Memory Footprint (in Bytes)"]

VI. ACKNOWLEDGEMENTS

[...] To cope with these complex tasks embedded systems employ [...]

[...] dngenlib/html/heap3.asp
