Due to the limited amounts of on-chip memory, large volumes of data, and the performance and power consumption overhead associated with interprocessor communication, efficient management of buffer memory is critical to multi-core image processing. To address this problem, this paper develops new modeling and analysis techniques based on dataflow representations, and demonstrates these techniques on a multi-core implementation case study involving multiple, concurrently executing image processing applications. Our techniques are based on careful representation and exploitation of frame- or block-based operations, which involve repeated invocations of the same computations across regularly arranged subsets of data. Using these new approaches to manage block-based image data, this paper demonstrates methods to analyze synchronization overhead and FIFO buffer sizes when mapping image processing applications onto heterogeneous, multi-core architectures.
RELATED WORK
Various previous efforts for reducing synchronization overhead in parallel processing environments have been reported. These techniques can be categorized into two groups: those based on compile-time scheduling and those based on runtime scheduling. Runtime techniques may employ hardware acceleration logic to boost communication performance or specialized arbitration logic to handle dynamic changes in task priorities. These approaches provide reduced synchronization overhead, but increase power consumption, and also increase SoC design and validation costs due to requirements for specialized intellectual property (IP) blocks. On the other hand, compile-time techniques are more power- and cost-efficient, but require accurate estimation of execution times for computational and communication tasks.
Techniques in [1, 4, 8] exploit the structure of static schedules for iterative dataflow graphs, and reduce synchronization overhead by deriving minimal sets of synchronization operations that preserve the sequencing constraints imposed by the original dataflow graph and the given schedule. However, these techniques do not take into account parameterization of communication operations in terms of block size. In this paper, we integrate block size parameterization into communication analysis to provide a more general approach for synchronization optimization.
As a runtime technique, Monchiero et al. propose hardware-assisted modeling of spin lock polling to reduce synchronization overhead [5]. This work carefully analyzes the effect of a hardware spin lock mechanism on synchronization-induced contention for communication resources. However, this approach assumes unpredictable operational patterns among computational threads, and focuses more on general network-on-chip processing applications. In contrast, in this paper, we focus on the image processing domain, and develop compile-time techniques that exploit predictable behavior that is exposed by formal dataflow representations of the input applications.
For signal processing applications, the highest-level data frames or blocks can often be divided naturally into lower-level blocks, which are processed through repeated sequences of block-based operations. Such multi-level block-structured data organization is particularly common in image and video processing applications. Ko and Bhattacharyya have developed techniques for formal modeling and quasi-static scheduling of such multi-level block processing applications by building on the framework of parameterized dataflow graphs [3].
In multi-level block processing scenarios, the lower-level block sizes are significant factors in determining FIFO buffer sizes and the block processing throughput improvements obtained through vectorized implementation [6]. In this paper, we carefully integrate block processing with synchronization cost, and demonstrate the relevance of such integration for multi-core image processing systems. Our proposed approach can be applied as a post-analysis approach in conjunction with existing dataflow scheduling and resource allocation techniques, such as those developed in [2, 8]. Our approach can also contribute to early-stage design estimation of trade-offs between performance and buffer memory utilization.
SHARED BUFFER IMPLEMENTATION
The processing units can employ different memory management policies that are tailored towards the specific characteristics and contexts of the processing units. In our targeted platform (ARM+DSP), the ARM core provides an MMU (Memory Management Unit), and associated support for virtual memory.
In contrast, the DSP core provides direct access to physical memory. This allows for fast data token transfer within DSP memory space, but has limitations in terms of buffer memory protection and fragmentation. Shared memory regions must be handled with special care due to their synchronization requirements and larger access times. Determination of shared region buffer sizes is a critical factor influencing the efficiency of inter-core data token delivery. This paper addresses trade-offs between shared buffer configurations and inter-core communication performance.
We employ a circular buffering policy to map dataflow buffers onto shared memory regions. For inter-core communication, synchronization functionality must be coordinated carefully with circular buffer management to provide correct, efficient memory transfer of data tokens between actors that execute on different cores. Here, an actor is a functional node in a dataflow graph, similar to a task or thread.
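The circular buffering policy described above can be illustrated with a minimal sketch. This is not the implementation used on the ARM+DSP platform; the class name, method names, and the `sync_ops` counter are all hypothetical, and the counter merely stands in for a real inter-core lock or semaphore operation. The sketch assumes that one synchronization operation is charged per block of tokens rather than per token.

```python
class CircularTokenBuffer:
    """Fixed-capacity circular FIFO for dataflow tokens (illustrative only)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0       # index of the next token to read
        self.count = 0      # number of tokens currently stored
        self.sync_ops = 0   # stand-in for lock/semaphore operations

    def free_space(self):
        return self.capacity - self.count

    def write_block(self, tokens):
        """Write one block of tokens; return False if it does not fit."""
        if len(tokens) > self.free_space():
            return False    # the producer would have to wait
        self.sync_ops += 1  # assumed: one synchronization per block
        tail = (self.head + self.count) % self.capacity
        for tok in tokens:
            self.slots[tail] = tok
            tail = (tail + 1) % self.capacity  # wrap around at the end
        self.count += len(tokens)
        return True

    def read_block(self, n):
        """Read n tokens in FIFO order; return None if fewer are available."""
        if n > self.count:
            return None     # the consumer would have to wait
        self.sync_ops += 1  # assumed: one synchronization per block
        out = []
        for _ in range(n):
            out.append(self.slots[self.head])
            self.head = (self.head + 1) % self.capacity
        self.count -= n
        return out
```

Under this assumption, larger blocks reduce the number of synchronization points needed to move the same number of tokens, at the cost of a larger buffer.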
SCHEDULING FORMULATIONS FOR FRAME-BASED PROCESSING
Multi-media applications often process data streams in terms of frames of data that encapsulate contiguous sub-regions of the enclosing streams. The example in Figure 1 shows how the frame size influences the synchronization overhead for shared buffer regions. In this example, actor is placed on the ARM core and actors and are placed on the DSP core. The communication channel between actors and is mapped onto a shared buffer region between the ARM and DSP cores. In contrast, the channel between actors and is placed in a non-shared buffer region associated with the DSP local memory space.
In Figure 1(b), the whole image is divided into four frames, which correspond to sub-images. Each iteration of the dataflow graph of Figure 1(a) processes a single sub-image, and therefore, four iterations of the graph are required to process a complete image. In the dataflow graph of Figure 1, annotations next to the actor ports show the numbers of tokens consumed and produced on each actor input and output port, respectively.
Given the frame-based image processing approach illustrated in Figure 1, the frame size (number of pixels in each sub-image), which we denote by , is given by the product , where and represent the width and height of each sub-image, respectively. The image size (number of pixels in each complete image), which we denote by , can be expressed similarly by the product , where represents the number of sub-images in each image.
The value can be expressed as .
The application dataflow graph, which we denote by , processes pixels per iteration. In Figure 1(b), . A valid schedule for processing a single frame can be expressed as . (2)
If is a valid schedule for processing a single frame, then a valid schedule for processing a complete image can be derived as the looped schedule .
Thus, for example, a valid schedule for the overall application represented by Figure 1(b) is given by .
In Figure 1(c-d), the frame size is twice the value of for Figure 1(a-b), and thus, from (1), we have that . Furthermore, since the actors , , and in Figure 1(c) are obtained by vectorizing actors , , and , respectively, in Figure 1(a), it can be verified that schedule of (2) is also a valid single-frame schedule for the overall application represented by Figure 3(d). Thus, from (4), we have that (5) is a valid schedule for Figure 1(d).
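The relationship between frame size, number of frames per image, and the looped schedule can be sketched as follows. The function names, the actor names `A`, `B`, `C`, and the 8x8 image dimensions are hypothetical choices for illustration; the only facts taken from the text are that the image size is the number of sub-images times the frame size, and that a complete-image schedule repeats a valid single-frame schedule once per sub-image.

```python
def frames_per_image(image_pixels, frame_width, frame_height):
    """Number of sub-images needed to cover one complete image."""
    frame_pixels = frame_width * frame_height
    if image_pixels % frame_pixels != 0:
        raise ValueError("frame size must evenly divide the image size")
    return image_pixels // frame_pixels


def looped_schedule(frame_schedule, num_frames):
    """Looped-schedule notation: repeat the single-frame schedule."""
    return f"({num_frames} {frame_schedule})"


def expand(frame_actors, num_frames):
    """Flat firing sequence corresponding to the looped schedule."""
    return list(frame_actors) * num_frames


# Hypothetical 8x8 image: 4x4 frames give four iterations per image,
# and doubling the frame size to 8x4 halves the iteration count.
small = frames_per_image(64, 4, 4)
large = frames_per_image(64, 8, 4)
print(looped_schedule("A B C", small))
print(looped_schedule("A B C", large))
```

Doubling the frame size halves the loop count while leaving the single-frame schedule unchanged, which matches the vectorization argument made for Figure 1(c-d).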
MODELING SYNCHRONIZATION COST
Given an application dataflow graph, a multi-core target processor onto which the graph is to be mapped, and a shared buffer edge in the graph, we refer to the synchronization count of as the total number of synchronization operations that must be completed for reading and writing data on the edge when processing a complete iteration of the application graph. In our case study, an application graph iteration corresponds to the processing of an image frame (sub-image).
We decompose the synchronization count metric into components and , respectively, where the former (6) refers to the number of synchronization operations for writing data tokens into the associated shared buffer, and the latter (7) refers to the number of synchronization operations for reading.
, (6) and ,
where LCM denotes the least common multiple operator. Here, for one synchronization write request associated with a shared memory edge , tokens are written onto the corresponding shared memory buffer. This can be viewed as a vectorized writing of all of the output data for edge that is associated with a single invocation of the source actor for . Similarly, for one synchronization read request with a shared memory edge , tokens are read from the corresponding shared memory buffer.
If represents the set of shared memory edges (i.e., the edges that are mapped to shared memory buffers), then the total number of synchronization operations required to process a dataflow graph iteration can be expressed as ,
where .
The total number of synchronization operations required in the processing of a complete image can be expressed as the product of the number of image frames in a complete image and the synchronization count for a single frame:
. (10)
Figure 1. Impact of frame size on synchronization counts against shared buffer memory region.
For Figure 1(a), the total number of synchronization operations per graph iteration can be derived from (6), (7), and (8) as , (11) and then the synchronization count for a complete image can be derived from (10) as . (12) Similarly, for Figure 1(b), the total number of synchronization operations per graph iteration and complete image can be derived, respectively, as , (13) and . (14)
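The counting scheme of this section can be sketched in code under one explicit assumption: that the write and read synchronization counts for a shared edge take the form LCM(prd, cns)/prd and LCM(prd, cns)/cns, where `prd` and `cns` are hypothetical names for the tokens written per write request and read per read request. This form is consistent with the use of the LCM operator and the per-request token counts described above, but it is an assumption of the sketch rather than a restatement of (6) and (7). The token rates in the example are also invented.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def sync_counts(prd, cns):
    """Assumed (write, read) synchronization counts for one shared edge."""
    tokens = lcm(prd, cns)  # assumed tokens moved per graph iteration
    return tokens // prd, tokens // cns


def sync_per_iteration(shared_edges):
    """Sum of write and read counts over all shared edges."""
    return sum(sum(sync_counts(p, c)) for p, c in shared_edges)


def sync_per_image(shared_edges, num_frames):
    """Frames per image times the per-frame synchronization count."""
    return num_frames * sync_per_iteration(shared_edges)


# Hypothetical single shared edge: 16-pixel frames with four frames per
# image, versus 32-pixel frames with two frames per image.
print(sync_per_image([(16, 16)], 4))
print(sync_per_image([(32, 32)], 2))
```

In this toy setting, doubling the frame size halves the number of synchronization operations per image, since each iteration still needs one write and one read on the shared edge.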
ANALYZING SYNCHRONIZATION COST
The time required for synchronization as a frame-level application graph (frame processing graph) processes an image frame (sub-image) is the total time taken for synchronization associated with processing an -pixel frame. This synchronization time can be estimated as .
Here, represents the total synchronization set-up time (overhead due to common, synchronization "stub" code associated with inter-core communication) throughout execution of a single iteration (processing of a single frame) of .
is independent of the frame size ( ), and depends on the total number of required synchronization operations ( ). depends on bus architectures and synchronization methods. The term represents the time taken to process a buffer allocation request from a shared buffer region. This term can be estimated as being directly proportional to . The value of also depends on the profile of memory fragmentation in the shared buffer region at the time of the associated allocation request. This second factor, fragmentation-related overhead, is difficult to predict at compile time because multiple applications run simultaneously and influence the run-time status of the shared buffer region. During design space exploration, it can be useful to have bounded below by a minimum allowable frame size , and to view as an integer multiple of this minimum frame size. Here, if represents the ratio of the image size to the minimum frame size, then and satisfy
To analyze the synchronization performance of different frame processing configurations, it is useful to derive an estimate for the time required to process a single iteration of if , that is, if we use the minimum allowable frame size. Such an estimate can be derived as ,
where represents the frame processing graph with ; represents the computational cost (required time) for processing of a single frame by ; and represents the data token delivery time (excluding the time required for synchronization) for the FIFO buffers associated with the edges in , and generally depends on the memory architectures employed in the target processor.
Given an arbitrary frame size (subject to (17)), the total time required to process the associated frame processing graph can be estimated as ,
and the total time to process a complete image using can be expressed as .
(20) Here, (the "complete-Image SYNCHronization time") represents the total synchronization time to process a complete image using repeated iterations of . We model by:
,
where represents the frame processing graph that results from setting (i.e., setting the frame size to equal the image size).
. (23)
Equations (22) and (23) can be derived from (12), (14), and (16) in conjunction with . The quantity , which represents the total buffer size required to implement the frame processing graph , can be derived as ,
where and represent the sets of dataflow graph edges that are mapped onto shared and non-shared buffer regions, respectively; represents the buffer cost (memory requirement) of the individual shared buffer edge ; and represents the buffer cost of the non-shared buffer edge . Because of the scaling of dataflow production and consumption rates as the frame size increases, we have for arbitrary that .
Building on the various evaluation metrics derived in this section, we can formulate the following ratio as a figure of merit that characterizes the overhead of synchronization relative to the volume of data token transfer in a frame-based, multi-core image processing configuration:
The variation of with candidate frame sizes and associated transformations of the frame processing graph is useful to take into consideration during design and implementation of a multicore image processing system.
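The design space exploration outlined in this section can be illustrated with a toy model. The sketch below assumes that the per-frame synchronization time is the sum of a set-up term proportional to the synchronization count and an allocation term proportional to the frame size, and that candidate frame sizes are integer multiples of the minimum frame size that evenly tile the image. The function names, the coefficients `setup_per_op` and `alloc_per_pixel`, and the tiling restriction are hypothetical. The linear allocation term deliberately ignores the fragmentation-related overhead, which the text notes is hard to predict at compile time.

```python
def candidate_frame_sizes(image_pixels, min_frame_pixels):
    """Frame sizes k * min_frame_pixels that evenly tile the image."""
    if image_pixels % min_frame_pixels != 0:
        raise ValueError("minimum frame size must divide the image size")
    ratio = image_pixels // min_frame_pixels
    return [k * min_frame_pixels
            for k in range(1, ratio + 1) if ratio % k == 0]


def frame_sync_time(frame_pixels, sync_ops, setup_per_op, alloc_per_pixel):
    """Assumed model: set-up time plus buffer allocation time."""
    setup = setup_per_op * sync_ops          # independent of frame size
    alloc = alloc_per_pixel * frame_pixels   # proportional to frame size
    return setup + alloc


def image_sync_time(image_pixels, frame_pixels, sync_ops,
                    setup_per_op, alloc_per_pixel):
    """Per-frame synchronization time times the frames per image."""
    num_frames = image_pixels // frame_pixels
    return num_frames * frame_sync_time(
        frame_pixels, sync_ops, setup_per_op, alloc_per_pixel)


# Hypothetical numbers: 64-pixel image, 16-pixel minimum frame,
# two synchronization operations per frame, arbitrary time units.
for size in candidate_frame_sizes(64, 16):
    print(size, image_sync_time(64, size, 2, 10, 1))
```

With a purely linear allocation term, the total allocation time per image is constant and only the set-up component shrinks as frames grow. Capturing the rise in overhead at large buffer sizes would require a fragmentation-dependent term, which this sketch does not attempt to model.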
EXPERIMENTAL RESULTS
This paper has developed methods for analyzing the impact of shared buffer regions on data transfer among different (homogeneous or heterogeneous) processing units in a multi-core platform. Our methods are based on analyzing the design space associated with alternative frame processing configurations, and include the derivation of a new figure of merit , which helps to characterize the synchronization performance associated with a given frame processing configuration. Figure 2 shows the results of experiments that demonstrate our analysis, in particular the figure of merit , on the TI DaVinci platform (TI DM6446). In these experiments, we applied the following set of three applications concurrently as part of our case study on multi-application, multi-core signal processing: an MPEG-4 decoder, an alpha blending application, and a JPEG decoder.
These results show that buffer synchronization overhead plays a significant role, especially for smaller buffer sizes. Figure 2 also shows that initially, the impact of buffer synchronization decreases as buffer size increases. This is because the data transfer time becomes increasingly significant compared to the time required for synchronization functions. However, beyond a certain buffer size, the impact of buffer synchronization starts increasing. The main reason for this increase is the growing contribution of the factor as the volume of buffer allocation requests over the shared buffer region increases. Figure 3 shows how varies in relation to buffer size. As shown in Figure 3, AC, which is the time to process buffer allocation requests, increases rapidly as buffer size increases under simultaneous operation of multiple applications. RE represents the time to release allocated buffers. Figure 4 shows how the data transfer rate (KBytes/sec) varies in relation to buffer size. A buffer size of 32 KB provides the best transfer rate in our case study. This experiment quantifies how small buffer sizes cause high synchronization overhead because they effectively increase the frequency at which synchronization operations need to be carried out in conjunction with .
CONCLUSION
As the complexity of multi-media embedded systems increases, heterogeneous multi-core platforms are increasingly attractive from an implementation perspective. This paper has analyzed the impact of buffer size, frame processing, and synchronization performance on overall system performance, and demonstrated this analysis with experiments that involved multiple, concurrent image processing applications on the TI DM6446 platform.
Useful directions for further work include developing the combined optimization algorithm of (21), (24), and (26), and developing tool support to help automate the exploration approaches demonstrated in the paper.
