Abstract-Computation-intensive image processing applications need to be implemented on multicore architectures. To execute efficiently on such platforms, the underlying data and/or functions should be partitioned and distributed among the processors. The optimal partitioning approach is the one that minimizes inter-processor communication while maximizing load balance. As the continuously increasing number of cores drives the demand for more complex memory hierarchies, non-uniform memory access, etc., on-chip communication has gained a significant role in taking advantage of multicore chips. Therefore, making partitioning decisions based only on conventional performance results, without communication profiling, is suboptimal. In this paper, we explore the behavior of a mesh decoder as a case study in terms of communication and computation, and propose models that allow early prediction of the application's behavior. With these models, profiling the application for every input sample is no longer necessary. As a result, communication- and computation-aware parallelization can be performed faster and more easily.
I. INTRODUCTION
The implementation of image processing applications on a single-core processor requires a large amount of computation time and power. High computational complexity of these algorithms and real-time demands make single-core processors unsuitable. Thus, high performance computing systems are required to satisfy user requirements.
Today, an inevitable paradigm shift can be observed towards multicore platform technology as a promising replacement for dedicated processors. In fact, chip multiprocessors (CMPs), or multicore technology, have become the mainstream in CPU designs [1]. Multicore processors can offer significantly better performance (and lower power consumption) while running at lower frequencies, mainly because the workload is distributed among the processing cores [2]-[3]. In practice, however, integrating a number of processing elements on a single chip makes it harder to reach the maximal performance, since efficiently exploiting the shared hardware resources is cumbersome. In other words, multicore architectures suffer from potential contention for shared resources, such as the shared memory and the interconnection network between the cores. This brings a whole new set of issues that can no longer be solved in hardware alone and hence impact the software as well [2]-[3]. As a result, an efficient parallel implementation of the software on the target architecture is key to achieving the highest performance and a better utilization of multicore processors. However, parallelization of applications is a difficult, laborious, and time-consuming task which is usually performed manually. Thus, providing profiling guidance helps the programmer to avoid trial and error and to tackle the parallelization issues.
To make use of the inherent parallelism offered by multicore architectures, the application must be partitioned into smaller units that can be handled by different cores. In fact, a competent partitioning method results in better allocation, synchronization, scheduling, etc. of the application. Partitioning an application without a full understanding of its behavior is unreasonable and error-prone. Therefore, it is necessary to profile the application beforehand to gain better insight into its characteristics. Traditionally, computation profiling was the only criterion used to distinguish between partitioning possibilities. Although partitioning based on computation profiling tries to reduce the computational load imbalance among the cores, the major drawback of this method is the risk of high interconnection traffic overhead, leading to degraded application performance. By partitioning an application with respect to communication, a systems designer can try to confine the communication streams to a processor core or its caches. This goal is achieved by mapping the heavily communicating entities onto the same core, so that major communication streams do not become visible on the network. These streams are then carried by a much more efficient mechanism, such as the core's registers, local cache, or scratchpad memory [4]. In fact, communication profiling appears to be a crucial determinant in making partitioning decisions, and a reasonable partitioning strategy is one that finds a trade-off between computation and communication workload balance.
In this paper, we propose a functional partitioning strategy to address the above two criteria. A SIM (Scalable Intraband Mesh) decoder [5] has been considered as a case study; it is a subdivision-based wavelet decoder for compressing the representation of discretized scalable 3D geometries. The proposed partitioning method starts with communication profiling, which plays an extremely significant role in communication-efficient software implementations. Based on the task-to-task communication results and the estimated execution time for each input mesh, an average behavior for the decoder has been identified and modeled. To the best of our knowledge, no such analysis has been carried out so far. By modeling the decoder in terms of communication and computation complexity, a reasonably efficient partitioning scheme can be determined which aims to minimize the inter-processor communication while maximizing the load balance, in order to overcome performance degradation in parallel processing architectures. Although this strategy has been applied here to the partitioning problem for a wavelet decoder, it is also applicable to general parallelization problems, for instance making mapping and scheduling decisions.
The rest of this paper is organized as follows. In Section II, a brief overview of the SIM decoder as the case study is provided. Section III is devoted to communication and computation profiling as the main tools for parallelization. Section IV presents the proposed partitioning strategy along with the experimental results and discussion. Finally, the conclusions are provided in Section V.
II. CASE STUDY: 3D WAVELET DECODER
In the current case study, we focus on the Scalable Intraband Mesh (SIM) coding technique of [5], which is a wavelet-based semi-regular mesh codec exploiting the intraband statistical dependencies between the wavelet coefficients. This is achieved by wavelet-based decomposition of the semi-regular input mesh, which outputs an approximation band (i.e. the base mesh) and a set of detail subbands S_i, where 1 ≤ i ≤ J denotes the decomposition level. The approximation is losslessly stored in the bitstream, whereas the detail subbands are subjected to embedded scalar quantization, octree-based encoding of the significance maps, and efficient context-based arithmetic entropy coding. The module profiled in our work is the decoding system. The advantage of using the SIM decoder is that it allows for progressive or scalable decoding, which is defined in terms of resolution levels. Since the resolution level is a dynamic feature, a partitioning decision has to be found which is compatible with all resolution levels.
The first processing step in the decoder is entropy decoding of the bitstream. Specifically, information related to the significance and sign bits is decoded using a set of context models derived from the state of the first-ring neighbors. Assume that each detail subband S_i has p_max magnitude bit-planes and one sign bit, with bit-plane p_max being the most significant one. The p-th bit-plane of a particular subband, with p_max > p ≥ 0, is recovered during bit-plane decoding. This process consists of three decoding passes, which we refer to as the significance, non-significance, and refinement passes [5].
The output of the bit-plane decoding step consists of a set of binary maps. These maps serve as input for inverse quantization, by which the wavelet coefficients of each subband S_i are reconstructed in the middle of the corresponding quantization cell. Given the base mesh and the decoded detail subbands, the original 3D mesh can be reconstructed through a subdivision procedure which consists of a splitting step and an averaging step. During the splitting step, new vertices are introduced in the middle of the existing edges, while in the averaging step the positions of the newly introduced vertices are computed as a weighted sum of neighboring vertex positions. For the SIM codec, the weights are provided through the Butterfly stencil [6].
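The splitting step described above can be illustrated with a minimal sketch. It assumes a plain triangle mesh stored as a vertex list and an index list, and it only performs the midpoint insertion; the Butterfly-weighted averaging step and the addition of wavelet details are deliberately left out. The function name `split_triangles` and the data layout are illustrative assumptions, not taken from the SIM codec.

```python
def split_triangles(vertices, triangles):
    """Splitting step only: insert one new vertex at the midpoint of every
    edge and replace each triangle by four smaller ones (1-to-4 split).

    vertices  : list of (x, y, z) tuples
    triangles : list of (a, b, c) vertex-index tuples
    Hypothetical helper for illustration; not the SIM decoder's Split().
    """
    vertices = list(vertices)          # work on a copy
    midpoint_of = {}                   # undirected edge (i, j), i < j -> new vertex index

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint_of:     # shared edges get a single midpoint
            a, b = vertices[i], vertices[j]
            vertices.append(tuple((p + q) / 2.0 for p, q in zip(a, b)))
            midpoint_of[key] = len(vertices) - 1
        return midpoint_of[key]

    new_triangles = []
    for a, b, c in triangles:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_triangles += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return vertices, new_triangles


# Two triangles sharing the edge (1, 2).
verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
tris = [(0, 1, 2), (1, 3, 2)]
new_verts, new_tris = split_triangles(verts, tris)
```

Each subdivision level roughly quadruples the number of polygons, which is consistent with the strong dependence of the workload on the resolution level reported later in the paper.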
III. PROFILING
Partitioning modern image processing technologies is a challenging endeavor considering the advanced coding techniques involved, including computationally demanding functions and complex data dependencies. In general, there are two ways to partition applications over a multiprocessor environment, namely data partitioning and functional partitioning [7]. Data partitioning for the SIM decoder involves dividing the input 3D mesh, based on the number of vertices, into submeshes (separate 3D objects) to be decoded in parallel across a multicore platform. Later on, a mesh composer aggregates the results from the different processors. The main issue that needs to be resolved for data partitioning is the minimization of the communication overhead caused by data dependencies between partitions. Furthermore, the scheduling of the data partitions has to be considered, since interdependencies impose restrictions on the order in which partitions can be processed [7]. Functional partitioning, on the other hand, implies decomposing the application into individual computation-intensive tasks to be distributed over a multiprocessor platform.
There are two major problems that arise in functional partitioning of the SIM decoder on a multicore platform:
(1) Inter-processor communication overhead; (2) Inherent computational load imbalance among the cores.
Therefore, it is necessary in the first step to profile the application in order to determine the communication and computation costs. Communication and computation cost analysis is required in order to monitor the run-time behavior of the application. Then, based on the results obtained, a communication and computation-aware partitioning approach has to be exploited to solve these problems efficiently.
A. Communication Profiling
Performance optimization of multiprocessor applications relies heavily on a good knowledge of the communication behavior inside the program. In fact, communication through banks of shared memory is implicit and special tools are needed to discover it. Therefore, communication profiling, which means investigating the run-time behavior of an application in terms of memory access patterns, is an essential performance optimization tool in the hands of the designer.
Application partitioning (parallelization) and mapping are the two main reasons why communication profiling is important. The power dissipation and latency in interconnection networks are strong functions of inter-processor communication. By considering the data flow between different functions of an application, which is only visible after communication profiling, better clustering and mapping decisions can be made which aim to minimize the inter-processor communication. This is possible by assigning heavily communicating functions to the same processor, or at least to the nearest neighbor, in order to avoid excessive communication overhead between the processors. This results in better utilization of the network's bandwidth, lower network power consumption, and better performance by avoiding high communication latencies. PinComm, which is described in detail in [4], is a communication profiler which analyzes an application while it runs and provides an automatic measurement of the communication patterns incurred by the application. Based on the communication that flows between major functional blocks of the program, a dynamic data flow graph (DDFG) is constructed which helps the designer to perform parallelization in a communication-aware way [4].
PinComm is implemented based on Pin [8], a run-time binary instrumentation infrastructure from Intel. Binary instrumentation is a technique in which instrumentation code is inserted into the application's executable [9]-[11]. Pin allows modular instrumentation of executables on several platforms (namely IA-32, x86-64, XScale, and Itanium) through the use of plug-ins. The PinComm profiler is a plug-in which instructs Pin to intercept all memory accesses and all function calls. When a memory write is intercepted, the identifier of the currently executing function (and the thread number, if applicable) is stored in a last-written-by table together with the memory address that the write instruction has referenced. When a read instruction occurs, PinComm looks up the address in the last-written-by table to determine the producer of this piece of data.
Hence, PinComm concludes that communication has taken place between two functions, the consumer being the current function, and the producer being the function found in the last-written-by table. The size of the communication stream is specified by the size of the memory read instruction. For function calls and returns, the call trace can be extracted from the call stack, and is later processed into a call tree. Since PinComm is a run-time profiler, it can be attached to any program running on a host PC, with any combination of inputs and parameters. Thus, the designer is able to visualize communication inside compiled sequential or parallel C/C++ programs [4].
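The last-written-by mechanism can be mimicked in a few lines. The sketch below is a simplified model and not PinComm's actual implementation: it assumes byte-granular bookkeeping, ignores threads and the call tree, and skips reads of data that the reading function wrote itself (whether PinComm records such self-flows is not stated here). The class and method names are hypothetical.

```python
from collections import defaultdict


class CommTracker:
    """Toy model of a last-written-by table (hypothetical, for illustration)."""

    def __init__(self):
        self.last_written_by = {}        # byte address -> producer function name
        self.flows = defaultdict(int)    # (producer, consumer) -> bytes communicated

    def on_write(self, func, addr, size):
        # Remember which function last wrote each byte of the written range.
        for a in range(addr, addr + size):
            self.last_written_by[a] = func

    def on_read(self, func, addr, size):
        # Every byte read is attributed to the function that last wrote it.
        for a in range(addr, addr + size):
            producer = self.last_written_by.get(a)
            if producer is not None and producer != func:   # assumption: skip self-flows
                self.flows[(producer, func)] += 1


# Made-up access trace using two function names from the decoder.
tracker = CommTracker()
tracker.on_write("DecodeWaveletCoefficients", 0x1000, 8)
tracker.on_read("Reconstruct", 0x1000, 4)
tracker.on_read("Reconstruct", 0x2000, 4)    # never written: no producer known
```

Summing `flows` over a whole run gives the edge weights of a graph in the spirit of the DDFG: one node per function and one weighted arrow per producer/consumer pair.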
B. Computation Profiling
If the computational load is not fairly distributed in a many-core system, load imbalance can arise. The performance of an unbalanced system will be lower than that of a balanced system, as the more heavily loaded cores need more time to produce their output. The resulting delay from these overloaded cores lowers the overall performance. Therefore, estimating the computational complexity of an application is a crucial prerequisite for making improved partitioning decisions which aim to distribute the workloads more fairly. Computation profilers provide statistics about the timing behavior (e.g. function calls and execution times) of an application while it is running. In this paper, the TI Code Composer Studio (CCS) development tool [12] and the Sniper Multi-Core Simulator [13] have been used to obtain the execution time of the inputs for DSP and x86, respectively.
IV. PROPOSED PARTITIONING STRATEGY AND EXPERIMENTAL RESULTS

A. Communication-aware Partitioning
In order to perform a profiling-based partitioning, the communication flows incurred by the SIM decoder have been measured by PinComm for eight polygonal 3D meshes, namely feline, moai, orang malu, rabbit, santa, screwdriver, sword, and venus. It is important to note that these are the conventional meshes used in the assessment of mesh coding systems, and they are not intrinsically similar. Different quality settings (i.e. the number of bits to extract from each of the detail subbands at the highest resolution: 14, 16, and 18) and resolution levels (1 to 6) have also been taken into account for the encoded meshes. The maximum allowed number of resolution levels is specific to each mesh, and depends on the number of vertices/polygons at the highest resolution. Table I lists the number of vertices for each input mesh per resolution level.
The resulting DDFG for all of the samples with quality equal to 16 and resolution level equal to 5 is illustrated in Figure 1. Each arrow indicates a major communication flow (at least 1% of the total communication flow for a specific quality and resolution level), in MB, which is produced by the origin function (producer) and consumed by the target function (consumer). Table II lists the communication magnitudes corresponding to the arrows in Figure 1. As depicted in the figure, the application can be easily clustered into two groups to keep the communication overhead to a minimum. The first group consists of four functions: Split (splitting step in the subdivision process), SignificancePass and NonSignificancePass (identifying the corresponding passes in bit-plane decoding), and DecodeWaveletCoefficients (bit-plane decoding process). The second group contains five functions, namely SubdivideButterfly (subdivision process), MakePolygonIndexes (creates a list of polygons at the current decomposition level), ComputeNormalsHigherResolutions (computes the normal to each polygon, as required when passing to a local coordinate space), Reconstruct (reconstructs the mesh given the base mesh information and the detail subbands), and ReapplyAveraging (averaging step in the subdivision process).
Since the number of vertices varies per mesh and resolution level, we have also calculated the amount of communication per vertex (Comm_V) in order to have a fair comparison between different meshes. This value has been obtained for each mesh by dividing the total inter-functional data flow throughout the application by the number of vertices for a specific resolution level. The results are presented in Fig. 2(a)-(c) for quality equal to 14, 16, and 18, respectively. As can be seen in the figures, the behavior of the curves is quite similar for all of the 24 experiments, indicating that the amount of Comm_V is not significantly dependent on the type of the mesh being considered, but is strongly affected by the resolution level and quality. These results encouraged us to explore the possibility of modeling the behavior of the decoder in terms of Comm_V. Hence, the average Comm_V for each quality, and also the total average over all measurements, have been obtained and are shown in Fig. 2(d). The curve fitting tool in MATLAB has been used to find the best equation for the total average Comm_V (dashed curve in Fig. 2(d)), with parameters R and Q as resolution level and quality, respectively.
This function could be used to predict the magnitude of Comm V for various meshes in a timely and cost-effective manner before the actual decoding takes place.
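The normalization and averaging behind Comm_V can be sketched as follows. All numbers are made-up placeholders rather than measurements from the paper, the mesh names are hypothetical, and the fitted model itself is not reproduced. Because the maximum resolution level differs per mesh, the average at each level is taken only over the meshes that reach that level, which is an assumption about how the averages were formed.

```python
def per_vertex(total_flow, num_vertices):
    """Communication per vertex: total inter-functional flow / vertex count."""
    return total_flow / num_vertices


def average_curve(curves):
    """Average several per-mesh curves level by level.

    curves : dict mesh name -> dict {resolution level: per-vertex value}
    Meshes that do not reach a given level are left out of that level's mean.
    """
    sums, counts = {}, {}
    for curve in curves.values():
        for level, value in curve.items():
            sums[level] = sums.get(level, 0.0) + value
            counts[level] = counts.get(level, 0) + 1
    return {level: sums[level] / counts[level] for level in sorted(sums)}


# Placeholder inputs: (total flow in bytes, vertex count) per resolution level.
measured = {
    "mesh_a": {1: (4000.0, 100), 2: (24000.0, 400)},
    "mesh_b": {1: (6000.0, 100)},
}
comm_v = {
    mesh: {lvl: per_vertex(flow, n) for lvl, (flow, n) in levels.items()}
    for mesh, levels in measured.items()
}
mean_comm_v = average_curve(comm_v)
```

The resulting level-wise means are the kind of data a curve fitting tool would take as input when searching for a model in R and Q.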
Based on the observations from Fig. 2, functional partitioning of the application for R={1,5,6} is likely to be a better decision than for the other resolution levels. In other words, when Comm_V is high, functional partitioning seems to be less rational due to the higher inter-processor communication. But as we have discussed earlier, an efficient partitioning strategy depends not only on the communication overhead, but also on the computational load balance among the cores. Therefore, it is not possible to make an accurate partitioning decision without taking the impact of load balance into account. This issue is further discussed in the next section.
B. Computation-aware Partitioning
If the computational load is not fairly distributed in a many-core system, load imbalance can arise. Estimating the computational complexity of an application is a crucial prerequisite for making improved partitioning decisions which aim to distribute the workloads more fairly. Therefore, we have used the TI Code Composer Studio simulator for the TI C64x+ processor to obtain the execution time for the meshes mentioned in the previous section. The amount of computation per vertex (Comp_V) has been obtained for all 24 combinations of meshes and qualities, just as was the case with Comm_V. The average Comp_V for each mesh over the three different qualities is illustrated in Fig. 3(a) on a logarithmic scale. The very similar shape of the curves allows us to easily model the application in terms of computational load based on R and Q, independently of the mesh being decoded. Mathematically fitting the data using MATLAB's curve fitting tool results in
The same process has been repeated with the Sniper Multi-Core Simulator, which is a next-generation parallel, high-speed, and accurate x86 simulator. Fig. 3(b) shows the average amount of Comp_V for each mesh on a logarithmic scale, which could be estimated by
These analytical models could be applied to predict the application's behavior in terms of performance for different meshes. Fig. 3 clearly exhibits the exponential growth of Comp_V with resolution level, leading to a rapid increase in computational load on the cores. Hence, functional partitioning of the application seems to be necessary at higher resolution levels in order to mitigate the load imbalance among the cores. However, any decision made without considering the communication behavior of the application could lead to inaccurate or even wrong conclusions. Therefore, we have to combine the computation and communication behaviors of the application in order to obtain a trade-off between them. In fact, integrating the computation and communication data captured for each mesh during profiling gives much better insight into how the functions should be allocated to the cores.
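To illustrate how an exponential trend of this kind can be extracted from profiling data, the following sketch fits y = a * exp(b * R) by ordinary least squares on the logarithm of the values. This is a single-variable illustration only: the models in this paper depend on both R and Q and were obtained with MATLAB's curve fitting tool, and their exact form is not assumed here. The sample data are synthetic.

```python
import math


def fit_exponential(levels, values):
    """Least-squares fit of values ~ a * exp(b * level) on a log scale.

    Taking logarithms turns the model into a straight line,
    ln(value) = ln(a) + b * level, which is fitted in closed form.
    """
    n = len(levels)
    logs = [math.log(v) for v in values]
    mean_x = sum(levels) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in levels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, noise-free data generated from a = 2.0, b = 0.5 (not measured values).
levels = [1, 2, 3, 4, 5, 6]
values = [2.0 * math.exp(0.5 * r) for r in levels]
a, b = fit_exponential(levels, values)
```

On a logarithmic axis, as used in Fig. 3, such a model appears as a straight line whose slope is b.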
Putting the communication (Fig. 2(d) ) and computation data (Fig. 3) together, special areas in the partitioning problem space for this application could be found that are interesting to mention:
(1) R={2,3}, where the lower computational load decreases the possibility of load imbalance. At the same time, the communication flows are high, which could result in heavy inter-processor traffic. Therefore, data partitioning of the meshes suffices and additional functional partitioning is not necessary.
(2) R={5,6}, where the computational load is non-negligible. It is also evident from Fig. 2(d) that the magnitude of the data flows decreases at larger resolution levels, leading to lower inter-processor communication. To avoid load imbalance among the cores, further functional partitioning is advantageous in addition to data partitioning.
Note that the partitioning suggestions in the above two categories depend entirely on the hardware architecture and the designer's preferences. Fig. 4(a) shows a sample DDFG in which the computation and communication magnitudes, expressed as percentages, are shown next to the functions and arrows, respectively. The details in this graph are helpful for making efficient partitioning decisions. Starting with the most basic architecture with two identical cores, clustering the application into two blocks (as shown with the gray boxes in the figure) seems to be the best decision in terms of communication overhead. Although this partitioning strategy gives the lowest communication latency between the processors, it suffers from a great load imbalance: the computational load for the left block (74.82%) is almost three times greater than that for the right block (25.18%). Hence, a better clustering decision is to allocate DecodeWaveletCoefficients to one core and all of the other functions to another core (Fig. 4(b)) to reach a trade-off between communication and computation overhead. In this way, just 26.77% (solid lines) of the total communication is visible in the interconnection network, and the rest (dashed lines), which stays inside a cluster, can be handled by the processor's internal memory.
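The two quantities weighed against each other in this discussion, the computational load per core and the share of communication that crosses core boundaries, can be computed from an annotated DDFG with a few lines of code. The sketch below is illustrative only: the function name `evaluate_partition`, the placeholder task names, and all weights are hypothetical and are not the values of Fig. 4.

```python
def evaluate_partition(comp, comm, assignment):
    """Score a function-to-core assignment.

    comp       : dict function -> computation share (e.g. percent of run time)
    comm       : dict (producer, consumer) -> communication volume
    assignment : dict function -> core id
    Returns (load per core, fraction of communication crossing cores).
    """
    loads = {}
    for func, share in comp.items():
        core = assignment[func]
        loads[core] = loads.get(core, 0.0) + share

    total = sum(comm.values())
    cut = sum(w for (src, dst), w in comm.items()
              if assignment[src] != assignment[dst])
    return loads, (cut / total if total else 0.0)


# Hypothetical three-function graph (placeholder names and weights).
comp = {"f1": 50.0, "f2": 30.0, "f3": 20.0}
comm = {("f1", "f2"): 10.0, ("f2", "f3"): 30.0}

balanced = {"f1": 0, "f2": 1, "f3": 1}     # cuts the light f1 -> f2 edge
unbalanced = {"f1": 0, "f2": 0, "f3": 1}   # cuts the heavy f2 -> f3 edge

loads_a, cut_a = evaluate_partition(comp, comm, balanced)
loads_b, cut_b = evaluate_partition(comp, comm, unbalanced)
```

Comparing candidate assignments this way makes the trade-off explicit: in the placeholder graph, the first assignment is both better balanced and exposes less traffic to the interconnect.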
The added value of using a communication profiler is most pronounced when considering an architecture with three cores. Several solutions exist for splitting the right block further in order to keep the computational load balanced. But by considering communication cost as well, an improved partitioning strategy can be obtained which cuts the less heavily weighted communication paths between the functions while balancing the load. Keeping this in mind, the partitioning method presented in Fig. 4(c) seems to be reasonable since, with a slight load imbalance between Core 1 (19.88%) and Core 3 (25.18%), the inter-processor network traffic is kept unchanged with respect to the previous architecture (Fig. 4(b)).
The ratio cut partitioning method presented in [14] could also be applied for the obtained profiling results to determine the most satisfactory partitioning decision. However, it is important to notice that finding a suitable partitioning approach is highly dependent on the multicore architecture being used. In other words, different multicore implementations require different allocation strategies, depending on the number and characteristics of the cores, memory structure, interconnection network, etc.
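As a rough illustration of how a ratio-cut style objective could be applied to profiling results, the sketch below exhaustively searches all two-way splits of a small communication graph. It assumes the commonly used form of the objective, namely the cut weight divided by the product of the two part sizes; whether [14] uses exactly this formulation is an assumption, and part sizes could equally be replaced by computation weights. Exhaustive search is only practical for a handful of functions, and the graph shown is a placeholder.

```python
from itertools import combinations


def best_ratio_cut(nodes, comm):
    """Brute-force two-way partition minimizing cut / (|A| * |B|).

    nodes : iterable of at least two function names
    comm  : dict (producer, consumer) -> communication volume
    Returns (ratio, part_a, part_b).
    """
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Fix the first node in part A so each split is enumerated once;
    # k < len(rest) guarantees that part B is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            part_a = {first, *extra}
            part_b = set(nodes) - part_a
            cut = sum(w for (u, v), w in comm.items()
                      if (u in part_a) != (v in part_a))
            ratio = cut / (len(part_a) * len(part_b))
            if best is None or ratio < best[0]:
                best = (ratio, part_a, part_b)
    return best


# Placeholder chain graph with one light edge in the middle.
nodes = ["A", "B", "C", "D"]
comm = {("A", "B"): 10.0, ("B", "C"): 1.0, ("C", "D"): 10.0}
ratio, part_a, part_b = best_ratio_cut(nodes, comm)
```

The size term in the denominator discourages splitting off a single function, so the objective favors cuts through light communication paths that also leave both parts non-trivial.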
V. CONCLUSION
The purpose of this paper is to prove the effectiveness of communication profiling coupled with computation profiling in order to make optimal use of multicore systems. Mitigating the load imbalance has been the only conventional criterion for most partitioning strategies, but the efficiency of using PinComm as a communication profiler is verified while trying to keep the computation and communication costs balanced. Therefore, a partitioning strategy has been proposed for a SIM decoder that combines data partitioning with functional partitioning. This approach is mainly based on communication and computation profiling of the application in order to examine the bottlenecks, i.e. interprocessor communication and load imbalance among the cores, and adopt decision methods to overcome them. The results show that it is possible to model the application in terms of communication and computation which provides important suggestions for an optimized application design and deployment in a parallel setting. 
