ABSTRACT Fine-grained multithreaded models offer more opportunities for improving system performance, but they incur a large amount of communication overhead. To reduce this overhead, message aggregation and communication pipeline have been proposed, and their advantages have been discussed in previous works. However, when they are applied to real-life applications, it is challenging to maximize their advantages because their problems have not been revealed. This paper complements those studies by investigating the two communication optimization techniques from the viewpoint of their problems. We have found that message aggregation may suffer from the deadlock problem and the latency problem, and communication pipeline may suffer from the ineffective problem and the prolog and epilog problem. Based on the basic requirement of allocating the two techniques on an initial schedule, we then present general solutions to avoid or handle the problems. To obtain a minimum schedule length for an initial schedule, we further develop a binary particle swarm optimization-based fine-grained communication optimization method that allocates message aggregation and communication pipeline for the best performance. The method is also integrated into the Light and Efficient Simulink Compiler for Embedded Application (LESCEA) multithreaded code generator to generate high-quality code. Experimental results on both synthetic and real-life applications demonstrate the efficiency of the proposed method.
BPSO, communication optimization, fine-grained model, MPSoC.
applied to the scenarios of determined mapping and scheduling. For communication optimization techniques that can be used independently, Wang et al. [5] exploited the retiming technique to preprocess tasks and transform intra-iteration dependencies into inter-iteration dependencies to remove communication overhead. However, their analysis is based on real-time periodic applications, whose application scenarios differ from those of this work. Wesolowski et al. [6] presented a fine-grained communication optimization library named Topological Routing and Aggregation Module (TRAM) to improve performance for NoC. Its key idea is to aggregate and route small units of communication into larger units to reduce the impact of the per-message overhead and the amount of bandwidth consumed by message headers. However, it is designed for the NoC architecture and its mechanism heavily depends on that architecture, so it is not suitable for general bus-based MPSoC. At the architecture level, communication optimization has been considered in terms of bus contention [7], memory access contention [8], etc., which is not the optimization level of this paper.
To reduce the communication overhead more specifically, we decompose one communication time into a sending operation and a receiving operation. Each operation further consists of a communication startup time and a communication transfer time. The communication startup time is caused by the hardware configuration for the communication (such as Direct Memory Access (DMA) initialization) and is constant for a given hardware architecture. The communication transfer time is the actual data transfer time between two threads, which is proportional to the data size. To reduce the communication startup time and the communication transfer time, message aggregation and communication pipeline have been proposed in [9]-[12]. Message aggregation combines the communications with the same sources and destinations into one and thus reduces the communication startup time. Communication pipeline exploits the fact that, once initiated, DMA can perform the communication transfer without the intervention of processors, so that some communication tasks can be preprocessed to overlap communication with computation. Previous works [9]-[12] focus mainly on the advantages of message aggregation and communication pipeline, and they give the impression that applying these two techniques guarantees performance improvements. However, through further investigation, we have found that inappropriately allocating the techniques on an initial schedule may deteriorate the system performance, since they change task scheduling sequences or task graph topologies. This performance degradation leads to two challenges for exploiting message aggregation and communication pipeline:
1) In which cases do message aggregation and communication pipeline lead to performance degradation?
2) How can these cases be handled so that the advantages of message aggregation and communication pipeline are fully utilized to achieve the best performance?
Therefore, it is necessary to first find and analyze the cases where message aggregation and communication pipeline degrade performance (which we call the ''message aggregation problem'' and the ''communication pipeline problem'' in this work), and then propose corresponding solutions that handle the problems and achieve the largest performance improvements.
As applying message aggregation and communication pipeline can be abstracted as a resource allocation problem under constraints, meta-heuristic algorithms can be utilized to solve the problem due to their ability to obtain high-quality solutions in acceptable solving time. Well-known meta-heuristics include Simulated Annealing (SA), Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO), all of which have been exploited to solve the allocation problem [2], [13]-[15]. Among them, PSO is a swarm intelligence algorithm that has shown promising results due to its effectiveness and simple implementation for complex problems [16]. As a variant of the traditional continuous PSO, the Binary PSO (BPSO) [17] handles problems whose solutions are represented in binary form, which fits the allocation problem.
Based on these observations, this paper studies the scenarios where message aggregation and communication pipeline may cause performance degradation and presents general solutions to handle the problems. Integrating the solutions to an initial task schedule, we then propose a BPSO-based fine-grained communication optimization method (BFCO) to apply the two powerful communication optimization techniques to maximize their performance benefit.
The main contributions of this paper are listed as follows.
1) We investigate the application scenarios where message aggregation and communication pipeline may degrade the overall system performance. Message aggregation may face the deadlock problem and the latency problem, and communication pipeline may face the ineffective problem and the prolog and epilog problem.
2) To avoid these problems when applying message aggregation and communication pipeline, we introduce general solutions to the problems, which can be independently utilized for an input schedule.
3) To handle the overall allocation of message aggregation and communication pipeline, we further propose BFCO to efficiently apply these two communication optimization techniques, considering all the problems and solutions, to obtain a minimum schedule length for a certain schedule.
4) The proposed method is further integrated with the Light and Efficient Simulink Compiler for Embedded Application (LESCEA) multithreaded code generator, and extensive experimental results on both synthetic and real-life applications show its efficiency in improving the performance of MPSoC.
The rest of the paper is organized as follows. Section II introduces the background model and the previous introductions of message aggregation and communication pipeline. Section III shows our investigations about the problems of the techniques. Section IV first describes the general solutions to the problems and then presents BFCO to maximize their advantages for the best performance. Section V discusses the experiments. Section VI concludes the whole paper and points out future directions.
II. BACKGROUND
A. SIMULINK MODEL
A Simulink model [18] represents the functionality of a target system, including software threads and hardware architecture. The functional modeling of an application is based on an Abstract Clock Synchronous Model (ACSM) [19], which can easily express parallelism and pipelining through partially ordered intra- and inter-dependencies. Details about Simulink models can be found in [19] and [20].
A Simulink model is made up of the following three basic components.
• Simulink Block represents a function that takes n inputs and produces certain outputs. User-defined functions (S-function), discrete delays, and predefined blocks such as mathematical operations can be classified as Simulink blocks. Basic Simulink blocks contain functional blocks (white circles in Fig. 1 ) used to compute data and communication (sending and receiving) blocks (gray circles in Fig. 1 ) used to explicitly model communications and allow optimizations. Besides, discrete delay blocks (white square in Fig. 1 ) are inserted to avoid deadlocks when building the model.
• Simulink Link is a one-to-many link, which connects one output port of a block to one or more input ports from other blocks, and represents a dependency relation between different blocks. If there is a link from B0 block to B1 block, we say that B1 depends on B0, denoted by B0 → B1. A Simulink link starting from a sending block S and ending with a receiving block R from different processors is referenced as a communication vector, denoted by S → R.
• Simulink Subsystem can contain blocks, links, and other subsystems to represent hierarchical composition.
Having combined the scheduling and hardware information in the system model, an MPSoC Simulink model can be represented by a two-layered hierarchical structure. The system layer describes a system architecture that is made up of CPU subsystems and the inter-subsystem communication channels between them. The subsystem layer describes a CPU subsystem architecture that includes a set of Simulink blocks, the links between them, and the intra-subsystem communication channels between them. In this model, an application is represented by a Directed Acyclic Graph (DAG) where the nodes, i.e. the Simulink blocks, represent tasks and the edges, i.e. the Simulink links, represent task dependencies. An application is executed for many cycles, where one cycle is a period in which every block is executed once. Moreover, in our model, communications and unrelated computations can be executed in parallel, but communications themselves must be serialized.
B. MESSAGE AGGREGATION
Message aggregation can reduce communication startup time by combining communication vectors with the same sources and destinations [9], [12]. A typical example is shown in Fig. 2, where communication vectors S0 → R0 and S1 → R1 in Fig. 2(a) are combined into one communication vector S01 → R01 in Fig. 2(b), as S0 and S1 are on the same processor P0, and R0 and R1 are on the same processor P1. As shown in Fig. 2(c), before aggregation, the communication from P0 to P1 requires two startup times (the small thin rectangles in the sequence), while after aggregation it requires only one, which reduces the total execution time on P0. Therefore, in theory, aggregating N messages saves (N − 1)/N of their startup time. Message aggregation is most useful when there are many communication channels and the communication startup time is relatively high compared to the communication transfer time. Note that, to highlight the startup time when discussing the problems of message aggregation in the next sections, we show the small thin rectangles in the schedule sequence. When discussing problems that are not about message aggregation, we omit the rectangles to simplify the analysis but leave space for them.
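The startup saving can be illustrated with a minimal sketch. It assumes the cost model stated in the introduction (a constant startup time plus a transfer time proportional to the data size); the function names, the parameter `per_unit`, and the numbers are hypothetical and only serve the illustration.

```python
def comm_time(data_size, startup, per_unit):
    """One communication: constant startup plus size-proportional transfer."""
    return startup + per_unit * data_size


def separate_time(sizes, startup, per_unit):
    """Total time when every message pays its own startup."""
    return sum(comm_time(s, startup, per_unit) for s in sizes)


def aggregated_time(sizes, startup, per_unit):
    """Total time when all messages are combined and share one startup."""
    return comm_time(sum(sizes), startup, per_unit)


def startup_saving_ratio(n_messages):
    """Fraction of startup time removed by aggregating n messages into one."""
    return (n_messages - 1) / n_messages


# Hypothetical numbers: three messages between the same pair of processors.
sizes = [4, 4, 8]
before = separate_time(sizes, startup=10, per_unit=1)
after = aggregated_time(sizes, startup=10, per_unit=1)
```

With these numbers, two of the three startups disappear, which matches the (N − 1)/N ratio for N = 3. The sketch deliberately ignores the schedule, so it cannot show the deadlock and latency effects discussed in Section III.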
C. COMMUNICATION PIPELINE
Communication pipeline can overlap communication with its subsequent computation in order to hide the communication time [9] [10] [11] utilizing the hardware feature of DMA that after initiated by CPU, the hardware can complete the communication transfer without the intervention of CPU. A typical example is shown in Fig. 3 , where the system has a sending task, a receiving task and two computation tasks on two processors, and the number in parentheses after the task type name denotes the cycle of data which are currently processed. In Fig. 3(a) , the sending task S sends data of the computation task F0 to the receiving task R for its later usage by F1. Fig. 3(b) demonstrates an example of communication pipeline described in [9] , where each task is executed for 3 cycles. 
III. PROBLEMS OF MESSAGE AGGREGATION AND COMMUNICATION PIPELINE
A. PROBLEMS OF MESSAGE AGGREGATION
Although message aggregation has the advantage as described in Section II-B, we notice that there are two problems when applying message aggregation to achieve performance improvements: the deadlock problem and the latency problem.
1) DEADLOCK PROBLEM
Even though an application is represented by a DAG, there may be loops after aggregating messages which will lead to deadlocks. In [9] , we have mentioned that if there is a path between two sending/receiving tasks to be aggregated, there will be deadlock after aggregation. By further investigating various topologies and initial schedules, we have found that the origins of deadlocks can be divided into 3 kinds: 1) Paths between the aggregated tasks. This is the case discussed in [9] that if there is a path between any two sending/receiving tasks to be aggregated, then there must be deadlocks after aggregation. A simple example is shown in Fig. 4(a) . As there is a path from R0 to R2, after aggregating R0 and R2 the path is closed to a loop which incurs a deadlock. Moreover, when the number of tasks and processors increases, the problem will become more complicated. A complex example is shown in Fig. 4(b) , where there are 3 communication tasks S0, S1, and S2 to be aggregated on the same processor. Fig. 4 (c) represents a possible message aggregation allocation for Fig. 4 (b) by an upper triangle matrix. If the row task and the column task are aggregated, the corresponding matrix bit is denoted by ''MA''. If there is a path from the row task to the column task, the corresponding matrix bit is denoted by ''PATH''. We can see that this allocation meets the demand that S0 and S2 are not aggregated since there is a path between S0 and S2. As there is no path between S0 and S1, S0 and S1 are aggregated, and the same goes for S1 and S2. However, this allocation is actually infeasible. The reason is that when S0 and S1, and S1 and S2 are aggregated respectively, S0, S1 and S2 are all aggregated simultaneously, which forces the infeasible aggregation of S0 and S2. Therefore, although the allocation seems feasible, it actually has the deadlock problem caused by paths between the aggregated tasks. 
2) No paths between the aggregated tasks but with the conditions of forming loops. An example is shown in Fig. 4(d) . There are two sets of communication vectors that can be aggregated respectively and within each set there is no path between the corresponding sending tasks or receiving tasks, i.e. S0 cannot reach S2 and S1 cannot reach S3. However, after the aggregation, a deadlock loop forms consisting of S02
In this case, although there is no path between the communication tasks, the dependencies between the communication and the computation tasks each hold part of the loops which form the loop after aggregation.
For a large number of tasks and processors, the problem will also be complicated as there will be a large loop containing many tasks across several processors, which will be difficult to estimate and resolve. 3) Schedule sequence after aggregation. Even though the task graph topology is acyclic after message aggregation, there will be deadlock as well due to the improper scheduling sequence on each processor. An example is To simplify the analysis, we omit the computation tasks and only show the communication tasks. When S0 and S1 are aggregated, to ensure the task dependency, S1 is aggregated to S0, the latter one in the schedule sequence. For R0 and R1, R0 is aggregated to R1, the former one in the schedule sequence. However, the new schedule sequence leads to a deadlock consisting of S01 → R01 and S2 → R2 and each processor waiting for receiving data from the other. Therefore, after message aggregation, how to decide the schedule sequence of the aggregated tasks is an important problem.
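The forced aggregation behind the first kind of deadlock can be checked mechanically. The following is a minimal sketch, assuming the allocation matrix is stored as a dictionary keyed by (row task, column task); the helper names `aggregated_sets` and `path_conflicts` are hypothetical. It groups tasks that are linked by ''MA'' bits, directly or transitively, and reports pairs inside one group that are marked ''PATH''.

```python
from itertools import combinations

MA, NOMA, PATH = "MA", "NOMA", "PATH"


def aggregated_sets(tasks, matrix):
    """Group tasks linked directly or transitively by MA bits (union-find)."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), bit in matrix.items():
        if bit == MA:
            parent[find(a)] = find(b)

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)
    return [g for g in groups.values() if len(g) > 1]


def path_conflicts(tasks, matrix):
    """Return pairs that share an aggregated set although marked PATH."""
    conflicts = []
    for group in aggregated_sets(tasks, matrix):
        for a, b in combinations(group, 2):
            if matrix.get((a, b)) == PATH or matrix.get((b, a)) == PATH:
                conflicts.append((a, b))
    return conflicts


# Allocation in the spirit of the three-task example: S0-S1 and S1-S2 are
# aggregated, S0-S2 is forbidden because of a path.
tasks = ["S0", "S1", "S2"]
matrix = {("S0", "S1"): MA, ("S0", "S2"): PATH, ("S1", "S2"): MA}
conflicts = path_conflicts(tasks, matrix)
```

Here `conflicts` contains the pair (S0, S2): although its own bit is not ''MA'', the two tasks end up in the same aggregated set. The sketch covers only the first kind of deadlock; loops formed through computation tasks and improper schedule sequences need separate checks.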
2) LATENCY PROBLEM
A message aggregation combines several communication tasks located at different positions in a schedule list into one task, which may cause two kinds of latency problems that degrade performance. First, data may not be sent to the target processor as soon as they are available. As the example in Fig. 4(f) shows, before message aggregation, the available data of F0 can be immediately delivered to other processors, and the receiving processors can begin to execute immediately. However, after message aggregation, the sending time of S0 is delayed, which affects the execution of the receiving processors. Second, the essence of message aggregation is to coarsen the granularity of the messages. Therefore, some advantages of fine-grained communication may be lost. We can also see in Fig. 4(f) that after message aggregation, the benefit of overlapping S0/S1 with F0/F1 is lost and the finish time of S0/S1 is delayed, which delays the overall execution.
B. PROBLEMS OF COMMUNICATION PIPELINE
Communication pipeline can effectively hide the communication transfer time, but it also has problems which may degrade the system performance. We name the problems as the ineffective problem and the prolog and epilog problem.
1) INEFFECTIVE PROBLEM
In some cases, using communication pipeline cannot hide the communication transfer time and even degrades the system performance. To better illustrate the reasons for this problem, we first give the following definitions.
• TI S . The relative amount of the finish time interval between successive cycles for the sending task S. As the example in Fig. 5 (a) shows, TI S is the finish time interval between S(i) and S(i + 1).
• Pipelined iteration. A pipelined iteration is when the code in while(1) executes once after applying communication pipeline as shown in Fig. 3 (a) and tasks in one pipelined iteration may execute in different cycles.
• TEI R . As there may be idle time during which tasks do not execute in one pipelined iteration, we define TEI R as the effective pipelined iteration time, which contains the corresponding receiving task when communication pipeline is used. As the example in Fig. 5 shows, TEI R is the time when R and F1 execute in one pipelined iteration. Based on these definitions, we see that when TI S ≥ TEI R , using communication pipeline cannot improve the performance. In the example in Fig. 5(a), TI S is larger than TEI R and there is idle time in the pipelined iteration after communication pipeline is applied. Therefore, although using communication pipeline can overlap communication and computation, the idle time still brings overhead. On the contrary, recall 
2) PROLOG AND EPILOG PROBLEM
To make tasks pipelined, there must be a prolog phase and an epilog phase, which add overhead to the system. When the application runs for many cycles, this overhead is outweighed by the transfer time saved. However, if the number of cycles is small, the overhead may not be compensated. As the example in Fig. 5(b) demonstrates, after applying communication pipeline on R, the schedule sequence on P1 can be divided into 3 phases: prolog, pipelined iterations, and epilog. In this case, although the scheduling sequence does not have the ineffective problem, using communication pipeline degrades the overall performance. The case is simple, and we can calculate that the transfer time saved by communication pipeline is smaller than the cost brought by the prolog and epilog phases. However, if there are more than two processors, it is almost impossible to compare the time saved with the additional overhead, because it is difficult to predict the impact of the parallel execution of tasks on the total performance.
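For the simple two-processor case only, the trade-off can be written down directly. The sketch below is an assumption-laden illustration, not the evaluation used in this work: it supposes that every cycle saves the same transfer time and that the prolog and epilog costs are known constants, and all names and numbers are hypothetical.

```python
import math


def pipeline_gain(cycles, saved_per_cycle, prolog_cost, epilog_cost):
    """Net time saved; a negative value means the pipeline degrades performance."""
    return cycles * saved_per_cycle - (prolog_cost + epilog_cost)


def break_even_cycles(saved_per_cycle, prolog_cost, epilog_cost):
    """Smallest cycle count for which the net gain is strictly positive."""
    return math.floor((prolog_cost + epilog_cost) / saved_per_cycle) + 1


# Hypothetical numbers: 2 time units saved per cycle, 5 units each for
# prolog and epilog.
gain_short_run = pipeline_gain(3, 2, 5, 5)
min_cycles = break_even_cycles(2, 5, 5)
```

With these numbers a run of 3 cycles loses time, and at least 6 cycles are needed before the pipeline pays off. With more processors the constant-saving assumption no longer holds, which is why a search over complete schedules is used instead.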
IV. COMMUNICATION OPTIMIZATION TECHNIQUES ALLOCATION METHOD
After investigating the problems in Section III, we have developed some general ways to avoid or handle the problems, which will be first discussed in Section IV-A. Then in Section IV-B, we propose BFCO to allocate message aggregation and communication pipeline on an initial schedule to obtain a minimum schedule length integrating the solutions in Section IV-A.
A. GENERAL SOLUTIONS TO THE PROBLEMS
1) MESSAGE AGGREGATION DEADLOCK PROBLEM SOLUTION
1) Paths between the aggregated tasks. As this kind of deadlock may happen between each pair of tasks to be aggregated, we can first check if there are paths between any pair of sending (or receiving) tasks to be aggregated and those with paths will not be aggregated.
As there will also be cases like the complex example in Section III-A.1, we develop an algorithm to solve the problem as demonstrated in Alg. 1.
Algorithm 1 Deadlock Path Problem Solution
Require: A message aggregation allocation matrix
Ensure: A message aggregation matrix without deadlock path problem
1: repeat
2:   Find the aggregated task sets SET according to the allocation matrix
3:   for set_i ∈ SET do
4:     all_no_path = 1
5:     for task t_i, t_j ∈ set_i (i ≠ j) do
6:       check if there is any path between t_i and t_j
7:       if has path between t_i and t_j then
8:         Find each task t_k satisfying at least one of the following constraints:
9:           (1) matrix(t_i, t_k) == MA
10:          ...
13:        Build a task set TMP containing all t_k
14:        if TMP is not empty then
15:          Randomly choose a task t_m ∈ TMP
16:          Change t_m's constraint value into NOMA
           ...
21:    end for
22:    if all_no_path then
23:      set_i does not have deadlock path problem
24:    end if
25:  end for
26: until no set has deadlock path problem
The algorithm first divides the sending tasks into several sets so that each set contains a group of aggregated tasks (line 2). Then, in each set we traverse all pairs of tasks (line 5) and check if there is a path between each pair of tasks (line 6). If no paths exist between any pair of tasks in one set, then the set is feasible (line 23). If all sets are feasible, the whole message aggregation allocation is feasible. However, if there is any path between some pair of tasks in one set, we record the pair of tasks, find the other tasks that meet the constraint that the corresponding bits are ''MA'' (lines 8-13), and randomly choose one of the other tasks to turn the corresponding ''MA'' bit into a ''NOMA'' bit (lines 14-17), which eliminates one message aggregation on a pair of tasks. The evaluation and the elimination continue until the whole allocation is feasible (line 26). Fig. 6 shows an example of this algorithm with 4 tasks S0, S1, S2 and S3, where there is a path between S0 and S1 and no paths between any other two. Message aggregation is first allocated as presented in Fig. 6(a). Under this allocation, S0, S1, S2 and S3 are all forced to be aggregated. However, S0 and S1 cannot be aggregated, and we find S3 and S2 with the bits matrix(S0, S3) = MA and matrix(S1, S2) = MA. Then we randomly choose from S0 and S3 to turn its ''MA'' to ''NOMA''. In this way, only S1, S2 and S3 are aggregated and the allocation is feasible as shown in Fig. 6(b).
2) No paths between the aggregated tasks but with the conditions of forming loops. Unlike the first kind, it is difficult to evaluate which aggregations will cause the loops before the tasks are aggregated because different aggregations greatly affect the final topologies. 
Therefore, to avoid this kind of deadlock problem, for an input message aggregation allocation, we iteratively check the loop condition after aggregation, trace the aggregated communication vectors in the loop, and randomly eliminate the aggregation of some communication vectors until there is no loop after aggregation. 3) Schedule sequence after aggregation. According to the policy of message aggregation, each sending task to be aggregated will be combined with the last sending task to be aggregated in the schedule sequence on the processor, and each receiving task to be aggregated will be combined with the first receiving task. Therefore, for each aggregated sending task, if it can reach the other tasks of the same processor through the graph topology, it should be placed before the tasks. If other tasks of the same processor can reach it, it should be placed after the tasks. The aggregated receiving tasks undergo an opposite process of the sending tasks.
An example is illustrated in Fig. 6(c) to handle the problem in Fig. 4(d) . We first check the schedule sequence on P0, where the only aggregated task is S01. As R2 has a path to S01, S01 should be placed after R2 and the sequence is kept unchanged. Then we check P1, where the only aggregated task is R01. As S2 has a path to R01, R01 should be placed after S2 so the order of R01 and S2 is exchanged. Using the above policy, the aggregated tasks will be rearranged into feasible schedules for each processor.
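The loop check used for the second kind of deadlock can be sketched as follows. This is a minimal illustration under stated assumptions: the task graph is given as a list of dependency edges, each aggregated group is contracted into one node, and the functions `merge_nodes` and `has_loop` as well as the small graph are hypothetical (the graph is built for the illustration and is not a transcription of Fig. 4).

```python
from collections import defaultdict, deque


def merge_nodes(edges, groups):
    """Contract each aggregated group into one node and drop self-loops."""
    rep = {}
    for group in groups:
        name = "+".join(group)
        for task in group:
            rep[task] = name
    merged = set()
    for u, v in edges:
        mu, mv = rep.get(u, u), rep.get(v, v)
        if mu != mv:
            merged.add((mu, mv))
    return merged


def has_loop(edges):
    """Kahn's algorithm: a loop exists iff some node is never released."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    ready = deque(n for n in nodes if indeg[n] == 0)
    released = 0
    while ready:
        n = ready.popleft()
        released += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return released != len(nodes)


# Hypothetical graph: no path between S0 and S2, nor between S1 and S3.
edges = [("S0", "R0"), ("S2", "R2"), ("S1", "R1"), ("S3", "R3"),
         ("R0", "S1"), ("R3", "S2")]
groups = [["S0", "S2"], ["R0", "R2"], ["S1", "S3"], ["R1", "R3"]]
loop_before = has_loop(edges)
loop_after = has_loop(merge_nodes(edges, groups))
```

The original graph is acyclic, while the contracted graph contains a loop through all four aggregated nodes, so at least one of the aggregations has to be removed. Which one to remove is left open here; the text above chooses it randomly among the aggregated communication vectors in the loop.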
2) MESSAGE AGGREGATION LATENCY PROBLEM SOLUTION
This problem is tightly coupled with the scheduling sequence and cannot be easily quantified. For an initial message aggregation and schedule sequence, a direct idea is to find high quality solutions with search-based meta-heuristics. Meta-heuristics can iteratively generate and evaluate solutions. While searching for better solutions, the latency problem is gradually solved.
3) COMMUNICATION PIPELINE INEFFECTIVE PROBLEM SOLUTION
In Section III-B.1, we have found the numerical relationship to measure this problem. Therefore, for a communication pipeline allocation under an initial schedule, we first find all the receiving tasks using communication pipeline and their corresponding sending tasks. Then for each such receiving task, we measure its TI S and TEI R . If TI S ≥ TEI R , the communication pipeline is ineffective and should be eliminated. Otherwise, the communication pipeline is effective and is retained. After eliminating all communication pipelines with the ineffective problem, we obtain an allocation that keeps as many effective communication pipelines as possible.
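The filtering step above amounts to one comparison per pipelined receiving task. A minimal sketch, assuming the two intervals have already been measured and are supplied as a pair per receiving task (the function name, the dictionary layout, and the numbers are hypothetical):

```python
def effective_pipelines(candidates):
    """Keep only receiving tasks whose pipeline satisfies TI_S < TEI_R.

    candidates maps a receiving-task name to a (ti_s, tei_r) pair.
    """
    kept = {}
    for recv, (ti_s, tei_r) in candidates.items():
        if ti_s >= tei_r:
            continue  # ineffective: eliminate this communication pipeline
        kept[recv] = (ti_s, tei_r)
    return kept


# Hypothetical measurements for three receiving tasks.
candidates = {"R0": (12, 8), "R1": (5, 9), "R2": (7, 7)}
kept = effective_pipelines(candidates)
```

Only R1 survives in this example; R2 is removed as well because the condition includes equality.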
4) COMMUNICATION PIPELINE PROLOG AND EPILOG PROBLEM SOLUTION
As it is difficult to evaluate communication pipeline on which tasks cause the prolog and epilog problem, we develop a heuristic algorithm to handle the problem, as shown in Alg. 2.
For a certain schedule, we can calculate the schedule length of the application recursively using the policy that each task can only start after its predecessors and the tasks before it on the same processor finish. Therefore, after eliminating the ineffective communication pipeline, we can compare the schedule length after applying communication pipeline with the one before communication pipeline (line 4). If the schedule length is larger than the one before applying communication pipeline, we gradually eliminate the communication pipeline application on the receiving tasks which may lead to longer prolog and epilog to reduce the schedule length (line 5-line 12). For a schedule list on a processor, if there are many tasks before the receiving task R that can reach R, then when communication pipeline is applied to R, these tasks all have to be preprocessed, which leads to a long prolog. Therefore, when eliminating the communication pipeline application, we find the R with the most such dependent tasks and eliminate its communication pipeline, and this process continues until the schedule length with communication pipeline is smaller than that without communication pipeline.
Algorithm 2
...
5:   for R_i ∈ R with communication pipeline do
6:     Find its location rloc in its processor
7:     Calculate the number of tasks num that can reach R before rloc
8:     if num > maxnum then
9:       maxnum = num
10:      ...
11:    end if
12:  end for
13:  Eliminate the communication pipeline on Rmax and get new cur_cp
14:  sched_length = calc_sched_length(cur_cp)
15: end while
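The recursive schedule length calculation mentioned above can be sketched in a few lines. This is a simplified illustration, not the calc_sched_length routine itself: it assumes a deadlock-free schedule, treats communication tasks like any other task with a fixed duration, and ignores the pipelining of successive cycles; the function name and the example schedule are hypothetical.

```python
from functools import lru_cache


def schedule_length(duration, preds, proc_order):
    """Makespan of a fixed, deadlock-free schedule.

    duration:   task -> execution time
    preds:      task -> list of predecessor tasks in the task graph
    proc_order: list of per-processor task sequences
    """
    # The task scheduled immediately before each task on its processor.
    prev_on_proc = {}
    for seq in proc_order:
        for earlier, later in zip(seq, seq[1:]):
            prev_on_proc[later] = earlier

    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts once its predecessors and the previous task on the
        # same processor have finished.
        waits = list(preds.get(task, []))
        if task in prev_on_proc:
            waits.append(prev_on_proc[task])
        start = max((finish(w) for w in waits), default=0)
        return start + duration[task]

    return max(finish(t) for t in duration)


# Hypothetical two-processor schedule: F0 and S on P0, R and F1 on P1.
duration = {"F0": 3, "S": 1, "R": 1, "F1": 4}
preds = {"S": ["F0"], "R": ["S"], "F1": ["R"]}
length = schedule_length(duration, preds, [["F0", "S"], ["R", "F1"]])
```

For this chain the length is the sum of all durations. On a schedule that contains a deadlock the recursion would not terminate, so the feasibility checks of Section IV-A.1 have to run first.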
B. BFCO
Based on the advantages and disadvantages of message aggregation and communication pipeline, we propose BFCO to allocate message aggregation and communication pipeline for a certain schedule. BFCO exploits the BPSO metaheuristic to tackle the problem of message aggregation and communication pipeline application. BPSO is used for the following reasons.
1) As there are no certain principles on how to exploit message aggregation without incurring the latency problem, we use a random-search-based meta-heuristic algorithm to solve the problem due to its probabilistic nature and optimality.
2) Of all the population-based meta-heuristics, PSO is easy to implement and requires few parameters to tune. It has a binary version, BPSO, which meets the demand that the message aggregation allocation can be encoded in binary form. Moreover, it is hard to apply crossover or mutation to the binary representation of message aggregation because of the complicated constraints discussed in Section III-A, so genetic-type algorithms are not suitable, whereas BPSO fits well.
Fig. 7 shows the flow of BFCO to allocate message aggregation and communication pipeline. The overall idea is to use BPSO to allocate message aggregation iteratively, and under each message aggregation, communication pipeline is allocated as much as possible. The problems and the solutions discussed above are all considered in the whole flow. In this way, the solution space for the BPSO is limited to the feasible message aggregations while the performance is optimized as far as possible, which gives the presented method both high quality and low time consumption. To represent the BPSO solutions, we use an upper-triangle matrix to represent the message aggregation on all sending tasks, which has been used to express the problems and solutions in previous sections, as shown in Fig. 4 and Fig. 6. Note that the matrix is upper-triangular because aggregating the row task with the column task and aggregating the column task with the row task have the same effect. Each matrix bit can have 4 possible values: MA, NOMA, PATH, and DIFFP.
• MA: the row task and the column task are aggregated.
• NOMA: the row task and the column task are not aggregated.
• PATH: there is a path from the row task to the column task, so the row task and the column task should not be aggregated.
• DIFFP: the row task and the column task are on different processors, so it is impossible that the row task and the column task are aggregated. With the solution representation, the flow consists of three phases: initialization, new solution generation, and ending and velocity update. 1) Initialization. This phase initializes a swarm of S particles and S is set by the users according to the problem specifications. Each particle consists of its current position, i.e. current solution, x(i) ∈ {0, 1} n , its local best position, i.e. local best solution, x * (i) ∈ {0, 1} n , and its velocity v(i) ∈ R n (1 ≤ i ≤ S). In this phase, x(i) is initialized and also set as x * (i). The velocity v(i) of each particle is set as 0. To initialize x(i), as ''DIFFP'' bits and ''PATH'' bits are decided by the task mapping and application topology, which will be kept constant throughout the BPSO process, we can first determine these bits. We traverse the matrix and check if a row task and a column task are on the same processor. If not, the bit is set as ''DIFFP''. Otherwise, we continue to check if the receiving tasks of the row task and the column task are on the same processor. If not, the bit is also set as ''DIFFP''. Then, we check if there is path between the row task and column task. If there is, the bit is set as ''PATH''. As a path between the sending tasks will lead to a path between the corresponding receiving tasks, we do not need to check the path conditions between the receiving tasks. After determining the ''DIFFP'' and ''PATH'' bits, the other bits in the upper-triangle matrix are randomly set as ''MA'' or ''NOMA''. 2) New solution generation. For the first generation, this phase makes the initial solutions feasible and adds communication pipeline allocation. For other generations, this phase first performs ''bit change'' to generate new solutions based on previous generations, then makes the new solutions feasible and adds communication pipeline allocation. 
After we achieve a feasible solution containing both message aggregation and communication pipeline allocation, we can set the local best solution with the smallest fitness value in this generation, and the global best solution with the smallest fitness value of all local best solutions. To further clarify this phase, we divide this phase into three processes: bit change, message aggregation feasibility, and communication pipeline application.
• Bit change. In the standard BPSO, each bit of each x(i) is set to "1" with a probability given by the sigmoid function of the corresponding velocity component $v_j(i)$,

$$S(v_j(i)) = \frac{1}{1 + e^{-v_j(i)}},$$

otherwise, the bit is 0. In our problem, "MA" corresponds to the "1" case and "NOMA" to the "0" case. After this process, new solutions are obtained based on the updated velocities.
• Message aggregation feasibility. For an input message aggregation, we check whether it has the deadlock problem, considering all three kinds of deadlocks: paths between the aggregated tasks, no paths between the aggregated tasks but conditions that form loops, and the schedule sequence after aggregation. Note that for the first kind of deadlock, as we have already considered the simple path conditions for any pair of sending/receiving tasks, this process only needs to consider the complex conditions caused by forced aggregation.
• Communication pipeline application. 
towards both own best position and the global best position, which are respectively represented by the learning factor for the cognitive component $c_1$ and the learning factor for the social component $c_2$. $r_1$ and $r_2$ are two random values that are uniformly distributed in [0, 1]. $\omega$ is the inertia weight, and we decrease it linearly with each iteration according to

$$\omega = \omega_1 - (\omega_1 - \omega_2)\,\frac{t}{t_{\max}},$$

where $t_{\max}$ is the maximum number of iterations, $t$ is the current iteration number, and $\omega_1$ and $\omega_2$ are the maximum and the minimum values of $\omega$. Moreover, to ensure convergence of the heuristic, $v$ is bounded in the interval $[-v_{\max}, v_{\max}]$. We use (4) from [21] and n is the number of the feasible bits of the message aggregation.
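The bit classification performed in the initialization phase can be illustrated with a minimal sketch. It assumes that the rows and columns of the matrix are sending tasks, that each sending task has exactly one receiving task, and that the task graph is given as a successor map. The names `classify_bits`, `has_path`, `proc`, `receiver`, and `succ`, as well as the four-task example, are hypothetical and are not taken from the implementation described here.

```python
import random

# Bit labels, named after the four cases in the text.
DIFFP, PATH, MA, NOMA = "DIFFP", "PATH", "MA", "NOMA"


def has_path(succ, src, dst):
    """Return True if the task graph has a directed path from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return False


def classify_bits(senders, proc, receiver, succ, rng):
    """Label each (row, column) pair in the upper triangle of the matrix."""
    bits = {}
    for i, row in enumerate(senders):
        for col in senders[i + 1:]:
            same_send_proc = proc[row] == proc[col]
            same_recv_proc = proc[receiver[row]] == proc[receiver[col]]
            if not (same_send_proc and same_recv_proc):
                bits[(row, col)] = DIFFP      # fixed by the task mapping
            elif has_path(succ, row, col) or has_path(succ, col, row):
                bits[(row, col)] = PATH       # fixed by the topology
            else:
                bits[(row, col)] = rng.choice((MA, NOMA))  # free bit
    return bits


# Hypothetical example: four sending tasks s1..s4 with receivers r1..r4.
proc = {"s1": 0, "s2": 0, "s3": 1, "s4": 0,
        "r1": 1, "r2": 1, "r3": 0, "r4": 1}
receiver = {"s1": "r1", "s2": "r2", "s3": "r3", "s4": "r4"}
succ = {"s1": ["s4"]}  # s4 depends on s1
bits = classify_bits(["s1", "s2", "s3", "s4"], proc, receiver, succ,
                     random.Random(0))
free_bits = [pair for pair, label in bits.items() if label in (MA, NOMA)]
```

Only the pairs in `free_bits` would take part in the later bit change; the "DIFFP" and "PATH" entries stay fixed, which is why n counts only the feasible bits.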
V. EXPERIMENTS
To show the efficiency of the proposed method for message aggregation and communication pipeline allocation, we use both synthetic and real-life benchmarks: a task graph generated by Task Graphs For Free (TGFF) [22], several task graphs abstracted from real applications in Standard Task Graphs (STG) [23], and an actual H.264 baseline decoder application. All applications are designed to execute on a 4/8/16-CPU platform (denoted by 4/8/16P). As no previous work specifically proposes a method for message aggregation and communication pipeline allocation, we compare the proposed method with [9] and [5] on schedule length. Reference [9] first comprehensively introduces message aggregation and communication pipeline and gives some simple directions on their application, so we denote it by MACP. Its allocation policy is to allocate communication pipeline as much as possible and to allocate message aggregation to the tasks with communication pipeline. As its discussion of message aggregation deadlock is not sufficient, we add the constraints discussed in Section III-A to avoid infeasible message aggregation allocation for large graphs. Reference [5] uses retiming to perform task-level pipelining, which overlaps communications with computations to reduce communication overhead and is similar to our communication pipeline, so we denote it by PIPE. As [5] is based on periodic tasks while our work does not have this feature, we adopt its idea but modify the algorithms so that it is comparable with our work. In the following subsections, we first introduce the experimental platform in Section V-A and then discuss the experimental results in Section V-B.
A. PLATFORM
1) HARDWARE PLATFORM
The whole implementation runs on 64-bit Ubuntu 4.0.4 with an Intel Core i5 CPU at 2.3 GHz and 4 GB of RAM. The experimental MPSoC platform has flexible configurations and processor scalability, as shown in Fig. 8(a). The platform contains at most 16 CPU subsystems, a memory subsystem, a peripheral subsystem, and an interconnection subsystem. Each CPU subsystem uses a 32-bit local bus matrix to connect one processor with other local components. The processor is configured as a 32-bit, 7-stage pipeline CKCore RISC processor [24] without data cache. The memory subsystem uses a 64-bit local bus matrix to connect on-chip global SRAM and off-chip DDR2 SDRAM. These three subsystems are each connected to the Distributed Memory Server (DMS) interconnection subsystem through a Memory Service Access Point (MSAP) [25] implemented by DMA. The DMS acts as a server that provides communication and synchronization services to the clients. Each MSAP delivers data transfer requests issued by its corresponding subsystem to other MSAPs via the control network. In this paper, memory-related constraints (e.g., memory size and memory bandwidth constraints) are not considered for now.
2) SOFTWARE PLATFORM
The programs are implemented in C, and compiled and linked with gcc. For the BPSO parameters, we set $c_1 = c_2 = 2$, $\omega_1 = 0.9$, and $\omega_2 = 0.1$. The swarm size varies from 4 to 20 according to the scale of the graphs, and the maximum number of generations $t_{\max}$ is 100. These values are set empirically, and users can adjust them according to the actual situation.
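With the parameter values above, one velocity update followed by the bit change could be sketched as follows. This is a minimal sketch that assumes the standard BPSO velocity update and sigmoid rule; the velocity bound `V_MAX = 4.0` is an assumed placeholder, since the bound used in this work comes from [21], and the function names `inertia` and `update_particle` are hypothetical. The sketch is written in Python for brevity and does not reproduce the C implementation.

```python
import math
import random

C1 = C2 = 2.0               # learning factors from the text
W_MAX, W_MIN = 0.9, 0.1     # maximum and minimum inertia weight
T_MAX = 100                 # maximum number of generations
V_MAX = 4.0                 # assumed velocity bound (not given in the text)


def inertia(t, t_max=T_MAX):
    """Inertia weight decreasing linearly from W_MAX to W_MIN."""
    return W_MAX - (W_MAX - W_MIN) * t / t_max


def update_particle(x, v, local_best, global_best, t, rng):
    """Update velocity, then redraw each free bit with sigmoid probability."""
    w = inertia(t)
    for j in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vj = (w * v[j]
              + C1 * r1 * (local_best[j] - x[j])
              + C2 * r2 * (global_best[j] - x[j]))
        v[j] = max(-V_MAX, min(V_MAX, vj))          # clamp to [-V_MAX, V_MAX]
        prob_one = 1.0 / (1.0 + math.exp(-v[j]))    # sigmoid of the velocity
        x[j] = 1 if rng.random() < prob_one else 0  # 1 = "MA", 0 = "NOMA"


# Hypothetical particle with three free bits at generation t = 10.
rng = random.Random(1)
x, v = [0, 1, 0], [0.0, 0.0, 0.0]
update_particle(x, v, local_best=[1, 1, 0], global_best=[1, 0, 0], t=10, rng=rng)
```

In the flow described in this paper, the resulting bit vector would still have to pass the message aggregation feasibility check and receive its communication pipeline allocation before its fitness is evaluated; the sketch covers only the generic BPSO part.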
The proposed communication optimization method has been integrated with the Simulink-based MPSoC design platform, the LESCEA multithreaded code generator [20]. LESCEA takes a Simulink-modeled application as input, generates a set of multithreaded C code, and builds software stacks on the target hardware architecture. As shown in Fig. 8(b), the general multithreaded code generation flow contains five main steps: task mapping, task scheduling, communication optimization, thread code generation, and hardware dependent software (HdS) adaption.
1) Task mapping. Tasks are allocated to processors. In this work, each task represents a thread and there are multiple threads on each processor. After this step, we get the start of LESCEA.
2) Task scheduling. The execution sequence of tasks on each processor is determined in this step. After this step, we obtain an initial schedule for the communication optimization techniques.
3) Communication optimization. Message aggregation and communication pipeline are allocated on communication tasks using the proposed BPSO-based method.
4) Thread code generation. After the mapping and scheduling result is determined, this step generates a set of C code, including memory declarations and function calls related to the task scheduling results, and maps the memory space and function parameters.
5) HdS adaption. This step generates the main function code for threads, initializes communication channels for the CPU subsystems, and generates a Makefile that links the threads with the HdS library.

FIGURE 9. Results for the small DAG generated from TGFF.
B. EXPERIMENTAL RESULTS

1) TGFF
TGFF is a widely used DAG generator with which users can easily generate tasks with different computation and communication features. We use it to generate a small DAG with 16 tasks and 22 edges. To test the efficiency of the proposed method under different computation and communication amounts, we use Communication-to-Computation Ratios (CCRs) of 0.1 (denoted by MSP), 1 (denoted by MEP), and 10 (denoted by MLP). Moreover, to evaluate the effect of different ratios of communication startup time to transfer time, we set two ratios, 1:1 (denoted by SL) and 1:10 (denoted by SS), for each CCR. The graph is executed 10 times. As the DAG has only 16 tasks, we evaluate it on the 4/8-CPU platforms.
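To make the configuration labels concrete, the following minimal sketch derives a communication time from a computation time and a CCR, and then splits it into startup and transfer parts. It assumes that the CCR is applied as a simple multiplier on computation time and that the startup-to-transfer ratio is applied per message; the exact procedure used to generate the benchmark is not specified here, and the function names and the 110-unit task are hypothetical.

```python
def comm_time_for_ccr(comp_time, ccr):
    """Communication time implied by a computation time and a CCR."""
    return comp_time * ccr


def split_comm_time(comm_time, startup_parts, transfer_parts):
    """Split a communication time into (startup, transfer) by a ratio."""
    startup = comm_time * startup_parts / (startup_parts + transfer_parts)
    return startup, comm_time - startup


# Hypothetical task with 110 time units of computation.
for name, ccr in (("MSP", 0.1), ("MEP", 1), ("MLP", 10)):
    comm = comm_time_for_ccr(110.0, ccr)
    print(name, "SL (1:1)", split_comm_time(comm, 1, 1),
          "SS (1:10)", split_comm_time(comm, 1, 10))
```

Under these assumptions, the SS setting leaves only about one eleventh of each message as startup time, so message aggregation has less to save there than under SL.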
The experimental results are shown in Fig. 9. From the results, we see that the proposed BFCO achieves better performance than both previous works, MACP and PIPE, under different computation and communication features for a given schedule. Compared to PIPE, there is an average performance improvement of 2.25% for MSP, 6.51% for MEP, and 15.19% for MLP, under different communication features and different numbers of processors. Compared to MACP, the improvement is 1.66% for MSP, 1.70% for MEP, and 3.68% for MLP. For both comparisons, the improvement percentage increases with CCR, which indicates that the proposed method is particularly effective for graphs with larger communications. PIPE only pipelines tasks to reduce communication transfer time and does not take reducing communication startup time into account, whereas BFCO considers both communication components and thus achieves better performance. MACP considers both communication components, but it applies communication pipeline as much as possible and does not account for the ineffective problem and the prolog and epilog problem, while BFCO uses a meta-heuristic algorithm to iteratively find a high-quality allocation of the two techniques considering their problems. Although BFCO may not give the optimal solution because it handles these complicated problems with simplified methods, we think it can still outperform many other methods.
To further show the effectiveness of our proposed method, we also calculate the average communication transfer time and startup time per processor, as shown in the lower three subfigures of Fig. 9. We observe that BFCO shows at most 33.59% smaller average communication time than PIPE and at most 20.29% smaller than MACP, which contributes most of the performance improvement. One seemingly confusing fact is that PIPE was originally proposed to totally remove inter-processor communications, but in our experiment there is still communication overhead. The reasons are threefold. First, PIPE only considers the communication transfer time and neglects the communication startup time, which cannot be overlapped by computation. Second, PIPE was initially designed for streaming applications where computations are much larger than communications, so computations can totally overlap communications. However, in our applications, there are cases where communications are much larger than computations, and thus computations cannot overlap all communications. Third, after we modify the algorithms to make PIPE comparable with BFCO, not every task is retimed into the prolog phase in order to minimize the schedule length, which trades some communication overhead for overall performance.
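The first two reasons can be illustrated with a deliberately simplified cost model. It assumes that pipelining hides transfer time only up to the amount of computation available for overlap and never hides startup time. This is an illustrative assumption rather than the fitness function used by BFCO or PIPE, and the function name and numbers are hypothetical.

```python
def exposed_comm_time(startup, transfer, overlap_comp, pipelined):
    """Communication time left on the critical path in a simplified model."""
    if not pipelined:
        return startup + transfer
    # Startup is never hidden; transfer is hidden only up to overlap_comp.
    return startup + max(0.0, transfer - overlap_comp)


# Computation-dominated case: only the startup time remains visible.
small_comm = exposed_comm_time(2.0, 20.0, overlap_comp=30.0, pipelined=True)
# Communication-dominated case: part of the transfer time stays exposed.
large_comm = exposed_comm_time(2.0, 20.0, overlap_comp=5.0, pipelined=True)
```

In this model the startup term survives in both cases, which is the part that message aggregation targets by letting several messages share one startup.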
To further analyze the results, we consider communication startup time and transfer time separately. For startup time, compared to PIPE, BFCO reduces the startup time as high Compared to MACP, the highest reduction is 52.12% for 4P and 18.85% for 8P, which indicates the effectiveness of BFCO on communication pipeline allocation on different applications. The improvement percentage for 8P is smaller than that for 4P because, as the number of processors increases, the communication dependencies become more complicated, which gives more chances to expose the problems of message aggregation and communication pipeline. Moreover, although in a small number of cases BFCO shows all or part of the communication overhead to be larger than that of MACP, its total performance is still better than MACP because the idle time is reduced by the changes in the schedule sequence.
2) STG
STG is a benchmark for the evaluation of multiprocessor scheduling algorithms. It includes both random DAGs and actual application DAGs. We use two actual application task graphs as representatives: robot control (denoted by RBT) and a part of fpppp in the SPEC benchmarks (denoted by F4P). Robot control has 88 tasks and 131 edges. SPEC fpppp has 334 tasks and 1145 edges. As STG provides only the task computation time and task relations but lacks the communication time, we randomly allocate communication time for the task graphs based on the same CCRs and communication ratios as in Section V-B.1. Each application is also executed 10 times.
The experimental results for RBT are shown in Fig. 10. From the results, we see that BFCO also achieves better performance than both previous works under different computation and communication features for a given schedule. Compared to PIPE, there is an average performance improvement of 8.69% for MSP, 9.73% for MEP, and 20.08% for MLP, under different communication features and different numbers of processors. Compared to MACP, the improvement is 11.19% for MSP, 7.47% for MEP, and 11.90% for MLP. A main difference between the results of RBT and those of the randomly generated TGFF graph is that, for RBT, BFCO obtains a similar percentage of performance improvement over both PIPE and MACP, whereas for the TGFF-generated graph it shows a much larger improvement over PIPE than over MACP. That is, MACP cannot guarantee solution quality across different applications.
To further analyze the reasons for the performance improvements, we have also calculated the average communication startup time and transfer time, as shown in the lower three subfigures in Fig. 10. We find that BFCO has smaller communication overhead than PIPE. However, in some cases MACP reduces much more communication overhead than BFCO, yet its total performance is still not improved. One reason is that the communication cost reduction mainly comes from communication pipeline, but MACP does not consider its side effect: although communication pipeline reduces communication overhead, the prolog and epilog time accounts for a large portion of the total time. The other reason is that, even though message aggregation can reduce the startup time, MACP does not consider its latency problem, which causes performance degradation.
The experimental results for F4P are shown in Fig. 11, and BFCO still achieves better performance than both previous works under different computation and communication features for a given schedule. Compared to PIPE, there is an average performance improvement of 6.16% for MSP, 20.94% for MEP, and 27.90% for MLP, under different communication features and different numbers of processors. Compared to MACP, the improvement is 26.17% for MSP, 32.52% for MEP, and 30.88% for MLP. As F4P is a large graph with hundreds of tasks and more than a thousand edges, the characteristics of the three methods are more evident. BFCO and PIPE show a stable performance trend, while MACP shows unstable solution quality. The average communication time per processor is shown in the lower three subfigures in Fig. 11, and the communication overhead trend is similar to that of RBT.

3) H.264
The experimental results for H.264 are shown in Fig. 12. The results of BFCO on the real application on a real platform also demonstrate better performance than MACP and PIPE for a given schedule. Compared to PIPE, there is a total performance improvement of 13.83% for 4P, 5.58% for 8P, and 4.33% for 16P. The overall communication cost reduction is 5.28% for 4P, 3.22% for 8P, and 1.53% for 16P. Compared to MACP, the total improvement is 14.59% for 4P, 7.66% for 8P, and 4.87% for 16P. The communication cost reduction is 7.59% for 4P, 1.98% for 8P, and 1.75% for 16P. The performance improvement and communication cost reduction are not as high as for the TGFF-generated graph and the STG graphs. The major reason is that, compared to those graphs, H.264 has a large number of tasks and its precedence dependencies are the most complicated. When BFCO allocates message aggregation and communication pipeline, it encounters all the problems discussed in Section III, which constrains the wide allocation of the communication optimization techniques. Moreover, when dealing with the problems, BFCO uses simple and direct methods, which may miss some chances for optimal solutions. Even so, BFCO achieves better solutions than the previous methods MACP and PIPE.
In addition, we have plotted the solving time of PIPE, MACP, and BFCO. Among the three methods, MACP consumes the least execution time because it does not require iterative computation. PIPE takes the longest solving time because it uses ILP to solve the problem, and its time consumption grows exponentially with the problem scale. BFCO takes more time than MACP because BPSO requires iterations to find final solutions, and less time than PIPE because BPSO does not traverse almost all of the solution space as ILP does, and because the BPSO solution space of message aggregation is smaller than that of the task-level pipeline in PIPE.
Based on the above experimental results, the proposed BFCO can achieve the best performance compared to PIPE and MACP within acceptable solving time.
VI. CONCLUSION
Message aggregation and communication pipeline are two fine-grained communication optimization techniques that can effectively reduce the communication overhead of MPSoC. This paper gives an in-depth discussion of their problems in application, and then proposes some general solutions and BFCO to allocate the two techniques under a given task schedule considering these problems. The method is also integrated into the LESCEA multithreaded code generator for high-quality code generation. Experimental results on both synthetic and real-life applications demonstrate the efficiency of the proposed method compared to previous works [5], [9]. In this paper, message aggregation and communication pipeline allocation targets an initial schedule, which has a great impact on the overall system performance. In the future, we will study how to allocate the two techniques along with task scheduling for better performance.
