Abstract-Multiprocessor systems-on-chip (MPSoCs) are evolving toward processor pool-based architectures that employ a hierarchical on-chip network for inter-processor-pool and intra-processor-pool communication. This letter presents a systematic exploration method for cascaded bus matrix-based on-chip network design in processor pool-based MPSoCs. It uses an evolutionary algorithm to find architectures that are optimal in terms of on-chip area while satisfying a given performance constraint. Since simulation is too time-consuming for evaluating the performance of complex on-chip networks during architecture exploration, we propose to prune the design space efficiently using two novel static analysis techniques: 1) bandwidth analysis considering task execution dependences, and 2) memory contention analysis for accurate performance estimation. Thanks to the fast and accurate evaluation provided by the proposed analysis techniques, we achieved an order-of-magnitude speedup of the architecture exploration without performance loss, compared with a simulation-based approach.
Fast Communication Architecture Exploration of Processor Pool-Based MPSoC via Static Performance Analysis
I. Introduction
As system complexity grows, it is considered promising and desirable to construct a whole system in a well-structured form with multiple subsystems. We call such a subsystem a processor pool (PP); it is normally composed of processing elements, memories, and an on-chip network. PP-based design has many benefits such as good scalability, design reuse of subsystems, and modularity [1]. PP-based systems usually have two levels of communication architecture: inter-PP and intra-PP communication. Local memory accesses go through the intra-PP communication architecture, which should therefore support relatively short and frequent memory accesses with small latency. In contrast, the inter-PP communication architecture is likely to deal with relatively infrequent but massive data transfers. This implies that the preferred configurations of the two communication architecture types differ considerably. Therefore, the design of a communication architecture for PP-based multiprocessor systems-on-chip (MPSoCs) should consider the different requirements of inter-PP and intra-PP communication and explore an explosively large search space. For such exploration, accurate performance evaluation of a communication architecture commonly resorts to simulation to account for its dynamic behavior due to resource contention. But time-consuming simulation prevents designers from exploring enough of the design space to choose desirable solution(s) within a limited time budget. This makes the application-specific optimization of a PP-based MPSoC a challenging problem.
This letter proposes a systematic communication architecture exploration method of PP-based MPSoCs to overcome those difficulties using an evolutionary algorithm. The goal of the proposed exploration is to find optimal architectures in terms of on-chip area while satisfying a given performance constraint. We consider cascaded bus matrix architectures for both inter-PP and intra-PP communication.
A large body of research has focused on the synthesis of on-chip bus architectures using heuristics or evolutionary algorithms [2], [3]. These methods, however, are not directly applicable to PP-based MPSoCs. The approach proposed in [4] is similar to ours in that the bandwidth requirement is statically estimated to prune the design space. However, it does not consider task dependences. Since a cascaded bus matrix architecture can provide high bandwidth at reduced cost, several approaches using simulated annealing [5] or mixed integer linear programming [6] have been proposed to find an optimal topology that satisfies the given bandwidth requirements, but they do not consider the latency aspect.
Even though there has been extensive research on network-on-chip (NoC) architecture exploration [7], [8], in our context these works consider inter-PP communication only. There are several formal approaches to modeling on-chip networks. At the transaction level, performance analysis techniques based on queuing theory have been proposed for hierarchical shared buses [9] and NoCs [7], [10], focusing only on a single clock domain or on communication between network tiles. The approach proposed in [11] also statically models contention due to shared resource access. Unlike ours, it relies not on analysis but on temporary cycle-accurate simulation for the statistical estimation of contention.
The contributions of our work are as follows. We propose a systematic method of communication architecture exploration that considers both inter-PP and intra-PP communication for PP-based MPSoCs and explores the extremely large design space efficiently. With the aid of two fast but accurate static analysis techniques, the proposed exploration method efficiently prunes the invalid design space, reducing the number of simulations required during the exploration. Further, since the proposed method is based on an evolutionary algorithm, it is flexible enough to accommodate different types of on-chip network architectures.
II. Application and Architecture Models
An application is specified as a set of coarse-grain tasks communicating with each other. A task is a primitive unit of mapping onto a processing element (PE). We distinguish two types of logical memory blocks (LMBs). One is dedicated to each task and the other is shared between tasks. A dedicated LMB is used for local memory access inside a task while a shared LMB is for inter-task communication. Each task in an application is assigned a deadline by which its execution is required to be completed once enabled.
0278-0070/$26.00 © 2011 IEEE
A PP-based MPSoC architecture consists of multiple processor pools and one single global communication architecture (GCA) for inter-PP communication as shown in Fig. 1. Each PP consists of PEs, on-chip memories, an interface to a GCA, and an interconnection network that connects hardware components. The GCA contains on-chip memories and an off-chip memory interface. We denote a memory component in the architecture by a physical memory block (PMB), to which LMBs of the task model are mapped.
A bus matrix consists of multiple master interfaces and slave interfaces. A master interface is connected to a PE or a slave interface of another bus matrix, which constitutes the network of bus matrices in a cascaded form. Since memory requests from different master interfaces may arrive at a slave interface simultaneously, they should be serialized according to a given arbitration policy. We call such a slave interface an arbitration point (AP). We assume that all bus matrices and PMBs in a PP use a single clock while PPs may have different clock frequencies from each other.
III. Proposed Architecture Exploration Method
The procedure of the static analysis-based exploration is shown in Fig. 2. Initially, from memory access traces of the given task model, we calculate the minimum bandwidth requirements for tasks to access LMBs as indicated in Fig. 2(a). Then, with the minimum bandwidth requirements, the task model, and the memory traces, we explore the design space by using an evolutionary algorithm, the quantum-inspired evolutionary algorithm (QEA) [12]. The goal of the QEA is to find optimal on-chip communication architectures in terms of on-chip area for a given performance constraint, which is represented as the grey box in Fig. 2(b); it creates solutions of the current iteration by probabilistically mutating the incumbent best architecture, which includes allocation of LMBs to PMBs, determination of bus matrix topologies, and selection of clock frequency and arbitration priorities. Then, we evaluate all solution candidates in two stages to select the best one of the current iteration.
The first stage consists of two sub-steps that prune the design space through the proposed static analysis techniques. First, the set of generated architectures is filtered by the given design constraints: 1) the sustainable bandwidth of each AP and PMB, and 2) the on-chip area constraint for bus matrices and memories. It should be noted that the objective of this sub-step is not to select high-quality solutions but to drop poor ones. The sustainable bandwidth of a PMB should be greater than the sum of the required minimum bandwidths of the LMBs that are mapped to the PMB and may be accessed at the same time. Likewise, any AP in a valid architecture should satisfy the bandwidth requirements associated with the tasks that may pass through the AP at the same time. As another constraint, on-chip area is obtained simply by summing up the pre-calculated areas of the components, i.e., bus matrices and on-chip memories. We discard solutions that fail to meet either the bandwidth requirements or the area constraint. Next, we evaluate the surviving architecture candidates to reduce their number further based on their fitness in terms of performance and area. This sub-step requires accurate estimation of the execution time of the target application over all architecture candidates. Even though the architectures violating design constraints have already been filtered out in the previous sub-step, a large number of solutions may remain. Thus the use of time-consuming simulation is not appropriate here, which necessitates static estimation of performance.
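The constraint filter of the first sub-step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` record, its field names, and the choice to accept equality in the bandwidth check are assumptions, and the required bandwidths are taken as already computed by the bandwidth analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical architecture candidate (all field names are illustrative)."""
    name: str
    # resource id (AP or PMB) -> (sustainable bandwidth, required minimum bandwidth)
    bandwidth: dict = field(default_factory=dict)
    # pre-calculated areas of the bus matrices and on-chip memories
    component_areas: list = field(default_factory=list)


def total_area(c):
    # On-chip area is the plain sum of pre-calculated component areas.
    return sum(c.component_areas)


def passes_constraints(c, area_limit):
    # Assumption: a resource whose sustainable bandwidth equals the
    # requirement is accepted; the text only says "greater than".
    bw_ok = all(sustainable >= required
                for sustainable, required in c.bandwidth.values())
    return bw_ok and total_area(c) <= area_limit


def prune(candidates, area_limit):
    # Drop poor candidates; no ranking happens at this point.
    return [c for c in candidates if passes_constraints(c, area_limit)]
```

The filter only rejects candidates; ranking the survivors is left to the static performance estimate.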
Afterward, the second stage selects the global best solution. Since the proposed static analysis techniques pruned the design space significantly, it is acceptable to use cycle-accurate simulation to evaluate the remaining solutions in this stage. If we obtain a better solution than the incumbent best solution, the new best solution is recorded for the next iteration of the exploration. The readers are referred to [13] for more details.
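One iteration of the two-stage evaluation can be sketched as below. The function and parameter names are hypothetical, every evaluator is passed in as a callable, and picking the `top_k` smallest-area candidates for simulation is an assumption standing in for the QEA fitness selection, which is not detailed here.

```python
def explore_iteration(incumbent, generate, is_valid, estimate_time,
                      simulate_time, area_of, deadline, top_k=2):
    """Return the best architecture after one iteration (illustrative sketch)."""
    # Stage 1a: drop candidates that violate bandwidth or area constraints.
    candidates = [c for c in generate(incumbent) if is_valid(c)]
    # Stage 1b: keep candidates whose statically estimated time meets the deadline.
    promising = [c for c in candidates if estimate_time(c) <= deadline]
    promising.sort(key=area_of)  # smaller area is better
    # Stage 2: simulate only a few survivors and update the incumbent.
    best = incumbent
    for c in promising[:top_k]:
        meets_deadline = simulate_time(c) <= deadline
        if meets_deadline and (best is None or area_of(c) < area_of(best)):
            best = c
    return best
```

The expensive `simulate_time` callable is invoked at most `top_k` times per iteration, which is the point of the pruning.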
IV. Static Performance Analysis
For the ease of explanation, we assume that a task accesses a single PMB. However, this analysis can be easily extended to the case of access to multiple PMBs during task execution [13] .
A. Bandwidth Analysis
For the static analysis, we model the task execution behavior as a sequence of blocks. Shared memory accesses for inter-task communication define a block that corresponds to a partition of task execution during which no inter-task communication occurs. Then, we calculate the minimum execution time of a block m, exe_min(m), as the sum of the CPU time that the block consumes and the communication time, which is assumed to be proportional to the burst length of each bus transaction to any logical memory. The bandwidth analysis aims to estimate the minimum bandwidth requirement of each AP and PMB for an architecture candidate, considering concurrent executions of the associated blocks. To do this, we first calculate the minimum bandwidth requirement related to each block by computing the bounds of block execution time.
Let us denote the earliest and latest start times of a block m by ES(m) and LS(m), respectively. Similarly, EF(m) and LF(m) correspond to the finish times. They are related as follows:
( 
where amt(m) is the amount of memory accesses during the execution of m. Note that we need to calculate (2) only once before entering the exploration loop. On the other hand, the bandwidth requirement for an AP should be calculated whenever we create a new architecture candidate during the exploration. To compute the minimum bandwidth requirement of an AP, we find all blocks that can be concurrently executed and need the AP, and sum up the required bandwidths of these blocks using (2). The minimum bandwidth requirement of PMB can also be formulated similarly.
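The per-AP summation can be sketched as follows. Two assumptions are made for illustration only: the per-block requirement is taken as amt(m) divided by the widest window LF(m) - ES(m), and two blocks are treated as potentially concurrent when their [ES, LF) windows overlap. The `Block` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    es: float        # earliest start time ES(m)
    lf: float        # latest finish time LF(m)
    amt: float       # amount of memory accesses amt(m)
    aps: frozenset   # APs on the block's access path


def block_min_bandwidth(b):
    # Assumed form: the data volume spread over the widest allowed window.
    return b.amt / (b.lf - b.es)


def ap_min_bandwidth(blocks, ap):
    """Peak summed requirement of blocks that use `ap` and may overlap in time."""
    users = [b for b in blocks if ap in b.aps]
    peak = 0.0
    # The peak of overlapping windows is always reached at some window start.
    for t in {b.es for b in users}:
        active = [b for b in users if b.es <= t < b.lf]
        peak = max(peak, sum(block_min_bandwidth(b) for b in active))
    return peak
```

The per-block values depend only on the task model, so they can be computed once; `ap_min_bandwidth` is the part that must be re-evaluated for every new architecture candidate because the `aps` sets change with the topology.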
B. Memory Contention Analysis
To estimate the execution time of a target application on a given bus matrix architecture, it is critical to estimate the contention and arbitration delay for a single bus transaction. The proposed contention analysis is illustrated in Fig. 3 with a simple example, where three PEs are assigned a single block each. The proposed analysis begins with identifying the blocks that can be executed concurrently. In the example, initially at time instance 0, only two blocks, MA_0 and MC_0, can be executed. Suppose that MA_0 and MC_0 are estimated to finish at time instances 30 and 60, respectively, by the contention analysis for a given communication architecture. Then, we take the earliest finish time of the blocks as the starting point of the next analysis. Before continuing the analysis, we update the memory access counts of blocks with the remaining memory accesses that should be considered in the next analysis. At time instance 30, the second round of the analysis is performed with blocks MB_0 and MC_0. If the earliest finish time is found to be 60, only MC_0 is left for the last analysis. In this way, we repeat the analysis until all blocks of the application are considered. The finish time of a task is the largest finish time of the blocks in the task. Likewise, the execution time of the application is the latest finish time of its tasks.
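The round-by-round procedure can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `access_time` is a hypothetical stand-in for the contention analysis, returning an average time per memory access for a block given the set of concurrently active blocks, and the dependence graph is assumed acyclic.

```python
def analyze(blocks, deps, access_time):
    """blocks: name -> number of memory accesses.
    deps: name -> set of predecessor block names.
    access_time(name, active) -> assumed average time per access."""
    remaining = dict(blocks)
    finish = {}
    now = 0.0
    while len(finish) < len(blocks):
        # Blocks that are ready: not finished and all predecessors finished.
        active = [b for b in remaining
                  if b not in finish
                  and all(p in finish for p in deps.get(b, ()))]
        alpha = {b: access_time(b, active) for b in active}
        estimate = {b: now + remaining[b] * alpha[b] for b in active}
        t_next = min(estimate.values())   # earliest finish starts the next round
        for b in active:
            if estimate[b] <= t_next + 1e-9:
                finish[b] = t_next
                remaining[b] = 0.0
            else:
                # Carry the unfinished accesses over to the next round.
                remaining[b] -= (t_next - now) / alpha[b]
        now = t_next
    return finish


# Toy contention model (hypothetical numbers): a base access time plus a
# penalty contributed by every other active block.
BASE = 0.5
PENALTY = {"MA0": 0.5, "MB0": 1.0, "MC0": 0.5}


def toy_access_time(name, active):
    return BASE + sum(PENALTY[o] for o in active if o != name)


result = analyze({"MA0": 30, "MB0": 30, "MC0": 60},
                 {"MB0": {"MA0"}}, toy_access_time)
print(result)  # {'MA0': 30.0, 'MB0': 60.0, 'MC0': 65.0}
```

With these made-up parameters the first round estimates finish times of 30 and 60 for MA0 and MC0, the second round starts at 30 with MB0 and MC0, and the last round handles MC0 alone, echoing the example in the text. The application execution time is the maximum of the returned finish times.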
The heart of the proposed execution time analysis is to estimate the arbitration delay of memory accesses. In a cascaded bus matrix, a memory request must be arbitrated at every AP that lies on a path to access a target PMB. We decompose the entire arbitration delay of a memory access into partial ones experienced at each of the APs on the path. In this way, we estimate the per-AP delay for a memory access. Let us denote the average execution time of a block m by exe(m), and let α(m) be the average time for a single access of m to an LMB, including communication overhead. Then
where ACC(m) is the number of accesses to an LMB during the execution of m. If we define path(m) as the set of APs that m traverses to access an LMB, α(m) is formulated as
where inv(m) is the average interval between consecutive bus transactions, bl(m) is the average burst length of a single bus transaction in time, and δ(m, ap) is the arbitration delay when a block m accesses an LMB through ap on path(m).
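Reading the symbol definitions above literally, one consistent form of the two relations is the following. It is inferred from those definitions and should be treated as an assumption rather than a quotation of the original equations.

```latex
\begin{align}
  \mathit{exe}(m) &= \mathit{ACC}(m)\,\alpha(m), \\
  \alpha(m) &= \mathit{inv}(m) + \mathit{bl}(m)
              + \sum_{ap \,\in\, \mathit{path}(m)} \delta(m, ap).
\end{align}
% Illustrative numbers (not from the paper):
% inv(m) = 20, bl(m) = 4, two APs with delays 1.5 and 0.5
%   => alpha(m) = 20 + 4 + 1.5 + 0.5 = 26
% ACC(m) = 100 => exe(m) = 100 * 26 = 2600 time units
```

Under this reading, every extra AP on the path adds its own expected arbitration delay to each access, so longer paths raise the block execution time even when no single AP is heavily loaded.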
Once a block acquires all the APs necessary to access a target LMB, the time to access the corresponding PMB is determined by the slowest clock frequency among the APs along the access path. Next, to calculate δ(m, ap), we compute the expected time during which a block m may be delayed by other blocks. Suppose that a fixed-priority arbitration policy is used. Then, we need to consider two cases of interference that m may experience. The first case is when a block m_L with a lower priority than m is already accessing an LMB through ap and m also wants to access its LMB through ap. Then, m should wait until the ongoing transaction by m_L is finished even though m has a higher priority. We denote by IL(m, m_L, ap) the average amount of time that a block m has to wait due to another block m_L on ap, which can be formulated as
(6)
The first term on the right-hand side of (5) corresponds to the probability that m_L is occupying ap. The expected remaining access time for an ongoing transaction of m_L is simply modeled as half of the duration for which m_L occupies ap, as indicated in the second term of (5).
As the second case, we need to consider two additional scenarios in which a higher-priority block interferes with m. First, m needs to wait for the completion of requests issued beforehand by the higher-priority blocks. Second, we should consider new access requests of the higher-priority blocks while m is waiting for the bus grant. We denote by IH(m, m_H, ap) the average time for which m is delayed by a higher-priority block m_H on ap. Then, it is
(7)
The first product in the equation, identically to (5), accounts for the blocking time when m_H is already using ap. The second product corresponds to the case in which m issues a request when m_H has already been waiting. This product consists of: 1) the probability that m_H is already waiting when m issues a new request, and 2) the amount of time that m should wait, i.e., the time for which m_H occupies ap. The last product corresponds to the case in which m_H requests the bus while m is waiting for the completion of the current bus transaction. In this case, m should also wait for the transaction of m_H to be completed. In this product, the first two terms are associated with the average number of requests of m_H while m is waiting.
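The lower-priority blocking term can be illustrated with a short sketch. The text fixes only the structure (an occupation probability times half of the occupation time); approximating the probability by bl(m_L)/α(m_L) and the occupation time by bl(m_L) is an assumption made here, and the function name is hypothetical.

```python
def lower_priority_wait(bl_low, alpha_low):
    """Average wait of m caused by one lower-priority block m_L on an AP.

    bl_low:    average burst length of m_L in time, bl(m_L)
    alpha_low: average time per access of m_L, alpha(m_L)
    """
    # Assumed: fraction of time during which m_L holds the AP.
    p_occupying = bl_low / alpha_low
    # From the text: expected remaining time is half of the occupation time.
    expected_remaining = bl_low / 2.0
    return p_occupying * expected_remaining


# Illustrative values only: a 4-unit burst issued once every 20 time units.
wait = lower_priority_wait(4.0, 20.0)
```

A lower-priority block that accesses memory rarely (large α) or in short bursts contributes little blocking, which matches the intuition behind the two terms.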
To fully formulate δ(m, ap), we first identify the blocks that may compete to use ap. We denote by MIH_{m,ap} (or MIL_{m,ap}) the set of master interfaces that have higher (or lower) priorities than the master interface that a block m uses. Then
Therefore, to determine the δ(·) values for every ap of an architecture, the iteration procedure is implemented using nested loops of iterative calculation. For example, the inner loop is associated with a single arbitration point while the outer loop is repeated until all δ(·) values of the architecture have converged. Experimental results confirmed that all parameters converge within a few hundred iterations.
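The nested iteration can be sketched as a fixed-point computation. The sketch rests on several assumptions: α(m) is taken as inv(m) + bl(m) plus the sum of the per-AP delays, every competitor contributes only a simplified blocking term (occupation probability times half the burst time) regardless of priority, and the inner loop is a single sweep over the users of one AP. Under this model each δ depends on the α values of competing blocks, which in turn depend on their own δ values, hence the iteration. All names are hypothetical.

```python
def solve_delays(blocks, tol=1e-9, max_iter=1000):
    """blocks: name -> {"inv": float, "bl": float, "path": [ap, ...]}.
    Returns (delta, iterations) with delta[(block, ap)] the arbitration delay."""
    delta = {(m, ap): 0.0 for m, b in blocks.items() for ap in b["path"]}

    def alpha(m):
        b = blocks[m]
        return b["inv"] + b["bl"] + sum(delta[(m, ap)] for ap in b["path"])

    aps = sorted({ap for b in blocks.values() for ap in b["path"]})
    for iteration in range(1, max_iter + 1):      # outer loop: until convergence
        change = 0.0
        for ap in aps:                            # inner loop: one arbitration point
            users = [m for m, b in blocks.items() if ap in b["path"]]
            for m in users:
                # Simplified interference model (assumption, see lead-in).
                new = sum(blocks[o]["bl"] / alpha(o) * blocks[o]["bl"] / 2.0
                          for o in users if o != m)
                change = max(change, abs(new - delta[(m, ap)]))
                delta[(m, ap)] = new
        if change < tol:
            return delta, iteration
    raise RuntimeError("arbitration delays did not converge")


# Illustrative input: two blocks sharing one AP.
example = {
    "A": {"inv": 20.0, "bl": 4.0, "path": ["ap0"]},
    "B": {"inv": 10.0, "bl": 4.0, "path": ["ap0"]},
}
delays, n_iter = solve_delays(example)
```

Starting from zero delay and updating in place lets each sweep use the freshest estimates; a full model would replace the simplified interference term with the priority-dependent IL and IH contributions.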
V. Experiments
We used two applications to verify the proposed exploration method. First, we made a synthetic application that consists of 19 tasks with 30 LMBs. The behavior of each task is modeled by generating traces of random memory accesses whose burst lengths and memory access intervals follow the Poisson distribution with given parameters. The associated target architectures are composed of 4 PPs, which have 2, 5, 5, and 2 PEs, respectively, and a GCA. The second application is an industrial-strength one: a picture-in-picture (PiP) application that consists of one H.264 encoder for a 4CIF-sized frame and two for a CIF-sized frame. This application consists of 39 tasks with 25 LMBs. The target architectures of the PiP 
A. Verification of the Execution Time Estimation Technique
The quality of the exploration technique depends on the accuracy of the proposed contention analysis compared with the simulation result. To verify the analysis technique proposed in Section IV-B, we performed extensive experiments with randomly generated traces. We investigate the effects of various parameters on the accuracy of the proposed analysis technique. The parameters include: 1) average memory access rate; 2) average burst length of a single bus transaction; 3) number of PEs; 4) cascaded bus topology; and 5) number of PMBs. For a given set of memory traces and architecture configurations varying these parameters, we obtain the execution time of an application using both the proposed analysis technique and cycle-accurate trace-driven simulation.
We performed 1000 random trace generations for each of the target architecture configurations to measure the average estimation error of the proposed technique compared to simulation. We use the root mean square (RMS) error as the accuracy metric of the static analysis. In Fig. 4, it is observed that the RMS error between the estimated and the simulated times for a given target architecture does not exceed 3% in all cases. The error tends to increase as the average memory access interval of a task becomes shorter. Since a short memory access interval means a high access rate, a task may experience much contention, which increases the estimation error. Similarly, the error also grows as the number of PEs increases, as shown in Fig. 4(a), since more memory contention is likely to occur at APs. A similar tendency is found as the average burst length becomes shorter, since the effective memory access rate grows. Fig. 4(c) shows that the errors are independent of the cascade level of a bus matrix topology. For instance, the cascade level of the architecture in Fig. 1 is 2. As modeled in Section IV-B, the expected delay of a memory access depends on the amount of simultaneous memory accesses from the other masters, incurring access conflicts on the APs on the
In the experiments associated with Fig. 4(d), we considered four complex architectures that consist of large numbers of cascaded APs and PMBs. The estimation error is still kept low because the memory accesses of tasks are distributed over the APs of the architectures and, in turn, each AP of the target architecture handles fewer bus transactions.
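The accuracy metric used in this subsection can be computed as sketched below. Normalizing each error by the simulated time is an assumption, made because the error is reported as a percentage; the function name is hypothetical.

```python
import math


def rms_relative_error(estimated, simulated):
    """RMS of (estimated - simulated) / simulated over paired runs."""
    if len(estimated) != len(simulated) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [(e - s) / s for e, s in zip(estimated, simulated)]
    return math.sqrt(sum(x * x for x in errors) / len(errors))


# Illustrative values only: two runs that are off by +2% and -2%.
rms = rms_relative_error([102.0, 98.0], [100.0, 100.0])
```

Because the errors are squared before averaging, over- and under-estimates do not cancel, which makes the RMS a stricter measure than the mean signed error.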
B. Analysis of Exploration Results
In the second set of experiments, we compared the proposed exploration, est, with a simulation-based approach, called sim, in terms of solution quality and exploration time. The sim approach evaluates all architecture candidates using cycle-accurate simulation throughout the exploration. This approach is expected to produce good-quality solutions at the cost of a huge simulation time overhead. The major difference between this approach and ours is the use of the memory contention analysis technique during exploration.
The performance comparison results are summarized in Table I. Note that we apply the bandwidth analysis in all approaches to reject invalid solutions immediately. Therefore, the number of generated solutions is larger than the total number of simulated solutions or the total number of analyzed solutions. The sim approach took longer to converge, as expected. The proposed approach, est, is faster by an order of magnitude. Nevertheless, the est approach found a solution whose quality is close to that of sim for both applications. In this experiment, the proposed analysis technique spent about 3 s, on average, to evaluate an architecture while the simulation took about 181 s.
The proposed memory contention analysis prunes the design space effectively, especially in the case of the synthetic application. The ratio of "Rejected" among all the analyzed solutions is 63%. Among the "Accepted" solutions, less than 10% underwent simulation, with negligible performance loss. On the other hand, the simulation-based approach has a high rejection ratio after paying the cost of simulation. For the PiP application, 89% of the analyzed solutions are accepted, among which only 6% underwent simulation. Note that 30% of the simulated solutions were rejected even in the proposed approach. This indicates that simulation is still unavoidable to account accurately for the contention delay. Not all architectures accepted by the analysis are subject to simulation; a few of them are selected by the fitness evaluation of the QEA [13]. The average error between the estimated and the simulated times for architecture candidates is just 5.5% for the synthetic application and 9.1% for the PiP application.
VI. Conclusion
In this letter, we presented a systematic exploration method of on-chip communication architecture for processor pool-based MPSoCs. In the current implementation, we considered bus matrix architectures for both the local communication network inside each processor pool and the global communication architecture. To avoid excessive use of time-consuming simulation during exploration, we proposed two static analysis techniques: 1) bandwidth analysis considering task execution dependences, and 2) memory contention analysis for the underlying bus matrix architecture. The experimental results showed that the proposed analysis techniques prune the design space effectively and reduce the exploration time without performance loss compared with the simulation-based approach.
