This article studies the scheduling of real-time streaming applications on multiprocessor systems-on-chips with a predictable memory hierarchy. An iteration-based task-FIFO co-scheduling framework is proposed for this problem. We obtain FIFO size distributions using Pareto space searching, based on which the task-to-processor mapping is obtained with the potential FIFO allocation being taken into account; then, the FIFO-to-memory allocation is optimized to minimize the total memory access cost; finally, a self-timed throughput analysis method that considers memory and direct memory access controller contention is utilized to analyze the throughput. Our methods are validated by a set of synthesized and practical applications on different platforms.
INTRODUCTION
Modern streaming applications belong to real-time applications that often have strict timing requirements. To guarantee the real-time performance, it is necessary to analyze the worst-case performance. However, traditional processors use a cache that is complex in structure, resulting in larger chip size, more power consumption, and lower access speed [Banakar et al. 2002] . In addition, the memory access is unpredictable because of the occurrence of cache miss, making it hard to estimate the worst-case performance. To overcome these problems, the predictable memory hierarchy, which, for example, uses scratch pad memories (SPMs) [Banakar et al. 2002] rather than caches, is gathering more and more attention. Like the cache, SPM is usually implemented using SRAM that provides high access speed. The difference is that the SPM guarantees single-cycle access latency, whereas the cache cannot provide such a guarantee due to cache miss. In contrast to caches, SPMs need to be explicitly controlled by the compiler or programmer, making it critical to design appropriate algorithms to make full use of them.
Streaming applications, such as video en/decoding, voice processing, communication protocols, and software-defined radio, generally operate on large or indefinite sequences of data items. To schedule and analyze the worst-case performance of streaming applications on a platform, a computation model is required. The expressivity of synchronous data flow (SDF) [Lee and Messerschmitt 1987] fits well with the features of streaming applications, and hence synchronous data flow graphs (SDFGs) are widely used in the literature [Stuijk et al. 2007; Geilen 2010] and also in this work. In an SDFG, tasks communicate through the FIFO allocated to each edge. The throughput of the SDFG is influenced by the size of each FIFO (i.e., the FIFO size distribution) [Stuijk et al. 2006a]. On multiprocessor systems-on-chips (MPSoCs) with a predictable memory hierarchy, memories on different levels differ in capacity and latency, so the FIFO size distribution and FIFO allocation have a great impact on system performance. In addition, the task allocation has a significant influence on task parallelism, and hence on the throughput, making it important to combine these aspects in the schedule. In this article, we investigate how to exploit an MPSoC with a predictable memory hierarchy that comprises SPMs and off-chip memory to optimize the schedule of streaming applications modeled by SDFGs such that the throughput is maximized.
Though there are many works about SDFG scheduling on MPSoCs [Sriram and Bhattacharyya 2012; Pino et al. 1995] and data allocation on multiple memory banks/modules [Leupers and Kotte 2001; Zhuge et al. 2002; Zhang et al. 2010 ], these works have not considered SDFG scheduling with FIFO sizing and allocation. Here, we propose an iteration-based task-FIFO co-scheduling (ITFCS) framework to schedule streaming applications on MPSoCs while taking into account the memory hierarchy. The novel contribution of this work is a method that solves SDFG scheduling on MPSoCs with a predictable memory hierarchy without assuming that FIFOs all fit in the local SPM. In addition, in our method, the instance collocation rule [Bilsen et al. 1994; Moreira and Corporaal 2014] that binds each task to only one processor is obeyed. This rule provides many advantages [Bilsen et al. 1994 ], such as simplifying data management, reducing memory consumption, and avoiding graph transformation. An earlier version of this work appeared in Tang et al. [2015] . The current article extends the scheduling method to construct more compact schedules while taking into account various resource conflicts; the FIFO allocation algorithm is simplified as well.
The proposed method is implemented in SDF3 [Stuijk et al. 2006b ] and is evaluated by a set of practical applications and synthesized SDFGs. The effectiveness of the proposed method is demonstrated by comparing the throughput generated by the proposed method and other algorithms, including the load balancing algorithm [Stuijk et al. 2007; Ambrose et al. 2013] , the highest access frequency first algorithm [Zhang et al. 2010] , the blocked schedule-based throughput analysis method [Tang et al. 2015] , and the MILP-based method [Damavandpeyma et al. 2011] .
Throughout the article, we use the following notations. $Z$, $Z^+$, and $Z^+_0$ denote the sets of integers, positive integers, and nonnegative integers, respectively. We use boldface capitals to denote vectors/sets and corresponding italic lowercase letters to denote elements in them. For a vector or set, we use | · | to denote the number of its elements.
The remainder of the article is organized as follows. In Section 2, we discuss the related work. The models and definitions are described in Section 3. In Section 4, we formalize the problem to be solved. The algorithm is elaborated in Section 5, and experimental results are presented in Section 6. We conclude in Section 7.
RELATED WORK
Scheduling of directed acyclic graphs (DAGs) on multiprocessors is extensively studied in Sriram and Bhattacharyya [2012] and Sinnen [2007] , which focus on scheduling without and with communication overhead, respectively. These works have established the foundation for SDFG scheduling. However, SDFGs [Lee and Messerschmitt 1987] differ significantly from DAGs in several aspects. For example, SDFGs support both cyclic dependencies between tasks and multirate dependencies. Therefore, SDFGs are more complex than DAGs, and the DAG scheduling method cannot be applied to SDFGs directly. In Sriram and Bhattacharyya [2012] , SDFGs are scheduled by converting them into equivalent homogeneous synchronous data flow graphs (HSDFGs) and further transforming the generated HSDFGs into acyclic precedence graphs (APGs) [Sriram and Bhattacharyya 2012] . Clustering is used in Pino et al. [1995] to reduce the scheduling complexity by clustering tasks in an SDFG into various groups, each of which is an indivisible scheduling element. After clustering the SDFG into a new consistent SDFG with smaller graph size, this new SDFG is transformed into an APG on which DAG scheduling algorithms are used. Because the SDFG is converted to an APG in the preceding methods, instances of the same task may be allocated to multiple processors in the schedule and therefore the instance collocation rule [Moreira and Corporaal 2014] is violated. Hence, these methods cannot be applied to the problem solved in this work. In Stuijk et al. [2007] and Ambrose et al. [2013] , a load balancing method is proposed to determine the task-to-processor assignment of the SDFG by balancing the computation load, communication bandwidth, and memory consumption. In these works, instances of the same task are bound to the same processor. However, the memory hierarchy is not taken into account.
Memory access is a key factor affecting system performance. Recently, memory allocation, including data and code allocation, has attracted great attention. Leupers and Kotte [2001] investigate allocating data to dual banks of a single processor, and Zhuge et al. [2002] extend it to multiple memory modules by proposing the variable independence graph to model the data parallelism more accurately. Zhang et al. [2010] investigate how to schedule tasks and partition data onto multiprocessors with virtually shared SPMs. The authors proposed High Access Frequency First (HAFF) to allocate variables to SPMs for a given schedule. We adapt HAFF to our problem and use it as a comparison in this work to justify the importance of FIFO allocation and the effectiveness of the proposed ITFCS framework. Che and Chatha [2010] and Choi et al. [2012] investigate how to cope with the SPM for SDFGs. Che and Chatha [2010] study how to use the SPM of a single-processor machine to overlay the task code of the application modeled by an SDFG such that the execution cost of the application is minimized. This work is extended by Choi et al. [2012] to multiprocessors and data overlay. Damavandpeyma et al. [2011] propose an MILP formulation for scheduling with code and data prefetch, assuming the data are moved between the off-chip memory and SPM. However, these works consider dynamic memory overlay, whereas our work combines static FIFO allocation and dynamic memory access together.
MODELS AND DEFINITIONS
This section introduces the platform model and the application model for specifying the MPSoC and the streaming application. Some relevant concepts and definitions about the SDFG are presented as well.
Platform Model
This work studies how to schedule streaming applications on MPSoCs with a predictable memory hierarchy, in which memory access is predictable, thus enabling worst-case performance analysis. We typically consider a predictable memory hierarchy that is composed of SPMs and off-chip memory that are controlled by the compiler or programmer. The platform model is abstracted from CompSoC [Goossens et al. 2013] and is defined as follows.
Definition 3.1 (Platform Model). An MPSoC with a predictable memory hierarchy is composed of a set of processors, memories, and the interconnect. We model it as a six-tuple: PM = (T, OCM, NIC, ini, d, s), where T is a finite set of tiles, OCM is the off-chip memory, and NIC is the interconnect. Each tile t ∈ T is a pair t = (p, spm), where p is the processor and spm is the SPM. We use $M = \{spm_0, spm_1, \ldots, spm_{|T|-1}, OCM\}$ to represent the shared memory that is available to all processors on the platform and $P = \{p_0, p_1, \ldots, p_{|T|-1}\}$ to represent the set of processors on the platform. ini and d are functions: ini(p, m) represents the setup time (in clock cycles) for one access of processor p to memory m, and d(p, m) represents the memory access latency (in clock cycles per word) of processor p for memory m. We use the linear model ini + d · n to represent the memory access time (in clock cycles), where n is the data size (in words). s is a function, $s : M \to Z^+$, representing the memory capacity (in words).
In the platform, each tile comprises an instruction memory (IMEM), a data memory (DMEM), an SPM, and a direct memory access controller (DMA). The IMEM and DMEM are used for hosting code and internal data. The DMA is used to move data between the local SPM and remote memories, including SPMs on other tiles and the off-chip memory [Goossens et al. 2013]. The DMA enables computation and memory access to proceed in parallel. The platform is a distributed shared memory system, where each processor can access the memories on the same tile directly and remote memories through the DMA. Therefore, both the SPMs and the off-chip DRAM in the platform are sharable memories. We call an access to the local SPM a local access, an access to an SPM on another tile a remote access, and an access to the off-chip memory an off-chip access. Generally speaking, the latencies of local access, remote access, and off-chip access differ by an order of magnitude. Figure 1 illustrates one example platform consisting of two tiles. We assume that the size of each SPM is 24 words; the setup time is zero; and the latencies of remote access and off-chip access are one and two clock cycles per word, respectively. The article uses this platform configuration to elaborate the example.
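As a small illustration of the linear access-time model ini + d · n from Definition 3.1, the following Python sketch evaluates it for the remote and off-chip parameters of the example platform. The class and function names are hypothetical and chosen only for this example; the local-access latency is not stated in the text and is therefore left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCost:
    """Cost parameters of one (processor, memory) pair (names are illustrative)."""
    setup: int     # ini(p, m): setup time in clock cycles per access
    per_word: int  # d(p, m): latency in clock cycles per word


def access_time(cost: AccessCost, n_words: int) -> int:
    """Linear model ini + d * n, in clock cycles, for n_words words."""
    if n_words < 0:
        raise ValueError("data size must be nonnegative")
    return cost.setup + cost.per_word * n_words


# Example platform from the text: zero setup time,
# 1 cycle/word for remote access, 2 cycles/word for off-chip access.
REMOTE = AccessCost(setup=0, per_word=1)
OFF_CHIP = AccessCost(setup=0, per_word=2)
```

With these parameters, moving the same block of words off chip takes twice as long as moving it to a remote SPM, which is why the placement of each FIFO matters for throughput.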
The interconnect architecture is a critical part of the system-on-chip, especially for data-intensive applications. The network-on-chip (NoC) [Benini and Micheli 2002] is a typical interconnect infrastructure. The NoC can be designed with various topologies, such as a 2D mesh network [Hu and Marculescu 2004; Goossens et al. 2013]. The NoC has higher bandwidth and better scalability than a traditional bus-based interconnect. Data transport in the NoC has two styles: connectionless or connection oriented [Stefan et al. 2014]. Connectionless packet-switching NoCs typically do not offer latency and bandwidth guarantees, providing only best-effort service, whereas connection-oriented, circuit-switching NoCs (e.g., dAElite [Stefan et al. 2014]) establish the route before data transportation, thus providing guaranteed bandwidth and latency. This article considers connection-oriented, circuit-switching NoCs such that the timing behavior of data transport can be accurately analyzed. To avoid communication contention, any two communication transactions whose routes partly or totally overlap must be mutually exclusive in time, meaning that only one communication transaction is allowed to transfer on conflicting routes at any time. Since the NoC uses a connection-oriented, circuit-switching mechanism, communications are guaranteed to be contention free. Since multiple connections may share the same link, accesses to the link must be arbitrated to avoid contention. We assume that the NoC uses a TDM scheme [Stefan et al. 2014] to allocate the link bandwidth to different connections, thus providing guaranteed bandwidth by assigning each connection fixed time slices in the TDM time wheel. The transfer delay on a given route depends on the data length and the number of hops of the route. If a link is shared by multiple connections, the delay also depends on the number of slices allocated to the connection.
Application Model
DAGs are widely used in the literature to model applications. Recently, data flow graphs like the SDFG and the scenario-aware data flow graph (SADFG) [Geilen 2010; Damavandpeyma et al. 2013] have gained great attention due to their powerful combination of expressivity and analyzability. This article uses SDFGs to model real-time streaming applications like software-defined radio and multimedia applications. SDFGs can capture application execution features (e.g., multirate execution) and also provide useful analytical properties (e.g., deadlock, repetition vector, and memory requirement analysis), making them more attractive than other computation models.
Definition 3.2 (SDFG). A synchronous data flow graph is a directed graph and is denoted by G = (V, E), where V is a finite set of nodes or vertices representing tasks or actors of an application and E is a finite set of directed edges denoting the communications between tasks. Each node v ∈ V is associated with a cost c(v) representing the worst-case execution time (number of clock cycles) needed to complete an execution of the task. Each edge e ∈ E is defined as a tuple (src, p, dst, q, iniTok, tokSiz) , where src is the source task, p is the production rate, dst is the destination task, q is the consumption rate, iniTok is the number of initial tokens on the edge, and tokSiz is the token size (in words). We use the notions src(e), p(e), and so on, to denote elements of each edge e. We also refer to edge e as the output edge of task src(e) and the input edge of task dst(e). When a task starts execution, it consumes q(e i ) tokens from each input edge e i and produces p(e o ) tokens to each output edge e o when it finishes execution.
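The firing rule in Definition 3.2 can be made concrete with a short Python sketch. The `Edge` class and the helper functions are hypothetical names introduced here for illustration; the two-task graph in the usage note is likewise a made-up example, not the SDFG of Figure 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    """Edge tuple (src, p, dst, q, iniTok, tokSiz) of Definition 3.2."""
    src: str        # source task
    p: int          # production rate
    dst: str        # destination task
    q: int          # consumption rate
    ini_tok: int    # number of initial tokens
    tok_siz: int = 1  # token size in words


def can_fire(task: str, edges: List[Edge], tokens: List[int]) -> bool:
    """A task may start when every input edge holds at least q(e) tokens."""
    return all(tokens[i] >= e.q for i, e in enumerate(edges) if e.dst == task)


def start_firing(task: str, edges: List[Edge], tokens: List[int]) -> None:
    """Consume q(e) tokens from each input edge when the task starts."""
    if not can_fire(task, edges, tokens):
        raise RuntimeError(f"task {task} is not enabled")
    for i, e in enumerate(edges):
        if e.dst == task:
            tokens[i] -= e.q


def finish_firing(task: str, edges: List[Edge], tokens: List[int]) -> None:
    """Produce p(e) tokens on each output edge when the task finishes."""
    for i, e in enumerate(edges):
        if e.src == task:
            tokens[i] += e.p
```

For a hypothetical edge from task a to task b with p = 2 and q = 3 and no initial tokens, firing a three times and b twice brings the token count back to zero, which is the behavior that the notion of an iteration relies on.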
Scheduling is the process of mapping tasks onto the platform, ordering the executions of tasks bound to the same processor, and determining task start/finish times. If communication latency and contention are considered, edge scheduling [Sinnen 2007] should also be taken into account. SDFGs are multirate computation models, so tasks may execute in different frequencies and fire different numbers of times in the schedule. The SDFG iteration and repetition vector can capture these features of the SDFG. The iteration is also related to the throughput. The throughput of an SDFG is defined as the number of iterations finished per unit of time, representing how fast the input data stream can be processed.
Definition 3.3 (SDFG Iteration). An SDFG iteration is defined as the process of executing each task the minimum positive number of times so that the token count on each edge returns to the initial value.
Definition 3.4 (Repetition Vector). The repetition vector R of an SDFG with n tasks numbered from 0 to n − 1 is a column vector of length n, and the k-th element of R equals the number of instances of task k in an iteration. For a task v, we refer to the number of its instances by R(v).
The repetition vector can be calculated by solving the balance equation [Lee and Messerschmitt 1987]. For an SDFG, if the balance equation has nontrivial solutions, then R exists and the SDFG is said to be consistent [Lee and Messerschmitt 1987]. An incorrectly constructed SDFG may deadlock while executing. Our work only considers SDFGs that are consistent and deadlock free. A consistent SDFG can always be converted to an equivalent HSDFG [Sriram and Bhattacharyya 2012], in which all rates are equal to one. However, this conversion can result in an exponential increase in graph size. The repetition vector provides information on how many times each task has to be executed in an iteration. Every time a task is started, we say that one instance of it is fired. R(v) is also referred to as the instance number of task v in an iteration.
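To illustrate how the balance equations R(src(e)) · p(e) = R(dst(e)) · q(e) determine the repetition vector, the following Python sketch propagates rational firing ratios over a connected graph and scales them to the smallest positive integers. It is a minimal sketch, not the procedure used in SDF3; the function name and the edge format (src, p, dst, q) are assumptions made for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def _lcm(a: int, b: int) -> int:
    return a * b // gcd(a, b)


def repetition_vector(tasks, edges):
    """Return {task: count} for a connected SDFG, or None if it is inconsistent.

    edges is a list of (src, p, dst, q) tuples (illustrative format).
    """
    ratio = {tasks[0]: Fraction(1)}  # firing ratio relative to the first task
    stack = [tasks[0]]
    while stack:
        v = stack.pop()
        for src, p, dst, q in edges:
            # Balance equation of the edge: R(src) * p == R(dst) * q.
            if src == v:
                other, value = dst, ratio[v] * p / q
            elif dst == v:
                other, value = src, ratio[v] * q / p
            else:
                continue
            if other not in ratio:
                ratio[other] = value
                stack.append(other)
            elif ratio[other] != value:
                return None  # only the trivial solution exists
    if len(ratio) != len(tasks):
        raise ValueError("the sketch assumes a connected graph")
    # Scale the ratios to the smallest positive integer solution.
    scale = reduce(_lcm, (r.denominator for r in ratio.values()), 1)
    counts = {v: int(r * scale) for v, r in ratio.items()}
    common = reduce(gcd, counts.values())
    return {v: c // common for v, c in counts.items()}
```

For a hypothetical chain a → b → c with rates (2, 3) and (1, 2), the sketch yields 3, 2, and 1 firings for a, b, and c: a produces 3 · 2 = 6 tokens and b consumes 2 · 3 = 6, while b produces 2 · 1 = 2 tokens and c consumes 1 · 2 = 2.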
While executing the SDFG, its edge needs to store the output data. Since the data stored by the edge is consumed by the destination task of the edge in a FIFO style, the edge is often modeled as a FIFO. The throughput of an SDFG depends on the FIFO size of each edge. Since the token size of each edge may differ, we use FIFO size expressed in words in this article. We also use the FIFO size distribution to denote the size of each FIFO. Different distributions vary in throughput, which can be captured by the FIFO-throughput Pareto space, each point of which records the distribution and its throughput.
Definition 3.5 (FIFO Size Distribution). The FIFO size distribution of an SDFG G = (V, E) is a function, $fs : E \to Z^+$, representing the size (in words) of the FIFO allocated to each edge. We also use fifo(e) and fs(e) to represent the FIFO and FIFO size of edge e ∈ E.
It should be noted that an application iteration also implies data access with respect to FIFOs of the input/output edges of each task. Since we use FIFOs for intertask communication, the number of FIFO writes/reads is the same as its source/destination task occurrence in an iteration for each FIFO.
An edge in the SDFG whose source task and destination task are the same is called a self-edge, representing a constraint on autoconcurrency. This article does not consider FIFO requirements of self-edges. However, it is straightforward to incorporate them in the method. To simplify the elaboration, in the remainder of this work, the SDFG has no self-edges unless stated explicitly. It should be kept in mind that autoconcurrency is not possible due to the instance collocation rule (i.e., each task is bound to one and only one processor), so all instances of a task are executed sequentially. Figure 2 shows an example of an SDFG, which is used to elaborate the problem solved in this article. We assume that each task in this figure has an execution time of three and that each edge has a token size of one. The repetition vector of the SDFG is $[6, 1, 2, 3, 13]^T$, meaning that in one iteration, tasks $a_0$, $a_1$, $a_2$, $a_3$, and $a_4$ have to execute 6, 1, 2, 3, and 13 times, respectively.
PROBLEM FORMALIZATION
This article investigates how to schedule an SDFG on an MPSoC while exploiting the predictable memory hierarchy. The problem is stated as follows: given a streaming application modeled by an SDFG and the platform configuration modeled according to Definition 3.1, find the FIFO size distribution, task allocation, and FIFO allocation such that the throughput is maximized. We consider static scheduling on the platform with predictable behavior. We assume the worst-case execution time for task execution, memory access, and communication, so scheduling anomalies are avoided.
In the preceding problem, the tasks in an SDFG interact by the use of FIFOs that are allocated to each edge. The FIFO size is not given a priori and should be determined by the algorithm. Since different memories differ in capacity and latency, the throughput not only depends on the FIFO size distribution but also closely relates to the FIFO allocation. In addition, the task assignment has an important impact on the task parallelism, making proper allocation a necessity. Therefore, all of the preceding aspects should be considered such that we can obtain an optimal throughput bound.
In this work, we make the following assumptions. We assume that the capacity of the off-chip memory is large enough such that there is enough memory to accommodate all FIFOs. Additionally, the IMEM and DMEM on each tile are assumed to be large enough to accommodate the code and stack/heap of all tasks bound to the processor on this tile. Moreover, for each tile, we assume that there is enough reserved space in the local SPM for one invocation of each task assigned to it. The reserved memory is not taken into account in the platform model and our algorithm.
ITFCS FRAMEWORK
This section introduces the ITFCS framework to produce near-optimal solutions for the problem stated in Section 4. Owing to the complexity of the problem, we adopt a decomposition methodology that divides the problem into several subproblems. The scheduling framework comprises four main steps and one iteration. Figure 3 outlines its components. Given an application modeled by an SDFG, the FIFO-throughput Pareto space is found first. Then, a FIFO size distribution is selected, and the FIFO allocation-aware task assignment (FAATA) algorithm is used to find the task-to-processor binding. Subsequently, we use the global FIFO allocation optimization (GFAO) algorithm to optimize the FIFO-to-memory assignment. Finally, a self-timed throughput analysis method is utilized to produce the final periodic static-order schedule (PSOS) based on the FIFO size distribution and FIFO/task mapping, and to evaluate the throughput.
The throughput analysis method is composed of four steps. First, it uses the memory access-aware synchronous data flow graph (MAASDFG) construction algorithm to model the memory access into the SDFG; second, the mapping-aware earliest task first (MAATF) scheduling algorithm is used to produce the PSOSs of the tasks and memory accesses; then, the task and memory schedule is modeled into the MAASDFG to enable throughput analysis; and finally, the self-timed state space search is used to compute the throughput of the system.
Since the exact relation between the FIFO size distribution and the throughput on a hardware platform is unclear, an iteration strategy is utilized to assess the performance of each FIFO size distribution using the proposed scheduling method. The iteration strategy can be used in different ways in practice. On the one hand, if the optimum is not required and a timing requirement is given, the iteration can be terminated once that requirement is met. On the other hand, it can be used to obtain the optimal FIFO size distribution by exhaustively evaluating the quality of each distribution. This exhaustive mode is also used in this article to show the advantage of considering multiple distributions in the scheduling process rather than only the one that is optimal in terms of ideal throughput. In the following sections, we elaborate each step of the scheduling framework.
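The two usage modes of the iteration strategy can be sketched in a few lines of Python. This is only an outline under stated assumptions: `evaluate` is a hypothetical callable that stands for the whole chain of task assignment, FIFO allocation, and throughput analysis for one FIFO size distribution, and the function name and argument layout are invented for this example.

```python
def select_distribution(pareto_points, evaluate, required_throughput=None):
    """Iterate over Pareto points and keep the best platform-level throughput.

    pareto_points: iterable of (distribution, ideal_throughput) pairs.
    evaluate: hypothetical callable mapping a distribution to the throughput
        obtained on the platform after mapping and analysis.
    required_throughput: if given, stop at the first distribution that meets it.
    """
    best = None
    for distribution, _ideal in pareto_points:
        achieved = evaluate(distribution)
        if best is None or achieved > best[1]:
            best = (distribution, achieved)
        if required_throughput is not None and achieved >= required_throughput:
            break  # early termination: the timing requirement is satisfied
    return best
```

Calling the function without `required_throughput` corresponds to the exhaustive mode, whereas passing a value corresponds to terminating once a given timing requirement is met.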
FIFO-Throughput Pareto Space Search
FIFO sizing is a critical step in implementing streaming applications on embedded systems. For one thing, the FIFO size distribution of an SDFG has significant impact on the timing behavior of the application (e.g., the throughput) [Stuijk et al. 2006a; Yang et al. 2011] , and the achievable throughput of the application is constrained by the size of each FIFO. For another thing, memories, especially high-speed memories, are expensive and power consuming, making it important to configure the platform with appropriate memory sizes and make full use of them while deploying the application onto the platform.
There are two usage patterns with respect to how FIFOs use the memory. One pattern shares the memory space among all FIFOs [Yang et al. 2011] , and the other allocates a specific memory space to each FIFO exclusively [Stuijk et al. 2006a] . We use nonshared FIFO allocation [Stuijk et al. 2006a] , in which the FIFO assigned to each edge exclusively occupies a memory block, which simplifies the FIFO management and complies with the concept of modular programming. Stuijk et al. [2006a] show that the application throughput is positively related to the FIFO size. However, the memory capacity and access latency are not taken into account, making this result inapplicable to our problem. For example, assuming that increasing the size of one FIFO can improve the throughput under the model in Stuijk et al. [2006a] , if this makes the FIFO too large to reside in the SPM, then it should be allocated to the off-chip memory, which would worsen the performance. Our experiments also demonstrate the preceding notion. Thus, rather than using the technique introduced in Stuijk et al. [2006a] to obtain the FIFO size distribution leading to maximum throughput without resource constraint, we use it to find all possible Pareto points (a 2-tuple comprising the FIFO size distribution and the corresponding throughput) and perform iterative analysis on them. Since each task is mapped to only one processor, autoconcurrency of any task is impossible while executing the application on the platform. To take it into account, each task in the SDFG is extended with a self-edge with one initial token to avoid autoconcurrency, as in Figure 2 , before searching the FIFO-throughput Pareto space so as to obtain more accurate FIFO size distributions. Since the technique introduced in Stuijk et al. [2006a] has pruned the Pareto space, the number of iterations needed to be carried out is reduced, thus improving the efficiency. Table I shows two FIFO size distributions of the SDFG in Figure 2 . 
As shown in the table, the total FIFO size (memory requirement) and the throughput of the first distribution are 7.32% and 20% larger than the second distribution, respectively, showing that the throughput increases with the total FIFO size. However, as shown later, this is not the case when considering platforms with resource constraints.
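To make the notion of a Pareto point concrete, the following Python sketch filters a list of (total FIFO size, throughput) pairs down to the points that are not dominated, that is, points for which no other point offers at least the same throughput with no more memory. It is not the search technique of Stuijk et al. [2006a], which explores the space far more efficiently; it only illustrates the dominance relation, and the function name is an assumption.

```python
def pareto_front(points):
    """Return the non-dominated (total_size, throughput) pairs, sorted by size.

    A point is kept only if its throughput is strictly higher than that of
    every point with a smaller or equal total FIFO size.
    """
    front = []
    best_throughput = float("-inf")
    # Sort by size ascending; for equal sizes, higher throughput comes first.
    for size, throughput in sorted(points, key=lambda pt: (pt[0], -pt[1])):
        if throughput > best_throughput:
            front.append((size, throughput))
            best_throughput = throughput
    return front
```

Along such a front the ideal throughput rises with the total FIFO size, which matches the observation for Table I; the platform-aware analysis is what may reverse this order.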
FIFO Allocation-Aware Task Assignment
Having obtained the FIFO size distribution using the technique introduced in the previous section, this section proposes the FAATA algorithm, as shown in Algorithm 1. FAATA generates the task-to-processor assignment, that is, it determines the processor to which each task is assigned. Rather than only exploiting computation parallelism as the load balancing method does, the potential FIFO allocation is also taken into account such that a good balance can be made between computation and memory access. The idea behind the algorithm is that a task should be allocated to the processor that keeps the load balanced while leading to more local memory access. The load on a processor is denoted by the normalized load, and the local memory access is denoted by the normalized estimated potential local communication cost. The estimated potential local communication cost is computed according to the recursive Equation (3) by a dynamic programming technique. To combine these two aspects, the localization coefficient is introduced.
Algorithm 1: FAATA
Input: [...] PM = (T, OCM, NIC, ini, d, s), and FIFO size distribution $fs : E \to Z^+$.
Output: task-to-processor assignment proc : V → P.
construct RASDFG;
compute task priority according to Equation (1);
sort the tasks in nonincreasing order of priority and obtain task list Q;
repeat
    a ← pop the first element in Q;
    for i = 0 to |P| − 1 do
        tentatively assign task a to processor $p_i$;
        compute the computation load $l(p_i)$ of $p_i$ according to Equation (2);
        [...]
Algorithm 1 first constructs a resource-aware synchronous data flow graph (RASDFG) [Yang et al. 2011] based on the application model and the FIFO size distribution generated by the FIFO-throughput Pareto space search method. The RASDFG is created by adding FIFO size constraining edges and self-edges to the original SDFG.
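The RASDFG construction can be sketched in Python as follows: every original edge is kept, a reverse edge is added whose initial tokens represent the free space of the FIFO, and every task receives a self-edge with one initial token. The `Edge` type and the function name are hypothetical, and the free space is computed as fs(e) − iniTok(e), which assumes that FIFO size and initial tokens are expressed in the same unit (as in the running example, where the token size is one).

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """Edge tuple (src, p, dst, q, iniTok, tokSiz); names are illustrative."""
    src: str
    p: int
    dst: str
    q: int
    ini_tok: int
    tok_siz: int


def build_rasdfg(tasks: List[str], edges: List[Edge], fs: List[int]) -> List[Edge]:
    """Return the edge list of the RASDFG; fs[i] is the FIFO size of edges[i]."""
    result = list(edges)
    for e, size in zip(edges, fs):
        free = size - e.ini_tok  # free FIFO space, assuming compatible units
        if free < 0:
            raise ValueError("FIFO too small for the initial tokens")
        # Reverse edge: the consumer releases space, the producer claims it.
        result.append(Edge(e.dst, e.q, e.src, e.p, free, e.tok_siz))
    for v in tasks:
        # Self-edge with one initial token forbids autoconcurrency.
        result.append(Edge(v, 1, v, 1, 1, 1))
    return result
```

Because each original edge is paired with a reverse edge, any two tasks connected in the original graph end up on a common cycle, which is what bounds the number of tokens that can accumulate in a FIFO.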
For each edge e ∈ E of the original SDFG, we add a FIFO size constraining edge (dst(e), q(e), src(e), p(e), fs(e) − iniTok(e), tokSiz(e)) to E, and we add a self-edge (v, 1, v, 1, 1, 1) for each task v ∈ V. The generated RASDFG corresponding to Figure 2 is shown in Figure 4 for FIFO size distribution 1 in Table I. The generated RASDFG is strongly connected. Since the throughput of an SDFG is limited by the critical cycle of its corresponding HSDFG [Sriram and Bhattacharyya 2012; Stuijk et al. 2007], we use the estimated maximum cycle mean (MCM), as in Stuijk et al. [2007], to compute the task priority. The estimated MCM of task v is computed by Equation (1),
where $C_v$ represents the set of cycles that include task $v$, and $V_c$ and $E_c$ are the sets of tasks and edges of a cycle $c \in C_v$, respectively. For example, the priorities of tasks $a_0$, $a_1$, $a_2$, $a_3$, and $a_4$ of the RASDFG in Figure 4 are 18, 42, 45, 15, and 45, respectively. Task assignment is performed in nonincreasing order of task priority. Each time, the first task is popped from the ordered task list and the best processor for it is determined. A composite metric is introduced to evaluate the attractiveness of each processor by combining computation and memory access. The metric is composed of the normalized computation load and the normalized estimated potential local communication cost. The computation load is computed by Equation (2),
where $V_i$ represents the set of tasks that are tentatively assigned to processor $p_i$. The estimated potential local communication cost $eplcc(p_i)$ of processor $p_i$ is calculated according to the recursive Equation (3) by a dynamic programming technique. The value of $eplcc(p_i)$ represents the maximum achievable intraprocessor communication and indicates the priority of mapping the task to processor $p_i$. For processor $p_i$, we use $V_i$ to denote the set of tasks that are tentatively assigned to it and $E_i$ to denote the set of edges between these tasks. The edges are indexed from 1 to $|E_i|$. We use the cost matrix $F$ in Equation (3) to store the intermediate results while computing the maximum local communication cost. Each element $F(i, j)$ of the cost matrix represents the maximum cost when trying to assign the first $i$ FIFOs to a local SPM of size $j$. An element of $F$ is initialized to zero when the FIFO number or the SPM size is zero. The remaining elements of $F$ are computed recursively. If the FIFO size $fs(e_i)$ of edge $e_i$ is larger than the current SPM size $j$, then this FIFO cannot reside in the SPM and thus the cost is the same as $F(i-1, j)$. Otherwise, the cost is the maximum of two costs. The first is $F(i-1, j)$, which means that the FIFO of edge $e_i$ is not assigned to the SPM. The second is the sum of $F(i-1, j - fs(e_i))$ and the access cost of the FIFO, meaning that the FIFO of edge $e_i$ is assigned to the SPM. The access cost of a FIFO is the total cost of all access operations on this FIFO. Since the access operations triggered by the source and destination tasks are the same and memory reading and writing are assumed to have the same access latency, only the operations triggered by the source task are counted.
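The recursion described above has the structure of a 0/1 knapsack problem, and a minimal Python sketch of it is given here. The function name and argument layout are assumptions; the access cost of each FIFO is supplied by the caller, since its exact expression is not restated in this sketch.

```python
def eplcc(fifo_sizes, access_costs, spm_size):
    """Maximum total access cost of FIFOs that fit together in one SPM.

    fifo_sizes[k] and access_costs[k] describe the FIFO of the (k+1)-th edge
    between tasks tentatively mapped to the processor; sizes are in words.
    """
    n = len(fifo_sizes)
    # F[i][j]: best cost using the first i FIFOs and an SPM of size j.
    # Row 0 and column 0 stay zero (no FIFO or no SPM space).
    F = [[0] * (spm_size + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, cost = fifo_sizes[i - 1], access_costs[i - 1]
        for j in range(1, spm_size + 1):
            F[i][j] = F[i - 1][j]  # FIFO i is not placed in the SPM
            if size <= j:          # FIFO i fits: try placing it
                F[i][j] = max(F[i][j], F[i - 1][j - size] + cost)
    return F[n][spm_size]
```

As a made-up example, with FIFO sizes 10, 16, and 12 words, access costs 30, 40, and 36, and a 24-word SPM, the best choice is to keep the first and third FIFOs local (22 words, cost 66), since the 16-word FIFO excludes both of the others.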
(3) After obtaining the computation load and estimated potential local communication cost, a metric value for each processor p i is computed by Equation (4), in which c ∈ [0, 1] is the localization coefficient used to weight the computation and local communication. The value of the localization coefficient depends on the application and platform. If the memory access has more importance on the performance (e.g., the application is data intensive or the SPM size is limited), then more weight should be put on the FIFO allocation. In this work, the value of the coefficient is obtained by experiment. For Equation (4), we assume that
= 1 when max p j ∈P eplcc( p j ) = 0 (which also implies eplcc( p j ) = 0 for all p j ∈ P). The first part of Equation (4) denotes the load on the processor, and the second part denotes the local communication on this processor. Using the coefficient c, the computation and communication is combined.
The FAATA algorithm is composed of two loops: one enumerates all tasks and the other tentatively maps the task to each processor. There are therefore |P||V| iterations. The inner loop is dominated by the computation of eplcc( p i ), which is performed by dynamic programming, so its worst-case complexity is O(S fifo S spm ), where S fifo is the total size of all FIFOs and S spm is the maximum size of the SPMs. Therefore, the total complexity is O(|P||V|S fifo S spm ).
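The recursion behind eplcc( p i ) has the shape of a 0/1 knapsack problem, and a minimal sketch of it is given below. The function name `eplcc`, the `(size, access_cost)` tuple layout, and the example numbers are assumptions made for illustration; they are not taken from the original implementation.

```python
from typing import List, Tuple


def eplcc(fifos: List[Tuple[int, int]], spm_size: int) -> int:
    """Maximum local communication cost that fits into an SPM of spm_size.

    fifos holds one (size, access_cost) pair per edge between the tasks
    that are tentatively assigned to the processor (hypothetical layout).
    """
    n = len(fifos)
    # F[i][j]: maximum cost when trying to place the first i FIFOs in an
    # SPM of size j. Row 0 and column 0 stay zero (initialization).
    F = [[0] * (spm_size + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, cost = fifos[i - 1]
        for j in range(1, spm_size + 1):
            if size > j:
                # The FIFO cannot reside in an SPM of size j.
                F[i][j] = F[i - 1][j]
            else:
                # Either leave the FIFO out, or place it and add its cost.
                F[i][j] = max(F[i - 1][j], F[i - 1][j - size] + cost)
    return F[n][spm_size]


# Illustrative values only: three FIFOs and an SPM with capacity 8.
example_fifos = [(4, 30), (3, 20), (5, 45)]
print(eplcc(example_fifos, 8))
```

In this example the FIFOs of size 3 and 5 fill the SPM exactly, which gives a larger total cost than any other feasible subset.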
For the SDFG in Figure 2, the throughput values obtained by load balancing task allocation and by our method are shown in Table II. Although distribution 1 has a higher throughput than distribution 2 in the ideal analysis, as shown in Table I, the latter performs better on our example platform, with the throughput increasing by 4.71% and 6.45% when using load balancing and our algorithm, respectively. This shows that a FIFO distribution that is good in the ideal analysis does not necessarily perform well on a practical platform, which demonstrates the need to search the FIFO-throughput Pareto space. The awareness of FIFO allocation, which is captured by our method, also improves performance, with the throughput increasing by 34.85% under the former distribution and 37.10% under the latter. The task allocations under the second distribution for load balancing and for our method are shown in bold in Table III. When the load balancing method is used, a 2 , a 1 , a 0 , and a 3 are allocated to p 0 and a 4 is allocated to p 1 . However, if the FAATA algorithm with a localization coefficient of 0.15 is used, the task allocation changes, with a 2 and a 4 being allocated to p 0 and a 1 , a 0 , and a 3 being allocated to p 1 . Because the ordered task list is a 2 , a 4 , a 1 , a 0 , a 3 , task a 2 is assigned first, and processor p 0 is selected arbitrarily since both processors are the same. Then, a 4 is to be allocated. When using load balancing, a 4 is allocated to p 1 to balance the computation load. However, with the method proposed in this article, a 4 is still allocated to p 0 since there is an edge (i.e., e 4 ) between a 2 and a 4 , as shown in Figure 2. The existence of e 4 , as shown in Table III, changes the attractiveness of p 0 defined by the metric of Equation (4) from 1.00 to 0.85, whereas that of p 1 changes from 0.87 to 0.89, so a 4 is allocated to p 0 .
This new allocation, as shown in Table II , increases the throughput from 0.01176 to 0.01613, with an improvement of 37.10%.
In Figure 5, both schedules are illustrated by Gantt charts. The executions of different tasks are represented by rectangles of different colors, and each remote memory access is represented by a rectangle of the same color as the task that triggers it (however, the memory access may correspond to different FIFOs in these two subgraphs). Using self-timed scheduling, both executions enter a periodic state after a sequence of intermediate executions. As shown in Figure 5(a), the execution enters the periodic state at time 153 and encounters the same state at time 238, so the throughput is 1/85. Similarly, in Figure 5(b), the execution revisits the state of time 65 at time 127, so the throughput is 1/62. In Figure 5(a), fifo(e 0 ) is assigned to DRAM, fifo(e 1 ) is assigned to spm 0 , and the others are assigned to spm 1 . For this allocation, the overhead of transferring the output data of a 1 and a 2 is quite large, which decreases the throughput. In Figure 5(b), fifo(e 1 ) and fifo(e 4 ) are assigned to spm 0 , and the others are assigned to spm 1 . For this allocation, task a 2 and task a 4 are allocated to the same processor and fifo(e 4 ) is assigned to the local SPM, so the overhead of transferring the output data of a 2 is reduced and the performance is better.
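The throughput figures quoted for the two schedules follow directly from the period of the recurrent state. The short calculation below reproduces them; the helper name `periodic_throughput` is an assumption introduced for this illustration, and the calculation assumes that one application iteration completes per period, as the quoted values imply.

```python
from fractions import Fraction


def periodic_throughput(first_visit: int, revisit: int) -> Fraction:
    """Throughput of a self-timed execution whose state at time first_visit
    is encountered again at time revisit (one iteration per period assumed)."""
    return Fraction(1, revisit - first_visit)


# Time stamps taken from the description of Figure 5(a) and Figure 5(b).
load_balancing = periodic_throughput(153, 238)
faata = periodic_throughput(65, 127)

# Relative throughput improvement of the FAATA allocation, in percent.
improvement = float(faata / load_balancing - 1) * 100

print(load_balancing, round(float(load_balancing), 5))
print(faata, round(float(faata), 5))
print(round(improvement, 2))
```

The periods of 85 and 62 time units correspond to the throughput values 0.01176 and 0.01613 and to the improvement of about 37.10% reported for this example.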
Global FIFO Allocation Optimization
Having obtained the FIFO size distribution and task assignment, the FIFO allocation needs to be determined. Since FAATA can only determine the allocation of part of the FIFOs, the GFAO algorithm is proposed to allocate all FIFOs globally. This algorithm is based on integer linear programming (ILP), and it aims at minimizing the total memory access cost. In the ILP formulation, local, remote, and off-chip memory accesses are all taken into account explicitly. In addition, the memory access cost used in this section differs from that of Algorithm 1 in that it considers memory access latency. The computation of the FIFO access cost is given in Definition 5.1.
Definition 5.1 (FIFO Access Cost). The FIFO access cost fac(i, j) is the summation of the access costs of the source task and destination task of edge e i to FIFO fifo(e i ) when it is allocated to memory m j . The value depends on where fifo(e i ), the source task, and the destination task of edge e i (i.e., src(e i ) and dst(e i )) are allocated. If fifo(e i ), src(e i ), and dst(e i ) are allocated to the same tile, then fac(i, j) = 0. If fifo(e i ) is allocated to the off-chip memory or a tile different from both tiles where src(e i ) and dst(e i ) are assigned, then the cost is computed by Equation (5). If fifo(e i ) and src(e i ) are assigned to the same tile, whereas dst(e i ) is assigned to another tile, then the cost is computed by Equation (6). If fifo(e i ) and dst(e i ) are assigned to the same tile, whereas src(e i ) is assigned to another tile, then the cost is computed by Equation (7). In these equations, ini is the number of clock cycles needed to initialize the memory access.
Since the FIFO allocation problem is a combinatorial optimization problem, it is formulated as an ILP model that can be solved by an available solver.
In the ILP model, the binary allocation variable m i, j is used to denote where FIFO i is allocated. If m i, j equals one, then FIFO i is allocated to memory j; otherwise, it is allocated to another memory.
Since each FIFO should be allocated to only one memory, the following constraint should be satisfied:
The memory capacity constraints should be obeyed so that the memory can accommodate all FIFOs allocated to it:
where f s(i) is the size of FIFO i and s(m j ) is the capacity of memory m j . Since the aim is to minimize the total memory access cost, the variable C is introduced to represent the sum of the memory access cost of each FIFO, so the following constraint should be satisfied:
Thus, the objective can be represented by the following equation:
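The ILP model described in this section can be summarized compactly as follows. This is a sketch reconstructed from the prose and not a verbatim copy of the original equations: the index sets \mathcal{F} (FIFOs) and \mathcal{M} (memories) are notation assumed here, and the cost constraint is written as an equality.

```latex
\begin{align*}
  & m_{i,j} \in \{0, 1\}
      && \forall i \in \mathcal{F},\ \forall j \in \mathcal{M} \\
  & \sum_{j \in \mathcal{M}} m_{i,j} = 1
      && \forall i \in \mathcal{F}
      && \text{(each FIFO in exactly one memory)} \\
  & \sum_{i \in \mathcal{F}} fs(i)\, m_{i,j} \le s(m_j)
      && \forall j \in \mathcal{M}
      && \text{(memory capacity)} \\
  & C = \sum_{i \in \mathcal{F}} \sum_{j \in \mathcal{M}} fac(i,j)\, m_{i,j}
      &&
      && \text{(total memory access cost)} \\
  & \text{minimize } C
\end{align*}
```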
To improve the efficiency of the ILP solver, we use a greedy allocation algorithm to produce an initial solution. The ILP solver can use the initial solution to prune the solution space more efficiently, thus improving performance. The greedy allocation algorithm is composed of two steps: FIFO priority assignment and FIFO allocation. In the algorithm, the average cost-size ratio of the FIFO is used as the FIFO priority, which is defined as follows:
where f s(i) and pri(i) are the size and the priority of FIFO i, the numerator is the sum of the memory access costs of FIFO i over all memories, and the denominator is the product of the FIFO size and the number of memories. The FIFO size in the denominator yields the cost-size ratio, and the number of memories averages it. Having obtained the FIFO priorities, the next step is to allocate the FIFOs to memories. The FIFOs are allocated in nonincreasing order of priority. The currently selected FIFO, which has the highest priority among the nonallocated FIFOs, is allocated to the memory that has sufficient remaining space to accommodate it and yields the minimum cost.
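The two steps of the greedy allocation, priority assignment and allocation, can be sketched as follows. The function name `greedy_allocate`, the list-based data layout, and the example values are assumptions made for illustration; the behavior when no memory has room for a FIFO is not specified in the text and is handled here by raising an error.

```python
from typing import Dict, List


def greedy_allocate(fs: List[int], cap: List[int],
                    fac: List[List[int]]) -> Dict[int, int]:
    """Return {FIFO index: memory index}.

    fs[i]     : size of FIFO i
    cap[j]    : capacity of memory j
    fac[i][j] : access cost of FIFO i when it is placed in memory j
    """
    n_mem = len(cap)
    # Step 1: priority = average cost-size ratio of each FIFO.
    pri = [sum(fac[i]) / (fs[i] * n_mem) for i in range(len(fs))]
    order = sorted(range(len(fs)), key=lambda i: pri[i], reverse=True)

    # Step 2: place FIFOs in nonincreasing order of priority.
    remaining = list(cap)
    alloc: Dict[int, int] = {}
    for i in order:
        feasible = [j for j in range(n_mem) if remaining[j] >= fs[i]]
        if not feasible:
            # Assumption: treat an unplaceable FIFO as an error.
            raise ValueError(f"no memory can hold FIFO {i}")
        best = min(feasible, key=lambda j: fac[i][j])
        alloc[i] = best
        remaining[best] -= fs[i]
    return alloc


# Illustrative instance: memory 0 is a small SPM, memory 1 is large.
alloc = greedy_allocate(fs=[4, 2, 6], cap=[6, 100],
                        fac=[[0, 80], [0, 30], [10, 90]])
print(alloc)
```

In this instance FIFO 0 has the highest priority and takes the SPM first, FIFO 2 no longer fits there and goes to the large memory, and FIFO 1 fills the space left in the SPM.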
To further constrain the computation time, a timeout can be set for the ILP solver. When the timeout occurs, the ILP solver stops even if the optimal solution has not been found; in that case, it returns the best solution found during the elapsed time interval.
Throughput Analysis
Since the objective is to optimize the application throughput, it is necessary to construct the schedule and analyze the throughput of the application. This section elaborates the methodology for constructing the schedule and analyzing the throughput given the FIFO distribution and the task/FIFO allocation strategy produced by algorithms introduced in previous sections.
Given a task and FIFO allocation strategy, it is still hard to find the schedule that is optimal in terms of throughput. It should be noted that the throughput differs with the execution style. One execution style is the so-called blocked schedule [Sriram and Bhattacharyya 2012], in which a block is composed of one or several iterations. The iteration number in a block is called the blocking factor. In the blocked schedule, different blocks cannot overlap. In a block, different iterations may interleave rather than overlap, hence improving performance. Using the preceding execution style, it is important to find the optimal blocking factor to obtain good performance. However, as the blocking factor increases, the overhead for controlling the system increases. In Tang et al. [2015], the throughput is calculated using the blocked schedule with a blocking factor of one, which prohibits pipelined execution and thus underestimates the throughput. As an alternative, we assume that the system executes according to a PSOS [Damavandpeyma et al. 2013] that forces the tasks to fire one by one in a given order and makes the system execute according to the self-timed schedule. In this kind of execution, different iterations may overlap, proceeding in a pipelined style. The throughput of the self-timed system can be analyzed using the state space-based method [Ghamarian et al. 2006] or via max-plus algebra [Geilen 2010]. Since the self-timed schedule finally enters the periodic state after the transient state, the system could run in a time-driven style to guarantee the performance.
The state space-based throughput analysis method of SDFGs without resource constraints is proposed in Ghamarian et al. [2006] . In this method, the token distribution and task execution state specify the state. Since the SDFG is forced to be strongly connected and the range of each element of the state is finite, the state space is finite. After a finite number of transitions, some states will be revisited, hence forming a cycle. Having obtained the cycle in the state space, the throughput can be calculated directly. Another method for computing the throughput is to model the self-timed execution of the SDFG in max-plus algebra [Geilen 2010 ]. The eigenvalue of the max-plus matrix equals the reciprocal of the throughput. The preceding methods do not take the mapping and schedule into account, making the results inaccurate. It is possible to model the mapping and PSOS in the SDFG [Damavandpeyma et al. 2013] or the interprocessor communication graph (IPCG) [Bambha et al. 2002] . After embedding the schedule into the application model, the self-timed execution of the new model complies with the schedule exactly while still obeying the execution specification of the application. Therefore, by applying the methods introduced in Ghamarian et al. [2006] and Geilen [2010] on the schedule-aware application model [Damavandpeyma et al. 2013; Bambha et al. 2002] , the application throughput with resource constraints can be analyzed. However, the PSOS and the method for modeling the PSOS into the application model in Damavandpeyma et al. [2013] and Bambha et al. [2002] have only considered processing resource contention on each processor. Although Bambha et al. [2002] modeled interprocessor communication into the model, other kinds of resource contention have not been considered. Owing to the features of the platform, other resource contentions, including buffer size constraints, memory access contention, and DMA contention, should also be taken into account. 
To capture these aspects, we first introduce the MAASDFG that models the remote memory access operations; then, the algorithm for constructing the contention-aware PSOS is introduced; and finally, we extend the technique introduced in Damavandpeyma et al. [2013] and model the contention-aware PSOS into the MAASDFG, thus enabling throughput analysis.
Since we assume that only remote memory access consumes resources, including the DMA and the memory port, local memory access is ignored for simplicity. Although link contention is also possible in the NoC, we assume that transportation on the NoC is contention free (e.g., by using the TDMA arbitration policy to assign each connection a fixed bandwidth), thus avoiding communication contention in the NoC. Whereas Poplavko et al. [2003] assumed that it takes extra time to copy the data from/to the input/output buffer, we assume that the processor can access all local memory directly without incurring any delay, as the CompSoC platform does. We model each remote memory access (writing and reading) as a task and reconstruct the precedences of the new SDFG, thus producing an MAASDFG. While constructing the memory access-aware SDFG, we start from each edge of the original SDFG, with the exception of self-edges. For each edge, based on the locations of the source task, destination task, and the FIFO, the edge is reconstructed by adding extra memory access tasks and additional precedence edges. Algorithm 2 illustrates the procedure for constructing the MAASDFG. First, the computation tasks are added to the model, as the first for-loop shows; then, for each edge, as illustrated by the second for-loop, extra memory access tasks and precedence edges are added; and finally, self-edges are added in the last for-loop to avoid self-concurrency.
ALGORITHM 2: Memory Access-Aware SDFG Construction Algorithm
Input: application model, platform model, task, and FIFO allocation.

a, with the production rate and consumption rate being 2 and 3, respectively, and the initial token number being 2. We use f s to represent the FIFO capacity assigned to this edge, and we use ifs and ofs to denote the local input and output buffer size. The local input/output buffer is used when the read or produced data of a task has to be transferred from/to the remote FIFO. The input buffer size should be larger than the consumption rate of the edge, and the output buffer size should be larger than the production rate. Although a larger output buffer can make the task execute freely without waiting for buffer space to write output data, a smaller output buffer may serialize the execution and data transportation. A similar case also applies to the input buffer. For a task, we say that a FIFO is local for it if the FIFO is allocated to the memory that is hosted on the same tile as the processor onto which the task is mapped. Figure 6(a) shows how to reconstruct the edge when the FIFO is local to both the source task and destination task. Since the FIFO is allocated to the local memory, both tasks a and b can access the memory directly free of delay; therefore, only a reverse edge directing from b to a needs to be added to constrain the FIFO capacity. The initial token number of the reverse edge is the same as the FIFO capacity, and the rates are just the opposite of the original edge. Figure 6(b) illustrates the case when the FIFO is remote to both tasks. In this case, the output of task a should be moved from the local output buffer to the FIFO located in the remote memory by the use of the DMA, and task b has to move the data from the remote FIFO to the local input buffer to execute. We assume that the local buffer is preassigned and that the size of it is known a priori.
We add tasks w and r to represent the preceding remote FIFO accesses. We assume that the system uses a coarse-grain memory access pattern; that is, for an edge, the reading and writing are in terms of consumption rate and production rate, respectively. Hence, w and r are fired the same number of times as a and b in long-term execution. Since the local output buffer of a is limited in size, the task cannot fire freely when w has not moved the data to the remote FIFO. To capture this constraint, in Figure 6(b), an extra edge is added between w and a, with the initial token number being the size of the local output buffer. The rates of the edge between w and a are the same as the production rate of the original edge (i.e., 2) to make w and a have the same firing frequency. The relation between a and b in the original edge is moved to w and r; therefore, the firings of w and r are constrained by the FIFO capacity (fs), and the FIFO access pattern remains the same. Figure 6(c) shows the case when the FIFO is local to task b but remote to task a. In this case, task b can access the FIFO directly; however, the output data of task a needs to be transferred to the FIFO from the local output buffer by the use of the DMA. As shown in the figure, the local output buffer is limited in size, as captured by the reverse edge from w to a. To constrain the FIFO access pattern, the edge between w and b is the same as the original edge. Finally, Figure 6(d) depicts the case when the FIFO is remote to the reading side and local to the writing side.
src(e) ≠ dst(e) do
    if islocal(proc(src(e)), mem(fifo(e))) && islocal(proc(dst(e)), mem(fifo(e))) then
        add edge (src(e), p(e), dst(e), q(e), iniTok(e)) to E maa ;
        add edge (dst(e), q(e), src(e), p(e), f s(e)) to E maa ;
    else if !islocal(proc(src(e)), mem(fifo(e))) && !islocal(proc(dst(e)), mem(fifo(e))) then
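The four edge reconstruction cases described for Figure 6 can be sketched in code as follows. This is an illustrative reading of the prose rather than the original Algorithm 2. The names `Channel`, `reconstruct`, and the generated task names are hypothetical. The reader-side edges and the zero initial tokens on the forward buffer edges are assumptions made by symmetry with the writer side, and the example buffer sizes are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (source, production rate, destination, consumption rate, initial tokens)
Edge = Tuple[str, int, str, int, int]


@dataclass
class Channel:
    src: str
    dst: str
    p: int        # production rate of the original edge
    q: int        # consumption rate of the original edge
    ini_tok: int  # initial tokens of the original edge
    fs: int       # FIFO capacity
    ofs: int      # local output buffer size at the source
    ifs: int      # local input buffer size at the destination


def reconstruct(ch: Channel, src_local: bool,
                dst_local: bool) -> Tuple[List[str], List[Edge]]:
    """Return the extra memory access tasks and the edges replacing ch."""
    tasks: List[str] = []
    edges: List[Edge] = []
    writer, reader = ch.src, ch.dst
    if not src_local:
        # Remote write: task w moves data from the output buffer to the FIFO.
        writer = f"w_{ch.src}_{ch.dst}"
        tasks.append(writer)
        edges.append((ch.src, ch.p, writer, ch.p, 0))       # assumed forward edge
        edges.append((writer, ch.p, ch.src, ch.p, ch.ofs))  # output buffer limit
    if not dst_local:
        # Remote read: task r moves data from the FIFO to the input buffer.
        reader = f"r_{ch.src}_{ch.dst}"
        tasks.append(reader)
        edges.append((reader, ch.q, ch.dst, ch.q, 0))       # assumed forward edge
        edges.append((ch.dst, ch.q, reader, ch.q, ch.ifs))  # input buffer limit
    # The original producer/consumer relation moves to writer and reader,
    # and a reverse edge bounds the FIFO capacity.
    edges.append((writer, ch.p, reader, ch.q, ch.ini_tok))
    edges.append((reader, ch.q, writer, ch.p, ch.fs))
    return tasks, edges


# Rates and initial tokens follow the example edge; fs, ofs, ifs are invented.
ch = Channel("a", "b", p=2, q=3, ini_tok=2, fs=12, ofs=4, ifs=6)
local_tasks, local_edges = reconstruct(ch, src_local=True, dst_local=True)
remote_tasks, remote_edges = reconstruct(ch, src_local=False, dst_local=False)
print(local_edges)
print(remote_tasks, len(remote_edges))
```

When the FIFO is local to both tasks, the sketch yields exactly the two edges added in the first branch of the algorithm fragment above. The mixed cases of Figure 6(c) and Figure 6(d) are obtained by setting only one of the two flags to False.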
Using the method introduced earlier, the MAASDFG can be constructed. However, the self-timed execution of the MAASDFG does not take into account resource contention. To solve this problem, we construct the contention-aware PSOS and model it into the MAASDFG.
To construct the contention-aware PSOS, three kinds of resources have to be arbitrated. The first is the processor, for which tasks compete; this arbitration is extensively investigated in Damavandpeyma et al. [2013]. To take into account memory contention and DMA contention, the self-timed execution of the MAASDFG must detect not only processor contention but also memory access contention and DMA contention. Memory access contention is also called port contention. A memory can support parallel accesses if it has more than one port; otherwise, the operations must be serialized. We assume that each memory provides only one port for remote access, so only one remote transaction can be fired on a given memory at a time. The SPM is configured with two ports, one for local access and one for remote access, so it supports two parallel accesses. However, any remote access to it occupies its external port exclusively; hence, all remote operations on it must also be serialized. For each processor, reading or writing a remote memory is performed by the DMA device on the same tile. Therefore, all remote memory accesses initiated by each processor must be serialized to avoid DMA contention.
We use the MAETF scheduling algorithm (Algorithm 3) to construct the contention-aware PSOS. MAETF takes into account the FIFO size constraints, memory access latency, and port contention when accessing the memory, and it yields both the task ordering on each processor and the memory access ordering on each memory and the DMA. MAETF is a list-based algorithm. As shown in Algorithm 3, first, the SDFG is transformed to an APG using the technique introduced in Sriram and Bhattacharyya [2012]; second, the tasks in the APG are sorted in nonincreasing order of the bottom level, which is the length of the longest path starting from the task; and finally, the tasks are scheduled one by one while taking into account their mapping information. To avoid resource contention, contention is arbitrated while the tasks in the APG are scheduled. For the task to be scheduled, all time intervals on the resource on which it depends that are already occupied by previously scheduled tasks are found, and the new task is scheduled into the earliest unoccupied time interval using an insertion strategy [Sinnen 2007]. By the preceding process, the ordering of computation tasks and memory accesses on each resource (the contention-aware PSOS) is obtained. The contention-aware PSOS can be modeled into the MAASDFG using the technique introduced in Damavandpeyma et al. [2013]. During the modeling process, the three kinds of resource contention captured by the contention-aware PSOS (i.e., processor contention, DMA contention, and memory port contention) should be considered. The tasks in the MAASDFG are categorized into three kinds based on the resource they use: the computation task, the DMA task, and the memory task. The computation task uses a processor to process data, the DMA task uses the DMA to access data in remote memory, and the memory task occupies one port of the remote memory to access data. Each task in the MAASDFG is associated with the resources it uses.
We use Res to denote the set of resources in the platform, including processors, the DMAs, and memories. Thus, the decision state in Damavandpeyma et al. [2013] is a state when multiple tasks use the same resource in Res. After obtaining the decision state, each opponent actor set that uses the same resource is detected. According to the contention-aware PSOS, for each opponent actor set, the actor of choice is picked out, then extra edges are added to the model between the actor of choice and other tasks in the decision state to make the self-timed execution comply with the PSOS.
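The insertion strategy that MAETF applies when placing a task on a resource can be sketched for a single resource as follows. The function names `earliest_slot` and `reserve` and the interval representation are assumptions made for illustration; a task that needs several resources at once (for example, a DMA and a memory port) would require a slot that is free on all of them, which this sketch does not cover.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open busy interval [start, end)


def earliest_slot(busy: List[Interval], ready: int, duration: int) -> int:
    """Earliest start time >= ready at which the resource is free for
    duration time units. busy must be sorted and non-overlapping."""
    start = ready
    for b_start, b_end in busy:
        if start + duration <= b_start:
            break  # the task fits into the gap before this busy interval
        # Otherwise the task cannot finish before this interval begins,
        # so it can start no earlier than the end of the interval.
        start = max(start, b_end)
    return start


def reserve(busy: List[Interval], ready: int, duration: int) -> int:
    """Schedule a task on the resource and record its interval."""
    start = earliest_slot(busy, ready, duration)
    busy.append((start, start + duration))
    busy.sort()
    return start


# Illustrative resource timeline with a gap of five time units.
timeline = [(0, 10), (15, 30)]
print(earliest_slot(timeline, ready=5, duration=5))
print(earliest_slot(timeline, ready=5, duration=6))
```

A task of duration 5 is inserted into the gap between the two busy intervals, whereas a task of duration 6 does not fit and is appended after the last interval.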
EXPERIMENTS AND RESULTS
In this section, the proposed methods are evaluated experimentally by comparing them to the noniterative edition of the scheduling framework, the load balancing method [Stuijk et al. 2007 ], the HAFF algorithm [Zhang et al. 2010] , the blocked schedule-based throughput analysis method [Tang et al. 2015] , and the MILP-based schedule algorithm [Damavandpeyma et al. 2011] . We first introduce the platform configuration and the benchmark used in the experiments. Then the algorithm performance is evaluated in terms of throughput.
Platform Configuration and Benchmark
We use several MPSoCs with different configurations in the experiment to test performance of the algorithm. The MPSoCs are configured with different processor numbers, ranging from two to six. We assume that the processors and memories are interconnected by an NoC [Goossens et al. 2013; Stefan et al. 2014] and that the NoC links are reserved for each connection (e.g., by using the TDMA policy).
The memory access delay is primarily composed of the NoC delay and the memory response delay. Data traversing the NoC link can be pipelined, so the delay equals 4+n, assuming that the link and router delay are both one cycle [Stefan et al. 2014] and all routes are of two hops, where n is the data size. The memory response delay depends on the memory type. For the SPM, the response delay is generally one cycle [Banakar et al. 2002] , so the memory access delay of a remote SPM is 4 + 2n. For the off-chip memory, we assume that the worst-case memory access delay is 60 + 9n.
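The delay model used in the experiments can be written as three small functions. The function names are assumptions introduced here, and the constants are the ones stated above for two-hop routes with one-cycle links and routers.

```python
def noc_delay(n: int) -> int:
    """Pipelined NoC traversal for data of size n (two hops assumed)."""
    return 4 + n


def remote_spm_delay(n: int) -> int:
    """Remote SPM access: NoC delay plus the SPM response, 4 + 2n in total."""
    return 4 + 2 * n


def offchip_delay(n: int) -> int:
    """Worst-case off-chip memory access delay assumed in the experiments."""
    return 60 + 9 * n


# Example: transferring data of size 8.
for name, fn in [("NoC", noc_delay), ("remote SPM", remote_spm_delay),
                 ("off-chip", offchip_delay)]:
    print(name, fn(8))
```

For data of size 8, the remote SPM access takes 20 cycles and the off-chip access takes 132 cycles under this model, which illustrates why placing heavily used FIFOs in SPMs matters.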
A set of practical applications and randomly generated SDFGs with different sizes are used as benchmarks for performance evaluation. The practical applications include the H.263 decoder [Stuijk et al. 2007], H.263 encoder [Oh and Ha 2004], samplerate conversion [Bhattacharyya et al. 1999], bipartite [Bhattacharyya et al. 1999], MPEG-4 SP decoder [Geilen 2010], MP3 decoder granule/block [Geilen 2010], modem [Bhattacharyya et al. 1999], and channel equalizer [Ritz et al. 1995]. The open source tool SDF3 [Stuijk et al. 2006b] is used to generate random SDFGs of 10, 20, and 30 tasks. The repetition vector is also randomly generated with a constraint on the sum of the repetition vector entries. In our experiments, the constraint is set to five times the number of tasks. Hence, there are about 50 to 150 task instances in one application iteration. The graph properties, such as the in-degree/out-degree and edge production/consumption rate, are all randomly generated given the corresponding average value, minimum/maximum value, and variation. The in-degree and out-degree are both given the average value and variation of 2, the minimum value of 0, and the maximum value of 4. The production and consumption rates are given the average and variation of 5 and 7, the minimum value of 1, and the maximum value of 9. For each graph size, 100 random graphs are generated. The SDFG parameters, including task execution time and edge token size, are randomly generated. The task execution time is uniformly distributed between 400 and 1,000, and the token size is randomly generated under the constraint of the communication and computation ratio (CCR), which is defined as the ratio between the number of memory access operations and the total task execution time. In the experiments, different CCRs are used to configure the application to evaluate the effect of the CCR on performance.
Results of Random Applications
This section evaluates the performance of the algorithms proposed in this article with respect to synthesized applications. Several denotations are used to represent the comparison between different algorithms. "Comp1" represents the throughput improvement by searching the FIFO-throughput Pareto space over using the FIFO distribution with the maximum ideal throughput (i.e., the noniteration strategy). In this comparison, the same task/FIFO allocation strategy and scheduling method are used for each FIFO distribution. "Comp2" represents the throughput improvement by the use of FAATA over the load balancing method, with the GFAO and pipelined schedule method being used to allocate the FIFO and make the final schedule. "Comp3" represents the throughput improvement by the use of GFAO over HAFF, and the pipelined schedule method is used to construct the final schedule. "Comp4" compares the throughput analysis method proposed in this work to that in Tang et al. [2015] . "Comp5" compares the scheduling framework proposed here without performing iteration to that in Damavandpeyma et al. [2011] .
To give a global view of the performance of the proposed scheduling framework, we first summarize the average throughput improvement of each algorithm in the scheduling framework over all applications and platforms with different configurations. The proposed scheduling framework outperforms other scheduling strategies in different aspects. The iteration strategy outperforms the noniteration strategy in throughput by 38.36% on average. Even though this is achieved at the expense of runtime, it is a practical method for offline optimization, where runtime is not a serious concern. In addition, using FAATA improves throughput by 8.08% on average. The FIFO allocation also influences performance: GFAO improves throughput over HAFF by 22.47% on average. The proposed schedule strategy is an effective method to produce compact schedules. Compared to the blocked schedule-based throughput analysis method proposed in Tang et al. [2015], an average performance improvement of 6.54% can be achieved using our method. This is because in our method, the system executes in a self-timed style and different iterations can overlap, thus improving throughput. Since the MILP model [Damavandpeyma et al. 2011] is too large for applications with more than 10 tasks, we only carried out experiments on 10-task applications for the MILP. The proposed scheduling framework performs better than the MILP-based method, with performance being improved by 23.71%. Table IV illustrates the average throughput improvement between different algorithms with respect to application size. The improvement of the iteration strategy reaches 34.15% for the 10-task application set and increases to 40.80% for the 30-task application set, because larger applications also have a larger FIFO distribution space, providing more potential to improve performance.
The performance of FAATA decreases with the application size, with performance improvement decreasing from 11.15% for 10-task applications to 6.44% for 30-task applications. GFAO performs better than HAFF for different application sets, with throughput being improved by more than 19.41%. The proposed schedule strategy outperforms the blocked schedule. An average performance improvement of greater than 4.97% can be achieved for different application sets. Table V depicts the average throughput improvement between different algorithms with respect to the processor number in the platform. On the two-processor platform, the iteration strategy achieves an improvement of 38.59% over the noniteration strategy, and the improvement decreases to 37.54% for the six-processor platform. The performance improvement of FAATA decreases from 9.00% to 7.55%. The FIFO allocation algorithm has a decreasing trend, with the value decreasing from 34.28% to 21.07%. For the schedule algorithm, the throughput improvement shows an increasing trend with the platform size, with the value increasing from 4.70% to 7.42%. The proposed scheduling framework outperforms the MILP for platforms with different processors, with the performance improvement being greater than 13.53%.
Table VI compares different algorithms with respect to the CCR of the application. For an application with a larger CCR value, the overhead of accessing the memory grows, making it more important to configure the FIFO size and FIFO allocation. As demonstrated by the results in Table VI, the performance improvements of the iteration strategy and FAATA increase with the CCR value. The iteration strategy has a performance improvement of 27.52% when the CCR is 1, and the improvement grows to 44.24% as the CCR increases to 5, showing that for a large CCR, it is more important to search the FIFO distribution space to obtain good performance. Similarly, the performance of FAATA increases with the CCR value, with the improvement increasing from 6.54% to 9.04%. GFAO outperforms HAFF for different CCRs by more than 10%. The performance of the self-timed schedule decreases with the CCR value, with the performance improvement decreasing from 6.90% to 5.23%. The proposed scheduling framework also outperforms the MILP for applications with different CCRs, with the performance improvement being more than 17.09%.
In the experiment, the average runtime of the load balancing method is 1.09 seconds, and that of FAATA is 7.16 seconds. FAATA consumes more time because it needs to evaluate the effect of local FIFO allocation. The runtime of GFAO is about three times that of HAFF: the average runtime of HAFF is 4.37μs, and that of GFAO is 12.27μs. Throughput analysis consumes the largest part of the total computation time: the pipelined method consumes 19.01 seconds on average, and the blocked schedule-based method consumes 16.62 seconds on average. The runtime increases with the task number of the application. As the task number grows from 10 to 30, the runtime of FAATA grows from 0.80 seconds to 14.56 seconds, and the runtime of the pipelined method grows from 2.37 seconds to 44.14 seconds. However, the runtime of GFAO grows little, with the value increasing from 10.63μs to 15.60μs. The MILP-based method is time consuming, with the solving time being more than 10 minutes, and as the graph size grows, the runtime grows rapidly (e.g., even 12 hours is not enough to produce a valid solution for the 20-task applications).
Results of Practical Applications
To evaluate the practical usage of the scheduling framework proposed in this article, we test it on a set of real applications and compare it to the available methods. The parameters of the real applications are shown in Table VII. The task number ranges from 4 to 16; however, the number of task instances in one application iteration can exceed 1,000. We divide these applications into two sets according to the task number: the small application set consists of the applications with fewer than 10 tasks, and the large application set contains those with more tasks. We test the small application set on a two-processor platform and the large application set on both a two-processor platform and a four-processor platform. Table VIII depicts the experimental results of practical applications on the MPSoC with two processors. The table shows that the proposed scheduling framework works for practical applications. The iteration strategy improves the throughput by 4.42% to 16.22%. FAATA is effective for all real applications, and for some applications it is especially important to consider the FIFO allocation while mapping the tasks. For example, for MP3 decoder granule, samplerate, and bipartite, FAATA achieves a throughput improvement of more than 35%. The fourth column of Table VIII shows that the FIFO allocation has more impact on system performance for real applications. For the applications we use, the throughput improvement of GFAO over HAFF ranges from 17% to 71%, which is larger than for the randomly generated applications. For real applications, the self-timed schedule still improves performance compared to the blocked schedule, although the improvement is not very significant. Cyclic precedences in the application model constrain the overlap between different iterations, which limits the achievable throughput gain.
Table IX depicts the experimental results for large practical applications on the MPSoC with four processors. The table shows that the proposed scheduling framework still works on larger platforms. Specifically, the iteration strategy improves the throughput by 7.26% to 15.48%. FAATA performs better on the larger platform for the modem and channel equalizer, with the throughput improvement reaching 11.62% and 7.26%, respectively. For the applications we use, the throughput improvement of GFAO over HAFF ranges from 5.54% to 45.76%. The self-timed schedule improves the performance compared to the blocked schedule, with the improvement reaching as high as 10.46% for the channel equalizer.
The MILP-based method is also applied to real applications in the experiment. However, most of the real applications are too large for the MILP solver to handle in a limited time. For example, 3 hours are not sufficient to produce a valid solution for an MPEG-4 SP decoder that has 63 task instances. Thus, only the results of applications with few task instances (i.e., MP3 decoder granule, modem, and channel equalizer) are recorded. According to the experimental results, the proposed method outperforms the MILP-based method, improving the throughput by more than 40% for all three of these applications, which shows that appropriate use of the SPM has a greater effect on performance for real applications.
The runtime of the proposed algorithm is not large for real applications. In the experiment, the average execution times of LB and FAATA are 0.096 and 1.010, respectively. The runtime of GFAO is about twice that of HAFF, with the average execution times of HAFF and GFAO being 1.66 and 3.55μs, respectively. Throughput analysis consumes the largest part of the total computation time, especially for applications with many task instances (e.g., H.263 decoder, samplerate conversion, and MP3 decoder block). For these three applications, the pipelined method and the blocked schedule-based method consume 15.9 and 14.9 minutes, respectively. For the other applications, the average execution times are 4.43 and 3.49 seconds, which are quite small. Compared to the MILP-based method, the method proposed in this article is less complex. Whereas the MILP-based method needs several hours to find a solution, the proposed method finishes in several seconds.
CONCLUSIONS
This article investigates the problem of scheduling streaming applications on MPSoCs with a predictable memory hierarchy by taking into account memory access latency and memory capacity constraints. The resources of practical embedded systems are limited. Therefore, it is critical to utilize them appropriately to meet the timing requirements of the application. An efficient ITFCS framework is proposed in this work to solve the scheduling problem. The scheduling framework consists of FIFO-throughput Pareto space search, the FAATA algorithm, the GFAO algorithm, and a self-timed throughput analysis method. Extensive experiments are carried out on both random SDFGs and practical applications. Experimental results show that our method outperforms other methods. In the future, we will study the use of metaheuristics to solve the problem.
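The way the four components fit together can be illustrated with a minimal sketch. It assumes that each candidate FIFO size distribution from the Pareto space search is evaluated in turn and that the best-performing schedule is kept; the function name `co_schedule` and the callable parameters are hypothetical stand-ins for the Pareto search, FAATA, GFAO, and the self-timed throughput analysis, not the authors' implementation.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

Schedule = Tuple[float, Any, Any, Any]  # (throughput, distribution, mapping, allocation)


def co_schedule(
    distributions: Iterable[Any],
    map_tasks: Callable[[Any], Any],
    allocate_fifos: Callable[[Any, Any], Any],
    analyze_throughput: Callable[[Any, Any, Any], float],
) -> Optional[Schedule]:
    """Evaluate every candidate FIFO size distribution and keep the best one.

    All four arguments are placeholders (assumptions of this sketch):
    `distributions` for the Pareto space search output, `map_tasks` for
    the task-to-processor mapping, `allocate_fifos` for the
    FIFO-to-memory allocation, and `analyze_throughput` for the
    throughput analysis.
    """
    best: Optional[Schedule] = None
    for dist in distributions:
        mapping = map_tasks(dist)
        allocation = allocate_fifos(dist, mapping)
        throughput = analyze_throughput(dist, mapping, allocation)
        # Keep the first schedule seen, or any strictly better one.
        if best is None or throughput > best[0]:
            best = (throughput, dist, mapping, allocation)
    return best


# Toy usage with stand-in callables: the throughput peaks at distribution 2.
toy_best = co_schedule(
    distributions=[1, 2, 3],
    map_tasks=lambda d: d * 10,
    allocate_fifos=lambda d, m: m + 1,
    analyze_throughput=lambda d, m, a: -float((d - 2) ** 2),
)
```

Passing the stages in as callables keeps the loop independent of how each stage is realized, so a different mapping or allocation heuristic can be substituted without changing the iteration itself.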
