Abstract-Software-controlled Scratch-Pad Memory (SPM) is a desirable candidate for on-chip memory units in embedded multi-core systems because of its small die area and low power consumption. In particular, data placement on SPMs can be explicitly controlled by software. Therefore, data distribution on the SPMs of a multi-core system is critical to exploiting the advantages of SPM. Previous research efforts on data allocation did not consider the placement of array data accessed in loops, although loops are the most time-consuming and energy-consuming part of most computation-intensive applications. In this paper, we propose a high-performance, low-overhead data distribution technique based on dynamic programming, the Iterational Optimal Loop Data Distribution algorithm with Duplication (IOLDD). It optimizes data allocation of both scalar and array data for embedded multi-core systems with SPMs. The experimental results show that the IOLDD algorithm reduces energy consumption by 30.12% and 14.52% on average compared with random data distribution and the greedy strategy, respectively. It also reduces memory access time by 18.45% and 18.38% on average compared with the random distribution strategy and the greedy strategy, respectively.
I. INTRODUCTION
MULTI-CORE design has become the mainstream of high-performance embedded systems because of the ever-increasing demand for performance in applications such as digital signal processing, wireless communication, and mobile computing. Meanwhile, the design of multi-core systems usually has to satisfy strict requirements on low power consumption and small die area. Therefore, Scratch-Pad Memory (SPM) has become an effective alternative to cache as on-chip memory in embedded multi-core systems. Software-controlled SPM guarantees a single-cycle access time with low energy consumption and small die area compared with hardware-controlled cache. In particular, data on SPM can be precisely controlled by software during system design. Many digital signal processing systems such as Analog Devices ADSP-BF534/6/7 [1] and TI's TMS370CX7X [2], as well as multi-core architectures such as NVIDIA GeForce 8800 [3], employ SPM as on-chip memory [4]. Therefore, how to efficiently distribute data items to on-chip SPMs to minimize the memory access cost is one of the key problems in fully exploiting the advantages of SPMs in embedded multi-core systems. Because loops are the most time-consuming and energy-consuming code sections in most computation-intensive applications, it is desirable to have efficient techniques to allocate array data, as well as scalar data, to multiple SPMs in a multi-core system.
Many multi-core systems employ a symmetric multiprocessing (SMP) architecture in which multiple cores share a centralized main memory. Each core is equipped with a small and fast on-chip SPM to speed up data accesses. Usually, there is only one copy of each data item. Data items accessed by multiple cores can be spread across the SPMs of those cores. The cost of searching for data items and moving them around, however, is high in multi-core systems. In this paper, we propose a technique for keeping and updating one copy of array data in main memory efficiently with a minimized backup cost. Furthermore, we propose a data duplication method that replicates local copies of read-only data to further reduce the data access cost.
Because the capacity of SPMs is limited, only frequently used or critical data should be loaded into SPMs. Other data items are stored in off-chip main memory [5]. In this paper, a dynamic programming approach is used to produce optimal results for embedded multi-core systems with SPMs. The approach distributes both array and scalar data items in loops on multi-core systems while minimizing time cost and energy consumption.
In this paper, we make the following contributions:
1) We propose a polynomial-time data distribution algorithm, the Iterational Optimal Loop data Distribution algorithm with Duplication (IOLDD), to minimize the total cost of memory access on multi-core systems equipped with SPMs for both arrays and scalar variables in loops.
2) We present a data duplication technique and integrate it into the data distribution algorithm. It further reduces the total cost of memory accesses by replicating multiple copies of read-only data items.
The rest of this paper is organized as follows. Related works are discussed in Section II. Models and some basic concepts are introduced in Section III. A motivational example is discussed in Section IV to illustrate some basic ideas of our algorithm. The problem definitions used in the paper are given in Section V. In Section VI, details of our improved dynamic approach IOLDD are presented. Section VII presents our experiments and Section VIII concludes the whole paper and mentions the future work.
II. RELATED WORKS
Many works have tackled the data distribution problem. Some of them proposed static data distribution methods, in which the data distribution is determined for the whole program and does not change during the execution of the program [6] [7] [8]. The drawback of static methods is that they cannot exploit the varying data locality of a running program. The other category of previous techniques is dynamic data distribution [9] [10] [11]. In dynamic methods, the program is divided into regions, and data movement instructions are inserted before each region to set up the data distribution for that region. The data distribution remains the same during the execution of a particular region. A greedy strategy, for example, is used by Udayakumaran to find a data distribution for each region [12] [13]. Since dynamic data distribution methods take advantage of the data locality of each program region, they perform better than static ones.
Array data is different from scalar data. The elements of an array occupy contiguous memory locations, and a single iterative statement in a loop may process arbitrarily many elements of an array. Distributing array data is, therefore, quite different from distributing scalar data. To the best of the authors' knowledge, little research has been conducted on data distribution for arrays in a loop, and some methods rely heavily on the loop's characteristics. O. Ozturk et al. proposed algorithms to manage data for array-intensive nested loops with regular data access patterns [14] [15]. R. Thakur et al. proposed efficient algorithms to manage dynamic redistribution of arrays [16]. W. Huan et al. proposed algorithms to optimize all of the data segments, including global data, heap and stack data in general [17]. These methods do not consider the case of distributing both array and scalar data items in a loop on multi-core systems.
Research efforts have also been made on the data distribution problem for SPMs. R. Banakar et al. proposed a simple SPM data management algorithm, but the algorithm cannot guarantee optimal results and is only applicable to scalar data [4]. Jun Zhang et al. proposed an algorithm for loops on single-core systems instead of multi-core systems [18]. Y. Guo and Q. Zhuge et al. proposed a polynomial-time algorithm to solve the data distribution problem on multiple types of memory units [19] [20]. However, they only consider distribution for scalar data items and do not address data distribution for loops. The data distribution problem for array data is very important for most applications. In this paper, we focus on developing a dynamic programming approach for both array and scalar data items on multi-core systems.
III. MODELS AND BASIC CONCEPTS
In this section, we first introduce the hardware architecture. Then, we present the program execution model used in this paper.
A. Hardware Architecture
The organization of on-chip memories in our hardware architecture is shown in Fig. 1. Every core has its own on-chip SPM, while all cores share the DRAM main memory. Each core can access its own local SPM. It can also access data items on other cores' SPMs through the interconnect bus. The scratch-pad memory here can be organized as a Virtually Shared SPM (VS-SPM) architecture for on-chip memory that takes advantage of both shared and local SPM [21]. To distinguish them from the local SPM, the SPMs of other cores are referred to as remote SPMs. There is no limit on the number of remote SPMs that a core can access in the architecture.
In Fig. 1, we show three types of memory access, depicted by three types of dotted lines. They are local access, remote access and off-chip access, corresponding to accesses by a core to its local SPM, a remote SPM and off-chip memory, respectively. Because of the communication cost of the interconnect bus, remote access incurs longer latency than local access. Because of the low performance of DRAM and the high communication cost, the latency of off-chip access is much longer than the latencies of both local and remote accesses [22]. In our architecture, each core can access all remote SPMs. Let Dist be the distance between two cores. The cost of remote access is a non-decreasing function f of Dist.
B. Execution Model
In this paper, we consider the data distribution problem for loops in the program that can be parallelized. A barrier is used to synchronize the execution of each iteration across all the loops. We also consider each basic block of a program as a program region. The execution model of loops executed in parallel on a multi-core system is shown in Fig. 2. Assume that there is no conditional branch in the loop body. Each iteration is regarded as a program region by the compiler. The number of accesses to each data item in a program region can be obtained through profiling. The compiler inserts data distribution instructions at the beginning of a loop iteration. Therefore, data items are allocated to the various memory units before parallel regions are executed. If a conditional branch exists in the loop body, each branch should be considered as a program region, and data distribution instructions should be inserted at the beginning of each region. The data distribution problem considered in this paper seeks the optimal data placement on SPMs in multi-core systems. It aims to improve the performance and reduce the cost of memory accesses for our execution model.
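The three access types of the hardware architecture (local, remote and off-chip) can be illustrated with a small cost model. The sketch below is not taken from the paper: the cost constants, the particular form of the non-decreasing function f, and the use of the difference of core indices as Dist are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence

LOCAL_COST = 1      # hypothetical cost of one local SPM access
OFFCHIP_COST = 20   # hypothetical cost of one off-chip access
MAIN_MEMORY = -1    # hypothetical marker: the item lives in main memory


def remote_cost(dist: int) -> int:
    """A hypothetical non-decreasing f(Dist) for remote SPM accesses."""
    return LOCAL_COST + 2 * dist


def access_cost(accesses: Sequence[int], location: int,
                f: Callable[[int], int] = remote_cost) -> int:
    """Total cost of all cores' accesses to one data item.

    accesses[c] is the profiled access count of core c. `location` is the
    index of the core whose SPM holds the item, or MAIN_MEMORY.
    Dist is assumed to be the absolute difference of the core indices.
    """
    total = 0
    for core, count in enumerate(accesses):
        if location == MAIN_MEMORY:
            per_access = OFFCHIP_COST          # off-chip access
        elif location == core:
            per_access = LOCAL_COST            # local access
        else:
            per_access = f(abs(core - location))  # remote access
        total += count * per_access
    return total
```

With these assumed costs, an item accessed only by the second core is cheapest when placed in that core's own SPM, which is the intuition the data distribution problem builds on.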

IV. MOTIVATIONAL EXAMPLE
In this section, a motivational example is presented to illustrate the main idea of the proposed algorithm (IOLDD). The goal of optimization is to minimize the total memory access cost of a parallel iteration in loops.
In this motivational example, we assume all data items have the same size. Thus, the size of an SPM is expressed as the number of data items that it can store. Focusing on the architecture shown in Fig. 1, we assume the system has only two cores, Core 1 and Core 2. Each core is equipped with an on-chip SPM, denoted $SPM_1$ and $SPM_2$, respectively. For simplicity of illustration, we assume that $SPM_1$ has a capacity of two data items and $SPM_2$ has a capacity of three. All cores can access the shared main memory, which is large enough to store all data items. The two loop programs are shown in Fig. 3.
The number of accesses to the seven data items in one iteration is shown in Table I. Both Core 1 and Core 2 can access these data items, and the two cores run in parallel. An "Access" operation includes both "Read" and "Write" operations by a core. In Table I, taking Core 1 as an example, row "Core1 Access" shows the number of times Core 1 accesses each data item. The corresponding loop program on Core 1 is depicted in Fig. 3(a). For data $d_1$, we have $Core1_{Access}(d_1) = 0$ and $Core2_{Access}(d_1) = 1$: in Loop 1, Core 1 neither reads nor writes $d_1$, while in Loop 2, $d_1$ is read once. In this paper, we define a data item that is read by some cores and not written by any core as "read-only" data. The problem in this example is to find a data distribution for the seven data items in each iteration such that the total cost of memory access is minimized. Table II shows the notation, the time cost (in µs) and the definition of memory operations. All data items are assumed to be in main memory in the initial data distribution. In this example, we assume the non-decreasing ...
We solve the problem with three strategies. The first is the greedy strategy (Uday), which is derived from Udayakumaran's algorithm [12] [13] for single-core systems. The second is the single-core dynamic programming strategy (IDAS), which is derived from Zhang's paper [18]. The third is our iterational optimal strategy (IOLDD), which is presented in detail in Section VI.
We call the greedy algorithm derived from Udayakumaran's algorithm [12] [13] "Uday" for short. It distributes data items according to the number of read and write accesses from all cores. The original approach targets only single-core processors; for the purpose of comparison, we adapt it to multi-core systems.
The derived Uday algorithm works as follows. First, data items are sorted by their total number of accesses, which is the sum of the numbers of accesses from all cores to the data item. The compiler then picks the data item with the largest total number of accesses and places it in the available SPM of the core that accesses it most often. When the SPMs of all cores are full, the remaining data items are placed in main memory.
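The greedy procedure described above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: all items are assumed to have unit size (as in the motivational example), and when the preferred SPM is full the sketch falls back to the SPM of the next-heaviest accessor, which is an assumption. The names `greedy_distribute` and `MAIN_MEMORY` are hypothetical.

```python
from typing import Dict, List

MAIN_MEMORY = -1  # hypothetical marker for off-chip main memory


def greedy_distribute(accesses: Dict[str, List[int]],
                      capacity: List[int]) -> Dict[str, int]:
    """Greedy placement of unit-size items.

    accesses[item][c] is the access count of core c; capacity[c] is the
    number of items the SPM of core c can hold.
    Returns item -> core index (or MAIN_MEMORY).
    """
    free = list(capacity)
    placement: Dict[str, int] = {}
    # Most-accessed items first (sum of accesses over all cores).
    order = sorted(accesses, key=lambda d: sum(accesses[d]), reverse=True)
    for item in order:
        # Try cores from the heaviest accessor to the lightest (assumption).
        cores = sorted(range(len(free)),
                       key=lambda c: accesses[item][c], reverse=True)
        placement[item] = MAIN_MEMORY  # default when every SPM is full
        for core in cores:
            if free[core] > 0:
                free[core] -= 1
                placement[item] = core
                break
    return placement
```

Because each item is placed without regard to the items that follow, a heavily accessed item can occupy a slot that would have saved more cost for a later item, which is the weakness a dynamic programming formulation avoids.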
The IDAS algorithm is derived from Zhang's algorithm [18]. Although it is a dynamic programming algorithm, it is limited to single-core systems and cannot fully utilize the benefits of the private local SPM on each core. In contrast, the IOLDD algorithm proposes a duplication mechanism that exploits the distinct costs of moving data to and from local SPM, remote SPM and main memory. It is a technique that achieves higher time efficiency at the cost of space. In multi-core systems, a data item traditionally has only one copy, located in either one of the SPMs or main memory. Multiple cores may access the same data in one parallel region, so it is sometimes beneficial to duplicate data and place multiple copies of the same data in different SPMs. However, some data items cannot be duplicated because of the high synchronization cost. Therefore, we only allow read-only data items to be duplicated.
For the function dist, $dist_{P-1}(d_j)$ represents the location of data $d_j$ before the execution of iteration $P$ in the loop body.
Loop Data Distribution Problem. Given a set of data items $D$ in iteration $P$ and a set of SPMs $SPM$, let $Size_{spm_i}$ denote the capacity of $spm_i \in SPM$ and $Size_{d_j}$ denote the size of data $d_j$. The loop data distribution problem is to find a mapping between data $d_j \in D$ and SPM $spm_i \in SPM$ such that the total cost of memory access $Cost_{dist_P(d_j)}(d_j)$ is minimized, and inequality
The data migration cost is the cost of retrieving one data item from its original location and writing it to another memory unit. Hence, the migration cost is defined as the sum of one read operation from the original location and one write operation to the destination memory unit, as shown in Equation 1.
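The definition of the migration cost (one read from the original location plus one write to the destination) can be written out as a short sketch. Since Equation 1 itself is not reproduced here, the per-unit read and write cost tables, the unit names, and the zero cost for an item that is already in place are assumptions made for illustration.

```python
from typing import Mapping


def migration_cost(src: str, dst: str,
                   read_cost: Mapping[str, int],
                   write_cost: Mapping[str, int]) -> int:
    """Cost of moving one data item from memory unit `src` to `dst`.

    read_cost[u] / write_cost[u] are hypothetical costs of one read from,
    or one write to, memory unit u.
    """
    if src == dst:
        return 0  # assumption: nothing to move when the item is in place
    return read_cost[src] + write_cost[dst]
```

For example, with a read cost of 20 for main memory and a write cost of 2 for an SPM, migrating an item from main memory into that SPM would cost 22 under these assumed numbers.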
Related Array. An array is related when the array data item accessed in the previous iteration is still accessed in the current iteration.
Unrelated Array. An array is unrelated when the array data item accessed in the previous iteration is not accessed in the current iteration. The relation of an unrelated array $B$ is denoted $relation_B$ and equals −1. All data items in an unrelated array are also unrelated. We define:
Definition 2: Updating Cost. The updating cost is computed only for array items, to ensure that main memory holds a copy of the array with the newest data values. It refers to the cost of reading an array data item from the location to which it was distributed (other than main memory) and writing it back to main memory. With the array updated in main memory, we do not need to search for data items scattered across different SPMs. Array data items are convenient to access because they are stored contiguously in main memory.
Data items in an array may come from related arrays or unrelated arrays. To eliminate extra cost, only array data $A[i - relation_A]$ is updated to main memory. When array data is already in main memory, it does not need to be updated.
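A minimal sketch of the updating cost from Definition 2 follows. It covers only the two cases stated in the text (scalar items and items already in main memory incur no updating cost) and deliberately ignores the selection of which element $A[i - relation_A]$ to update. The cost tables and the name `m0` for main memory are assumptions.

```python
from typing import Mapping

MAIN = "m0"  # main memory


def updating_cost(location: str, is_array: bool,
                  read_cost: Mapping[str, int],
                  write_cost: Mapping[str, int]) -> int:
    """Cost of writing an array item back so main memory has the newest copy."""
    if not is_array or location == MAIN:
        return 0  # scalars are never updated; items in m0 are already current
    # One read from the SPM holding the item, one write to main memory.
    return read_cost[location] + write_cost[MAIN]
```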
Theorem 1: In the loop data distribution problem, array data $d_j$ will be distributed to the main memory if and only if $relation_{d_j} \neq 0$ and dist
Definition 4: Moving Cost. The moving cost is computed when data is going to be duplicated. It equals the sum of the costs of migrating the data item from its original location to the destination SPMs. We can compute the cost by Equation 4.
According to Equation 4, the moving cost depends on $dist_{P-1}(d_j)$. If data $d_j$ was in an SPM before the execution of iteration $P$ of the loop, the moving cost is the sum of the costs of migrating the data from the original SPM $dist_{P-1}(d_j)$ to each destination SPM. On the other hand, if data $d_j$ was in main memory before the execution of iteration $P$, that is, $dist_{P-1}(d_j) = m_0$, the moving cost is the cost of migrating the data from main memory to one destination SPM $spm_i$, plus the sum of the costs of migrating $d_j$ from this $spm_i$, which already holds one copy, to the other destination SPMs $spm_k$ ($spm_i, spm_k \in SPM$ and $k \neq i$).
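The two cases of the moving cost can be sketched as below. This follows the prose description rather than Equation 4 itself, which is not reproduced here. The cost tables are hypothetical, the first listed destination is assumed to be the SPM that receives the copy from main memory, and a destination that already holds the item is assumed to cost nothing.

```python
from typing import Mapping, Sequence

MAIN = "m0"  # main memory


def migration_cost(src: str, dst: str,
                   read_cost: Mapping[str, int],
                   write_cost: Mapping[str, int]) -> int:
    """One read from `src` plus one write to `dst`."""
    return read_cost[src] + write_cost[dst]


def moving_cost(origin: str, destinations: Sequence[str],
                read_cost: Mapping[str, int],
                write_cost: Mapping[str, int]) -> int:
    """Cost of placing copies of one item on every destination SPM."""
    targets = [d for d in destinations if d != origin]  # skip existing copy
    if not targets:
        return 0
    if origin != MAIN:
        # Case 1: the item was in an SPM; migrate from it to each target.
        return sum(migration_cost(origin, d, read_cost, write_cost)
                   for d in targets)
    # Case 2: the item was in main memory; fetch it once into the first
    # target, then copy from that SPM to the remaining targets.
    first, rest = targets[0], targets[1:]
    return (migration_cost(MAIN, first, read_cost, write_cost)
            + sum(migration_cost(first, d, read_cost, write_cost)
                  for d in rest))
```

The second case reflects why only one expensive off-chip read is needed: once one SPM holds a copy, the remaining copies are made from on-chip memory.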
Definition 5: Duplication Cost. The duplication cost is the cost of copying one data item to other SPMs on multi-core systems. It can be computed as the sum of the accessing cost for "Read" or "Write" operations in local SPM, the moving cost to the destination SPMs and the updating cost for array data items.
Without the duplication mechanism, a data item that is intensively accessed by multiple cores incurs many remote accesses. Wherever the data is placed, only one core benefits from its local SPM. The data duplication technique solves this problem by distributing a copy of the data item to each SPM that may benefit. As a result, the time and energy cost incurred by remote accesses is reduced.
In exclusive copy mode, data consistency is not a concern. In data duplication mode, however, data consistency becomes a key issue. Since it is common for multiple cores to access the same data in one parallel region, the data becomes inconsistent if multiple cores write to it. Although duplicating to-be-written data may be beneficial in write-heavy applications with a well-designed data consistency protocol, the overhead of maintaining data consistency may offset the benefits. Therefore, only read-only data is allowed to be duplicated. We define $CWrt_i(d_j)$ to be the number of times data $d_j$ is written by Core $i$. If a data item $d_j$ is updated by a core, it should not be replicated on other cores because of synchronization issues. Hence, the duplication cost of a data item when
Whether or not an array data item can be duplicated, it should be updated in main memory. The updating cost is computed by Equation 2. The total duplication cost is computed by Equation 5.
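The duplication rule (only read-only data may be replicated) can be expressed as a small eligibility check. The sketch below uses the paper's definition of read-only data; choosing every reading core as a duplication target is an assumption, since the text only says that copies go to each SPM that may benefit. The function names are hypothetical.

```python
from typing import List, Sequence


def is_read_only(reads: Sequence[int], writes: Sequence[int]) -> bool:
    """True if some core reads the item and no core writes it.

    reads[c] and writes[c] are per-core counts; writes[c] plays the role
    of CWrt_c(d_j) in the text.
    """
    return sum(reads) > 0 and sum(writes) == 0


def duplication_targets(reads: Sequence[int],
                        writes: Sequence[int]) -> List[int]:
    """Cores whose SPMs would receive a copy (assumed: every reading core)."""
    if not is_read_only(reads, writes):
        return []  # written data is never duplicated
    return [core for core, count in enumerate(reads) if count > 0]
```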
VI. A DYNAMIC ALGORITHM FOR LOOP DATA DISTRIBUTION WITH DUPLICATION ON MULTI-CORE
In this section, we present the details of the IOLDD algorithm on multi-core systems. The algorithm is a dynamic programming method that uses the duplication technique. Each iteration of the IOLDD algorithm consists of four steps. First, it computes the cost of each memory access. Second, it uses dynamic programming with duplication to decide the optimal loop data distribution. Third, it redistributes data items. Fourth, it decides whether to stop the algorithm. The algorithm is shown in Algorithm 1.
Step 1. The cost of each memory access is computed for every data item $d_j$ in the list $D = (d_1, d_2, \ldots, d_N)$ and every $spm_i$ in the list of SPMs $SPM = (spm_1, spm_2, \ldots, spm_T)$. For array items, we add an updating cost (Definition 2) to ensure that main memory holds a copy of the newest array data.
[Algorithm 1: The IOLDD algorithm]
Since we have only two cores, the duplication cost can be written as $Cost_{spm_{1+2}}(d_j)$. The result of optimal data distribution is shown in "Iteration 2" of Fig. 4. Based on the number of accesses in Table I, if all data items have their initial locations in the main memory, the costs of memory access are computed in Table IV.
Step 2. Lines 5-40 determine the optimal loop data distribution with dynamic programming and the duplication technique, so that the cost of memory access during the execution of one iteration is minimal.
Since a cost table is built in Step 1, the optimal data distribution can be determined by using a multidimensional dynamic programming table. The structure of the table is as follows: the first dimension is indexed by data $d_j$, and each of the other dimensions is indexed by the available space of one $spm_i \in SPM$, excluding $m_0$. We assume $m_0$ is main memory and is large enough to hold all data items in the program.
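The table structure described above (one dimension for the data items, one per SPM for its available space) can be illustrated with a memoized recursion over the state (item index, free space of each SPM). This is a simplified sketch, not Algorithm 1: it omits duplication and the updating cost, and the cost-table layout (column 0 for main memory $m_0$, column i for $spm_i$) is an assumption.

```python
from functools import lru_cache
from typing import List, Sequence, Tuple


def optimal_distribution(cost: Sequence[Sequence[int]],
                         size: Sequence[int],
                         capacity: Sequence[int]) -> Tuple[int, List[int]]:
    """Minimum-cost placement of items onto SPMs with limited capacity.

    cost[j][0] is the access cost of item j in main memory m0 and
    cost[j][i] (i >= 1) its cost in spm_i. Returns the minimum total cost
    and the chosen location index for every item.
    """
    n, t = len(cost), len(capacity)

    @lru_cache(maxsize=None)
    def best(j: int, free: Tuple[int, ...]) -> Tuple[int, Tuple[int, ...]]:
        if j == n:
            return 0, ()
        # Option 0: keep item j in main memory, which is never full.
        sub_cost, sub_path = best(j + 1, free)
        result = (cost[j][0] + sub_cost, (0,) + sub_path)
        # Options 1..T: place item j in spm_i if it still fits.
        for i in range(t):
            if free[i] >= size[j]:
                remaining = free[:i] + (free[i] - size[j],) + free[i + 1:]
                sub_cost, sub_path = best(j + 1, remaining)
                candidate = (cost[j][i + 1] + sub_cost, (i + 1,) + sub_path)
                if candidate[0] < result[0]:
                    result = candidate
        return result

    total, path = best(0, tuple(capacity))
    return total, list(path)
```

The memoized states correspond to the cells of the multidimensional table, so the number of distinct states is bounded by the number of items times the product of the (capacity + 1) values of the SPMs; the path returned with each state plays the role of a backtracking record.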
The IOLDD algorithm is presented in Algorithm 1. During the computation, $location_{d_j}$ is used to keep the intermediate data, and the array BackPath[j, i_1, i_2
We compute the costs for data item $d_1$ in the columns "$d_1$" of all 2-D tables in a similar way. The final total cost of memory access with the optimal data distribution, starting from the initial data distribution, is 599. The backtracking path is indicated by the underlined cells in Table V.
Step 4. Lines 42-44 compare the results with those of the previous iteration to decide when to stop the algorithm. At the end of each iteration of the IOLDD algorithm, we compare the distribution result with the result of the previous iteration. If they are the same, the algorithm stops: the data distribution is optimal, and it can be reused for later loop iterations.
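The stopping rule of Step 4 amounts to iterating until the distribution no longer changes. The sketch below shows only this control flow; `distribute_once` is a hypothetical stand-in for Steps 1 to 3, and the `max_iterations` guard is an added safety assumption rather than part of the algorithm.

```python
from typing import Callable, Dict, List

Distribution = Dict[str, str]  # data item -> memory unit name


def iterate_until_stable(
        distribute_once: Callable[[Distribution], Distribution],
        initial: Distribution,
        max_iterations: int = 100) -> List[Distribution]:
    """Repeat the per-iteration distribution until two results match.

    Returns the history of distributions, starting with `initial`.
    """
    history = [initial]
    for _ in range(max_iterations):
        current = distribute_once(history[-1])
        history.append(current)
        if current == history[-2]:
            break  # same as the previous iteration: stop recomputing
    return history
```

Once the history ends with two equal distributions, later loop iterations can reuse the last one without running the dynamic programming step again.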
From the four steps above, we can summarize five features of the IOLDD algorithm. First, it computes a cost table of each memory access and duplication cost. Second, it handles array items in loops. Third, it uses the relations of array items to reduce the updating costs. Fourth, it redistributes array items. Fifth, it applies duplication on multi-core systems.
The time complexity of the IOLDD algorithm is $O(N \times Size_{spm_1} \times Size_{spm_2} \times \cdots \times Size_{spm_T})$, where $N$ is the number of data items in $D$ and $T$ is the constant number of SPMs.
VII. EXPERIMENT
In this section, we compare our dynamic IOLDD algorithm with a greedy algorithm and a random algorithm. The benchmarks are chosen to contain both scalar and array data items in loop programs. The following benchmarks are used in our experiments: 2IIR, 4-lattice, ellENC, ellfilter, 8-lattice, allpole, C-sehwa, diff2, diff-ct1 and voltera. The three algorithms are evaluated by the time cost and energy consumption of memory accesses under the distributions they generate.
A. Experimental Setup
The architecture in the experiment has two types of memory units: four on-chip SPMs made with SRAM, one for each of the four cores, and a block of main memory made with DRAM. In this paper, we assume these four SPMs have the same size, with a capacity of 8 KB each. We also assume the capacity of the off-chip main memory is 2.56 MB, which is large enough to store all data items. A custom simulator based on SimpleScalar is developed to simulate the process of data distribution and obtain the costs of memory accesses for the program. With the CACTI tools [23] provided by HP, we obtain the latency and energy consumption of memory accesses for these two memory types, shown in Table VI, and use them as parameters for our benchmark programs. In the experiment, we run the random, greedy and dynamic programming (IOLDD) algorithms and compute the time cost and energy consumption of all data accesses. Our program can be integrated into a compiler.
B. Experimental Results
In Fig. 5, the algorithms are compared on ten benchmarks: 2IIR, 4-lattice, ellENC, ellfilter, 8-lattice, allpole, C-sehwa, diff2, diff-ct1 and voltera. The time costs and energy consumption of data distributions on multi-core systems are also presented. In Table VII, the column "Random" represents the algorithm in which data items are randomly picked and distributed to the four on-chip SPMs of the cores. Because of the randomness of this technique, the experiment is repeated 10 times and the average is reported. The column "Uday" represents the greedy algorithm derived from Udayakumaran's algorithm [12]. The column "IOLDD" denotes our IOLDD algorithm, a dynamic programming algorithm applied to both array and scalar data items in loops on multi-core systems. The percentage of improvement of the IOLDD algorithm over the "Random" technique is shown in column "Imprv (IOLDD/Random)"; the average improvement in time cost is 18.45%. Column "Imprv (IOLDD/Uday)" shows the improvement of the IOLDD algorithm over the Uday algorithm; the average improvement is 18.38%. As the experimental results show, our IOLDD algorithm achieves the lowest time cost on average among the three techniques. In the best case, C-sehwa, the improvement in time cost is 52.12% over the "Random" algorithm and 54.26% over the Uday algorithm.
The effective data distribution reduces not only the data access time but also the energy consumption. Both Fig. 5 and Table VIII compare the energy consumption of the data distributions generated by the various techniques. The average improvement of the IOLDD algorithm over the random technique is 30.12%, and its average improvement over the Uday algorithm is 14.52%, so the IOLDD algorithm also achieves the lowest energy consumption on average. In the best case, C-sehwa, the improvement in energy consumption is 64.63% over the "Random" algorithm and 57.99% over the Uday algorithm.
The experiments indicate that the IOLDD algorithm obtains lower time and energy costs in most cases compared with the Uday algorithm, which employs a greedy strategy. Its major advantages over the Uday algorithm lie in three aspects. First, the IOLDD algorithm considers the initial data distribution of the loops and the effect of migrating data items. Therefore, it can generate the optimal solution for iterational data distribution. Second, the IOLDD algorithm takes the cost of updating array items into account and uses the relations of these array items to minimize that cost, while the Uday algorithm does not. Third, the IOLDD algorithm uses the duplication technique to reduce the cost.
VIII. CONCLUSION AND FUTURE WORK
In this paper, we achieve the minimum cost of loop data distribution on multi-core systems by developing an Iterational Optimal Loop data Distribution algorithm with Duplication (IOLDD). The algorithm is used on multicore systems, while the algorithm Zhang et al. proposed is limited to single-core systems and cannot fully utilize the benefits of the private SPM on each core [18] . The IOLDD algorithm improves the performance on multi-core systems by taking the relations of an array, the updating cost and the duplication technique into consideration.
In future work, we will further consider the problem of distributing loop data and avoiding contention when multiple cores write to the same data. As to the duplication mechanism, we will explore whether data that is not read-only can still be duplicated, and how to duplicate it. We will also consider the case in which the main memory is a hybrid of DRAM and non-volatile memory (NVM), and address the problem of reducing "Write" activities to the NVM in such a heterogeneous system.
