Abstract-The era of distributed computing, where applications are executed on platforms such as clusters, grids and/or clouds of computers, has shown the need to take into account the communications that take place on distributed computer architectures when executing applications. In that environment, different communication-aware mapping techniques have been proposed for improving system performance, both for off-chip and for on-chip networks. Some of these proposals are based on heuristic search for finding pseudo-optimal assignments of a given population of tasks and processing elements. Technology improvements have allowed a significant increase in the problem size, multiplying the number of processor cores in each chip. Therefore, the proposals based on heuristic search must be accelerated in order to explore larger search domains within the same execution times.
I. INTRODUCTION
Recent years have witnessed the advent of distributed platforms such as clusters, grids and/or clouds of computers. These distributed platforms are composed of either specific or commodity computers interconnected through high-performance networks (such as Myrinet [1], Gigabit Ethernet [2], etc.) or even the Internet. The use of these platforms has shown the need to take into account the communications that take place on distributed computer architectures when executing applications, and how the communications among threads, tasks and/or processes (depending on the level of parallelism) can have a significant effect on the execution of the applications.
A lot of research has focused on solving the NP-complete problem of efficiently scheduling diverse groups of tasks onto the machines/processors that form the system based on the computational requirements of the applications, for both off-chip and on-chip environments [3], [4], [5]. Some proposals that take into account the communication requirements of the applications have also been made [6], [7]. In both domains, the communication-aware technique is based on performing a heuristic search for finding pseudo-optimal assignments of tasks to processors (or of processing elements to network nodes in the case of Networks-on-Chip (NoCs)). The best results for this mapping problem have been achieved by using a random search method. However, the execution time required for the search is directly related to the problem size, and therefore a parallel implementation of the search is required in order to accelerate it and keep the mapping method feasible. The current trend, not only in the off-chip but also in the on-chip environment, is towards large scales, with up to thousands of nodes onto which tasks (clusters) or processing elements (NoCs) must be mapped in the near future. Therefore, distributed implementations of the search method are required in order to guarantee the feasibility of the communication-aware technique.
Different works have proposed the adaptation of heuristic search methods for their implementation on GPUs [8], [9], and even an extensive survey has been published [10]. However, most of these works compare the performance provided by the GPU, which is a many-core device (tens or hundreds of cores), with the performance provided when executing the same method on a CPU, which is a multi-core device (a handful of processing cores), resulting in an unfair comparison.
In this paper, we propose a comparative study of parallel implementations of the local search method used in both task mapping and topological mapping on different architectures with known theoretical performances. Unlike other comparative studies of heuristic methods implemented on GPUs, we compare the actual performance provided by the parallel version for GPUs with the actual performance provided by the MPI parallel version executed on a cluster computer. Also, we have considered a GPU based on the Fermi architecture [11], which requires a different implementation of the GPU-based algorithm (it changes the way GPUs have usually been programmed). The results show that the GPU implementation provides mappings of similar quality to those of the MPI implementation, but the execution time required for providing that solution is significantly shorter than the one required by the MPI implementation. Moreover, the difference in execution times increases with the problem size. Therefore, these results validate the GPU implementation as a very cost-effective accelerator for Communication-Aware Task Mapping Techniques.
II. PROBLEM DEFINITION AND LOCAL SEARCH ALGORITHM
The topological mapping problem consists of mapping an Application Characterization Graph onto a network topology in such a way that one or more metrics of interest are optimized. To that end, some definitions are needed.
The Application Characterization Graph, $APCG = G(T, C)$, is a directed graph, where $T$ is the set of tasks and $C$ is the set of communications. Each communication $c_{i,j} = (t_i, t_j) \in C$ connects task $t_i \in T$ to task $t_j \in T$. For a communication $c_{i,j} \in C$, the function $vol(c_{i,j})$ returns the communication volume of $c_{i,j}$, that is, the number of bytes that task $t_i$ sends to task $t_j$.
The Topology Graph, $TG = G(N, P)$, is a directed graph which models the network topology. $N$ is the set of processing elements. Note that, in order to take multicore processors into account, $N$ can be defined either by considering each core as a processor, or by considering each multicore processor (chip) as a single processor capable of executing more than one task. However, for the sake of simplicity, in the rest of this paper $N$ is defined considering each core as a processor (a processing element capable of processing a single task). $P$ is the set of paths. Path $p_{i,j} = (n_i, n_j)$ connects node $n_i \in N$ to node $n_j \in N$. Given a path $p_{i,j} \in P$, the function $dist(p_{i,j})$ returns the distance between nodes $n_i$ and $n_j$ as the number of hops. Multicore processors can easily be modeled by connecting all the cores of a multicore processor to the same network node.
The Mapping function, $M : T \to N$, is a bijective function which maps tasks to network nodes (e.g., if $M(t_i) = n_j$ then task $t_i$ is mapped on processor $n_j$).
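The Fitness function, $f : M \to \mathbb{R}$, assigns a real value to each mapping according to Equation 1:

$$f(M) = \sum_{c_{i,j} \in C} vol(c_{i,j}) \cdot dist\left(p_{M(t_i),\,M(t_j)}\right) \qquad (1)$$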
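As a minimal sketch of how such a fitness value could be evaluated, the Python snippet below assumes that the fitness is the communication volume of each pair of tasks weighted by the hop distance between the nodes they are mapped to, with the APCG and the TG stored as square matrices. The names `fitness`, `vol`, `dist` and `mapping`, as well as the toy three-node instance, are illustrative and are not taken from the original implementation.

```python
from typing import Sequence


def fitness(vol: Sequence[Sequence[float]],
            dist: Sequence[Sequence[int]],
            mapping: Sequence[int]) -> float:
    """Sum vol[i][j] * dist[mapping[i]][mapping[j]] over all task pairs.

    vol[i][j]  : bytes sent by task i to task j (assumed APCG matrix form)
    dist[a][b] : hops between nodes a and b (assumed TG matrix form)
    mapping[i] : node on which task i is placed
    """
    n = len(mapping)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if vol[i][j]:  # skip pairs of tasks that do not communicate
                total += vol[i][j] * dist[mapping[i]][mapping[j]]
    return total


# Toy instance: 3 tasks on a 3-node line topology (0 - 1 - 2).
vol = [[0, 10, 0],
       [0, 0, 5],
       [2, 0, 0]]
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]

sequential = [0, 1, 2]   # task i on node i
swapped = [1, 0, 2]      # tasks 0 and 1 exchanged

print(fitness(vol, dist, sequential))  # 10*1 + 5*1 + 2*2
print(fitness(vol, dist, swapped))     # 10*1 + 5*2 + 2*1
```

In this toy case the sequential mapping is the cheaper of the two, because the swap moves task 1 two hops away from task 2.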
Therefore, the goal is to find a mapping function, M, such that f(M) is minimized. Since this problem is an instance of the Quadratic Assignment Problem (QAP), which is NP-hard, a multi-start local search method has been used to solve it, for both the off-chip and the on-chip environment [6], [7]. The pseudo-code of the multi-start local search algorithm is shown in Figure 1. First, the APCG and the TG, represented as matrices, are loaded (line 1). After that, the sequential mapping (i.e., a mapping where task $t_i$ is mapped on processor $n_i$) is evaluated and saved as the best mapping found so far (lines 2-3). Then, the method creates a set of initial random mappings (line 5), which are explored using the local search procedure (line 6) described in Figure 2. Once the local search procedure ends, the resulting fitness value is compared with the best fitness value obtained so far. If the former improves on the latter, the mapping and the fitness value returned by the local search procedure are saved as the best ones; otherwise, nothing is saved (lines 7-10). The process is repeated until every initial mapping has been explored (lines 4-11). Finally, the best mapping found and its corresponding fitness value are returned (line 12).
The LocalSearch() procedure performs a local search in the neighborhood, N(), of a solution mapping x as follows. First, a neighbor y is extracted from the neighborhood of x (line 3). After that, the fitness values of both x and y are calculated and compared (line 4). If the fitness value of y, f(y), is smaller than the fitness value of x, f(x), then y becomes the new solution mapping (line 5). The local search stops when the fitness value of the current solution mapping has not improved for a given number of iterations, numIter (lines 2-10). The procedure returns the best mapping found in the local search started from mapping x, together with its associated fitness value. In this method, the neighborhood of a mapping x is the set of mappings that can be obtained from x by swapping two of its elements.
Algorithm Multi-Start_LocalSearch(numSeeds, numIter)
/* Inputs: */
/*   numSeeds - number of initial mappings */
/*   numIter - number of iterations without improvement */
/* Outputs: */
/*   bestMapping - best mapping found */
/*   bestFitness - best fitness function value found */
begin
 1.  loadInputData()
 2.  bestMapping ← createSequentialMapping()
 3.  bestFitness ← f(bestMapping)
 4.  for i = 1 to numSeeds do
 5.      x ← random initial mapping
 6.      y ← LocalSearch(x)
 7.      if f(y) < bestFitness then
 8.          bestMapping ← y
 9.          bestFitness ← f(y)
10.      end if
11.  end for
12.  return bestMapping and bestFitness
end

Procedure LocalSearch(x)
/* Inputs: x - initial mapping */
/* Outputs: */
/*   x - best mapping found */
/*   f(x) - fitness of the best mapping found */
/* N() - set of neighbors for a given mapping */
/* f() - fitness function */
begin
 1.  iter = 0
 2.  while iter < numIter do
 3.      y ← neighbor selected from N(x)
 4.      if f(y) < f(x) then
 5.          x ← y
 6.          iter = 0
 7.      else
 8.          iter = iter + 1
 9.      end if
10.  end while
11.  return x and f(x)
end
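The two procedures described above can be sketched in Python as follows. This is a sequential illustration only, under several assumptions: the fitness is taken to be volume times hop distance, a neighbor is drawn by swapping two randomly chosen positions, and the iteration counter is reset whenever an improvement is found. All function and variable names are hypothetical.

```python
import random


def fitness(vol, dist, m):
    """Assumed fitness: communication volume weighted by hop distance."""
    n = len(m)
    return sum(vol[i][j] * dist[m[i]][m[j]] for i in range(n) for j in range(n))


def local_search(vol, dist, x, num_iter, rng):
    """Swap-based local search.

    Stops after num_iter consecutive neighbors that do not improve x.
    """
    x = list(x)
    fx = fitness(vol, dist, x)
    it = 0
    while it < num_iter:
        i, j = rng.sample(range(len(x)), 2)   # two distinct positions
        y = list(x)
        y[i], y[j] = y[j], y[i]               # neighbor: one swap away from x
        fy = fitness(vol, dist, y)
        if fy < fx:
            x, fx = y, fy
            it = 0                            # improvement: restart the count
        else:
            it += 1
    return x, fx


def multi_start_local_search(vol, dist, num_seeds, num_iter, seed=0):
    """Run local_search from num_seeds random mappings and keep the best."""
    rng = random.Random(seed)
    n = len(vol)
    best = list(range(n))                     # sequential mapping
    best_f = fitness(vol, dist, best)
    for _ in range(num_seeds):
        x = list(range(n))
        rng.shuffle(x)                        # random initial mapping
        y, fy = local_search(vol, dist, x, num_iter, rng)
        if fy < best_f:
            best, best_f = y, fy
    return best, best_f


# Toy instance: 4 tasks on a 4-node ring; tasks 0<->2 and 1<->3 communicate.
dist = [[min(abs(a - b), 4 - abs(a - b)) for b in range(4)] for a in range(4)]
vol = [[0, 0, 8, 0],
       [0, 0, 0, 8],
       [8, 0, 0, 0],
       [0, 8, 0, 0]]

best, best_f = multi_start_local_search(vol, dist, num_seeds=5, num_iter=20)
print(best, best_f)
```

Because the sequential mapping is used as the initial best solution, the returned fitness can never be worse than that of the sequential mapping.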
A. Parallel Platforms
One of the parallel platforms used for executing the local search algorithm is a cluster of ten computers based on AMD Opteron processors (2 x 1.56 GHz) with 3.84 GB of RAM, running the Linux 2.6.9-1 operating system. The other parallel platform is a Bull X R425E2 server with an NVIDIA Tesla C2070 graphics card. The R425E2 server includes two Intel Xeon E5620 2.4 GHz quad-core processors, each one with its own Tylersburg I/O controller, 24 GB of DDR3 (1333 MHz) DRAM, and a 500 GB SATA hard disk. The Tesla C2070 is a graphics card based on the "Fermi" architecture [11], with 448 CUDA cores and 6 GB of GPU device memory.

All NVIDIA GPU platforms from the G80 architecture onwards can be programmed using the CUDA programming model, which makes the GPU operate as a highly parallel computing device [12]. Each GPU device is a scalable processor array consisting of a set of SIMT (Single Instruction Multiple Threads) Streaming Multiprocessors (SMs), each of them containing several stream processors (SPs).

Different memory spaces are available in each GPU on the system. Figure 3 shows the organization of the memory spaces. The global memory (also called device or video memory) is the only space accessible by all multiprocessors. It is the largest and the slowest memory space, and it is private to each GPU on the system. Moreover, each multiprocessor has its own private memory space, called shared memory. The shared memory is smaller than the global memory and has a lower access latency. In addition, there are other address spaces for specific purposes, such as texture and constant memory [12]. Although this memory map is common to all GPUs, NVIDIA has introduced a cache hierarchy that gives the programmer some choice over how the data are mapped.
In this sense, the new memory model in the Fermi architecture implements a single unified memory request path for loads and stores, with an L1 cache per SM and a unified 768 KB L2 cache, common to all SMs, that services all operations (load, store and texture). Figure 4 illustrates this model. The per-SM L1 cache is configurable to support both shared memory and caching of local and global memory operations: the 64 KB of memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved performance over direct access to device memory.
The CUDA programming model is based on a hierarchy of abstraction layers. The thread is the basic execution unit, and it is mapped to a single SP. A thread-block is a batch of threads which can cooperate because they are assigned to the same multiprocessor, and therefore share all the resources included in this multiprocessor, such as the register file and the shared memory. A grid is composed of several thread-blocks which are uniformly distributed and scheduled among all multiprocessors. Thread-blocks are not executed in any particular order, so they run in a Multiple Instruction Multiple Data (MIMD) fashion. Finally, the threads included in a thread-block are divided into batches of 32 threads called warps. The warp is the scheduling unit, so the threads of the same thread-block are scheduled on a given multiprocessor warp by warp. The 32 threads in a warp execute the same instruction over multiple data (SIMD). The programmer declares the number of thread-blocks, the number of threads per thread-block and their distribution in order to arrange parallelism given the program constraints (i.e., data and control dependencies).
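The relationship between thread-blocks and warps can be made concrete with a little arithmetic. The following Python sketch only illustrates the grouping described above for a one-dimensional thread-block; the helper names `warps_per_block` and `warp_and_lane` are hypothetical and are not part of the CUDA API.

```python
from typing import Tuple

WARP_SIZE = 32  # threads per warp, as stated in the text


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to hold a thread-block.

    The last warp is only partially filled when the block size
    is not a multiple of WARP_SIZE.
    """
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def warp_and_lane(thread_idx: int) -> Tuple[int, int]:
    """Warp index and position inside the warp for a thread of a 1-D block."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE


for threads in (16, 512, 1000):
    print(threads, "threads ->", warps_per_block(threads), "warp(s)")

print(warp_and_lane(70))  # thread 70 sits in the third warp (index 2)
```

Note that a block with fewer than 32 threads still occupies one full warp, so some of the lanes of that warp remain idle.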
B. MPI implementation
The MPI implementation has been designed following a master-slave scheme, where a master process sends tasks to the slave processes and receives results from them. Concretely, a process P0 loads the data required for solving the problem and broadcasts them to the other processes. The required data are the APCG and the TG (stored as matrices), the maximum number of iterations that can be performed within the local search without improving the fitness function value, and the number of seeds (the number of executions of the local search method). Each execution starts from a random mapping (the seed). Figure 5 shows the block diagram of the algorithm proposed for the MPI implementation. The master process loads the data and sends the needed information to the other processes in order to search for a good solution. Each process (including the master) generates and evaluates a random mapping (seed). Then, in each iteration, a new neighbor of the mapping from the previous iteration (in iteration 0, the previous mapping is the seed) is generated and evaluated. If the new mapping provides a better fitness function value than the previous one, then it becomes the new "best mapping". Finally, a reduction operation is performed when the maximum number of iterations has been reached, and P0 selects the best mapping among those of all the processes.
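The structure of this scheme (broadcast, independent searches, final reduction) can be illustrated without an MPI runtime. The sketch below simulates the ranks inside a single Python process; it is not the original implementation, and a real version would typically use `MPI_Bcast` for the input data and `MPI_Reduce` with `MPI_MINLOC` for the final selection. The function names, the per-rank seeding and the choice of always taking a neighbor of the current best mapping are assumptions made for illustration.

```python
import random


def fitness(vol, dist, m):
    """Assumed fitness: communication volume weighted by hop distance."""
    n = len(m)
    return sum(vol[i][j] * dist[m[i]][m[j]] for i in range(n) for j in range(n))


def rank_search(rank, vol, dist, max_iter, base_seed):
    """Work of one simulated process: a random seed mapping, then max_iter swaps."""
    rng = random.Random(base_seed + rank)     # a different random stream per rank
    best = list(range(len(vol)))
    rng.shuffle(best)                         # the seed mapping of this rank
    best_f = fitness(vol, dist, best)
    for _ in range(max_iter):
        i, j = rng.sample(range(len(best)), 2)
        cand = list(best)
        cand[i], cand[j] = cand[j], cand[i]   # neighbor of the current best
        cand_f = fitness(vol, dist, cand)
        if cand_f < best_f:                   # keep only improvements
            best, best_f = cand, cand_f
    return best_f, rank, best


def simulate(num_procs, vol, dist, max_iter, base_seed=0):
    # "Broadcast": every rank receives the same vol, dist and max_iter.
    results = [rank_search(r, vol, dist, max_iter, base_seed)
               for r in range(num_procs)]
    # "Reduction": keep the smallest fitness; ties go to the lowest rank.
    return min(results), results


# Toy instance: 4 tasks on a 4-node ring.
dist = [[min(abs(a - b), 4 - abs(a - b)) for b in range(4)] for a in range(4)]
vol = [[0, 0, 8, 0],
       [0, 0, 0, 8],
       [8, 0, 0, 0],
       [0, 8, 0, 0]]

winner, results = simulate(num_procs=4, vol=vol, dist=dist, max_iter=30)
print("best fitness", winner[0], "found by rank", winner[1], "mapping", winner[2])
```

Returning the tuple `(fitness, rank, mapping)` mirrors a value-plus-location reduction: the comparison is decided by the fitness first and by the rank in case of a tie.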
C. GPU implementation
Two approaches can be followed in order to implement a local search method for the topological mapping problem on the GPU. One of them consists of implementing the whole search procedure on the GPU. The other approach consists of a hybrid implementation where the CPU controls the iterations over the search space and the GPU is in charge of generating the neighborhood, evaluating it and selecting the best neighbor in each iteration. Good programming practices advise reducing CPU-GPU synchronization [12]. However, in our case the implementation of the whole search procedure on the GPU requires so many resources per SM (i.e., registers and shared memory) that the level of parallelism is severely limited. Figure 6 shows the block diagram of the implemented solution. The algorithm requires certain configuration parameters, such as the number of seeds (from 1 to the number of CUDA blocks), the maximum number of iterations, the number of CUDA threads (between 1 and 1024), the number of CUDA blocks, and both the APCG and the TG. This implementation first copies the problem data to the GPU memory, and then it generates the initial (random) mappings on the CPU and copies them to the GPU memory. At this point, the local search is performed in parallel by the GPU blocks for all the initial mappings. When the search has finished, each block copies its best mapping (and its corresponding fitness function value) back to the CPU, which is in charge of selecting the best one among all the best mappings provided by the GPU blocks.
The algorithm is designed to work with a number of seeds less than or equal to the number of CUDA blocks. For simplicity, both values are equal in the tests performed. The algorithm can also work with a number of threads per block smaller than the number of processors in the problem, N. For performance reasons, if N is lower than the maximum number of CUDA threads per block (512 or 1024, depending on the CUDA Compute Capability of the GPU), then the number of threads should be equal to the number of processors. Note that, although Figure 6 may suggest that most of the code is executed on the CPU, this is not the case, since the "Local Search" block requires the highest computing power (it is in charge of calculating the fitness function).
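These configuration rules can be summarized in a few lines of Python. The sketch below is only an illustration: `launch_config` and `positions_of_thread` are hypothetical helpers, and the strided assignment of mapping positions to threads when there are fewer threads than processors is an assumption (a common CUDA idiom), not something stated in the text.

```python
def launch_config(num_seeds, num_processors, max_threads=1024):
    """Return (blocks, threads_per_block) for one local search per block.

    One block per seed, as in the tests described in the text, and as many
    threads as processors, capped by the device limit max_threads.
    """
    if num_seeds < 1 or num_processors < 1:
        raise ValueError("num_seeds and num_processors must be positive")
    blocks = num_seeds
    threads = min(num_processors, max_threads)
    return blocks, threads


def positions_of_thread(tid, threads, n):
    """Mapping positions handled by thread tid (assumed strided assignment)."""
    return list(range(tid, n, threads))


print(launch_config(32, 512))                     # N below the limit
print(launch_config(8, 2048, max_threads=1024))   # N above the limit

# With fewer threads than processors, each thread handles several positions.
blocks, threads = launch_config(8, 2048, max_threads=1024)
print(positions_of_thread(5, threads, 2048))
```

With this assignment every mapping position is handled by exactly one thread, regardless of whether N is a multiple of the number of threads.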
For the seed generation block, we developed a solution which creates the seeds directly in GPU code by means of the NVIDIA cuRAND library. We also developed an alternative in which the seeds are created on the CPU and copied to the GPU memory. The performance tests showed that the CPU version was significantly faster than the GPU version.
It is important to emphasize that the "Local Search" block includes tasks such as copying a mapping (a vector of N elements) from one memory location to another, as well as computing the fitness function value for each generated mapping. In the algorithm executed on the GPU, these functions are performed in parallel by the threads of a CUDA block, while the CPU version does not obtain this degree of parallelism. Each GPU thread copies its own mapping position in parallel, whereas in the CPU case each MPI process has to iterate over the N mapping positions whenever a mapping is copied. Regarding the calculation of the fitness function, in the GPU implementation it can be decomposed into as many sub-calculations as there are threads in the CUDA block, which yields an extra level of SIMD parallelism that is not available in the CPU implementation.
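The decomposition of the fitness calculation among the threads of a block can be sketched as follows. The Python code below emulates the threads sequentially, so it only shows how the work could be split and recombined: each emulated thread accumulates a partial sum over its own rows of the communication matrix, and the partial sums are then combined with a tree reduction, which on a GPU would normally be carried out in shared memory with `__syncthreads()` between steps. The row-wise split, the reduction scheme and all names are assumptions for illustration, not the original kernel.

```python
def block_fitness(vol, dist, mapping, num_threads):
    """Fitness computed as num_threads partial sums plus a tree reduction."""
    n = len(mapping)
    partial = [0.0] * num_threads             # stands in for shared memory

    # Phase 1: each iteration of the outer loop plays the role of one thread.
    for t in range(num_threads):
        for i in range(t, n, num_threads):    # rows t, t + T, t + 2T, ...
            for j in range(n):
                partial[t] += vol[i][j] * dist[mapping[i]][mapping[j]]

    # Phase 2: pairwise (tree) reduction of the partial sums.
    stride = 1
    while stride < num_threads:
        for t in range(0, num_threads, 2 * stride):
            if t + stride < num_threads:
                partial[t] += partial[t + stride]
        stride *= 2
    return partial[0]


def sequential_fitness(vol, dist, mapping):
    """Reference value computed by a single loop nest."""
    n = len(mapping)
    return sum(vol[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))


# Toy instance: 4 tasks on a 4-node ring.
dist = [[min(abs(a - b), 4 - abs(a - b)) for b in range(4)] for a in range(4)]
vol = [[0, 3, 8, 0],
       [1, 0, 0, 8],
       [8, 0, 0, 2],
       [0, 8, 5, 0]]
mapping = [2, 0, 3, 1]

for threads in (1, 2, 3, 4):
    print(threads, block_fitness(vol, dist, mapping, threads))
```

Whatever the number of emulated threads, the result matches the sequential value; only the way the work is split changes.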
Therefore, we can conclude that the GPU version adds a new level of parallelism on top of the level at which CUDA blocks and MPI processes are equivalent.
IV. PERFORMANCE EVALUATION
In this section, we present a comparative study of the results obtained by both the GPU-based implementation and the MPI implementation of the local search method. The GPU-based implementation has been executed on a single NVIDIA C2070 GPU with a theoretical single-precision performance of 1030 GFLOPS. The GPU kernel parameters have been tuned in order to find the best configuration for each problem size. To this end, we have performed executions with a wide range of parameter values, ranging from 8 blocks to 32 blocks, and from 4 threads per block to 16 threads per block. We have considered problem sizes ranging from 64 cores (16 quad-core processors) to 1024 cores (256 quad-core processors). However, due to space limitations, we show here only the results for a representative problem size, consisting of mapping 512 tasks onto 64 eight-core processors (512 cores).
The study considers different numbers of seeds and neighborhood sizes. The neighborhood size is given as the number of iterations that can be performed without improving the fitness value of the current solution computed by the LocalSearch() procedure. For each configuration (i.e., number of seeds and neighborhood size) we have carried out two different measurements: the best fitness values provided and the required execution times. All these measurements have been obtained as the average values provided by ten independent runs of each algorithm. Since the task mapping algorithm should be executed within a limited time, we have limited the neighborhood size (number of iterations to be executed by both implementations) to 30 iterations per seed. Instead, we have tested the same problem with different numbers of CUDA blocks (in the case of the GPU implementation) and different numbers of MPI processes (in the case of the MPI implementation), in order to test the scalability of the considered implementations.

Figure 7 shows the best fitness values obtained by each parallel implementation. It shows on the X-axis the number of iterations considered, and on the Y-axis the best fitness values. This figure shows no clear trend in either of the plots. Depending on the number of processes or CUDA blocks used, the fitness function values obtained are significantly different, but there is no clear region of the X-axis where the best values are provided. Also, neither of the implementations provides clearly better behavior. Therefore, we can conclude that both implementations provide mappings of similar quality when the depth of the search is bounded.

Figure 8 shows the execution times required for a bigger problem size (1024 tasks to be mapped onto 256 quad-core processors), with the scale of the Y-axis reduced by two orders of magnitude.
This figure clearly shows that, for all the numbers of MPI processes considered, the GPU implementation requires significantly shorter execution times, and that this difference increases with the parallelism in the system (the number of CUDA blocks and/or the number of MPI processes). For instance, the time required by the MPI implementation is twice the time required by the GPU implementation when 96 CUDA blocks and MPI processes are used, respectively.
V. CONCLUSIONS
In this paper, we have proposed a comparative study of parallel implementations of the local search method used in both task mapping and topological mapping on different architectures with known theoretical performances. Unlike other comparative studies of heuristic methods implemented on GPUs, we compare the actual performance provided by the parallel version for GPUs with the actual performance provided by the MPI parallel version executed on a cluster computer. The performance evaluation results show that the GPU implementation provides mappings of similar quality to those of the MPI implementation, but the execution times required for providing these solutions are significantly shorter than the ones required by the MPI implementation. Moreover, the difference in execution times increases with the size of the parallel system. Therefore, these results validate the GPU implementation as a scalable and very cost-effective accelerator for parallel implementations of Communication-Aware Task Mapping Techniques.
