The objective of this work is to exploit advances in GPU technology within a state-of-the-art software framework. We have analyzed the existing map-reduce (MR) framework and modified it for new GPU architectures. We have identified some significant possibilities for improvement, mainly in the context of the GPU architectures that were introduced after the MR framework was developed. Our experiments show an average speedup of 2.5x for the MR framework on these architectures. Cache reconfiguration is also investigated in this work: we have achieved performance benefits ranging from 10% to 200% for various cache sizes.
Introduction
High performance computing (HPC) is a domain whose objective is to speed up compute-intensive applications. Such applications include bio-informatics, weather forecasting, image processing, cryptography and web applications. Many hardware platforms and clusters have been developed with a variety of features and capabilities [1] [2] [3]. These platforms contain many-core as well as heterogeneous architectures. Due to power and temperature related issues, increasing the CPU frequency to improve performance is not a viable option. Possible alternatives are many-core processors, specialized accelerators and re-configurable architectures [4]. Thus, mapping computation-intensive tasks to either dedicated hardware or highly parallel Graphics Processing Units (GPUs) may be needed to reach the performance levels required by many applications within the given power constraints.
A kernel is the part of an application where a large amount of data-level parallelism exists and which contributes most to the overall execution time of the application. Mapping kernels to an accelerator is an iterative process before a near-optimal mapping is obtained. Map-Reduce (MR) can relieve the programmer of the burden of application-specific mapping because of its generic and abstract mapping interface. MR is dedicated to processing large distributed data sets and is mainly used for web applications [5]. It is becoming popular for general-purpose computing on GPUs as well. The MR framework works in two phases: a map phase and a reduce phase. The map phase is responsible for executing the kernel in parallel on all cores of the GPU. The reduce phase may not have as many instances as the map phase, but these instances are responsible for accumulating the output of the map phase into the final result. All inputs and outputs of the map and reduce phases are in the form of key/value pairs, which makes the model quite generic. The MR framework provides a very easy programming platform for the GPGPU programmer: it helps with thread configuration and data communication, and it provides a uniform interface structure irrespective of the application [6]. This benefit, however, comes at the cost of a small performance penalty. The motivation behind this work is to get the benefit of MR on emerging GPU architectures while minimizing, if not eliminating, the performance penalty.
Our study and analysis presents the comparative merits and demerits of the MR framework for different GPU architectures. We have studied and experimented with the MARS MR framework [6], developed for traditional GPUs, before exploring its behavior on the Fermi architecture. We have compared MARS with a CUDA-based implementation (onlyCUDA) for eight different applications. Our work reports the performance penalty incurred when the MR framework is introduced. The other contributions of this work are the following:
1. We have analyzed the execution time of kernel functions on current GPUs. This helps to identify the performance bottleneck.
2. We have identified the new GPU architecture features with a view to speeding up execution of the MR framework.
Related work
Due to power and temperature related issues, increasing CPU frequency beyond a limit to improve performance is not a viable option. Possible alternatives for increasing performance are many core processors, specialized accelerators and re-configurable architectures. FPGAs and GPUs have been employed for accelerating general purpose computing apart from embedded applications and graphics applications respectively [7] [8].
The first version of MR was designed by Dean et al. [5]. Their paper explains the different phases and the data movement across the distributed computing environment. One of the MR frameworks for GPUs was released by Bingsheng He et al. [6] in 2008. That work was compared with the Phoenix framework for CPUs and with a CPU-based implementation, obtaining maximum speedups of 5.5x and 7x respectively. In the same stream of research, Stuart et al. [9] presented their work in 2010, which focused on volume rendering using MR with multiple GPUs. They used an Accelerator Cluster (AC) for their experiments, in which multiple Tesla C1060 GPUs are connected to form clusters. In continuation of this work, Stuart et al. [10] released a library, named GPMR, for MR on multi-GPU clusters. They demonstrated how MR tasks are easily modified to fit into GPMR and leverage a GPU cluster. More recently, Feng Ji et al. [11] tried to exploit shared memory in the MR framework. They used shared memory to buffer the input/output data, with the objective of reducing traffic to global memory. All of the above research was carried out on older GPUs. The new GPU architectures, Fermi and Kepler, come with a large number of new features that are not considered in these MR frameworks.
GPU architectures
NVIDIA announced a new generation of GPUs in 2009 and named it Fermi [12]. A configurable L1 cache and a unified L2 cache are the main components of the new memory subsystem. The L1 cache can be configured as either 48KB or 16KB; the total space of L1 and shared memory is 64KB, so the remainder serves as shared memory.
Fermi GPUs have compute capability 2.x. The CUDA cores of the GPU are distributed over a number of streaming multiprocessors (SMs). Each CUDA core has integer and floating-point units. Cores within an SM are further grouped into two execution units. GigaThread [12] is the two-level thread scheduler in Fermi. At the first level it schedules blocks to the SMs, and at the second level the warps of those blocks are scheduled on the SM's execution units. An instruction for a warp is executed in two steps: in the first step half of the warp is scheduled, and in the second step the other half of the warp executes that instruction. Therefore, an integer or floating-point arithmetic instruction takes two clock cycles to execute for all threads in a warp. Features of the 9800GT, GTX590, Quadro600 and GTX690 [12] [13] [14] are shown in Table 1. Kepler is the next-generation GPU architecture from NVIDIA, launched in 2012 [13]. It has features similar to the Fermi architecture but is richer in resources. In addition, it has the dynamic parallelism feature. The GTX690 is one example of the Kepler architecture. We have obtained similar results on the GTX590 and GTX690.
NVIDIA's CUDA C is a C-like programming environment for writing general-purpose applications for GPUs. We have used CUDA 4.0 to develop the test cases and applications in this research; these are required to test as well as validate the results.
Brief overview of MARS
We have used the MARS MR framework [6] to analyze the behavior of different GPUs. The important phases of MARS are mapperCount, prefixSum, mapper, group, reducerCount and reducer. Traditional GPUs did not support dynamic memory allocation, and the output sizes of the map and reduce phases of the MR framework are not known in advance. The size of the output data from mapper and reducer is therefore calculated using the mapperCount and reducerCount functions respectively. prefixSum totals the sizes given by the mapperCount or reducerCount functions. The group phase is invoked after mapper if the application requires it; its purpose is to sort the keys of the map phase. The kernel to be accelerated is implemented in the mapper function. The reducer function is responsible for accumulating the output of map and generating the final output. Figure 1 is an abstract view of MARS.
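The count, prefix-sum and write sequence described above can be illustrated with a small sequential sketch. This is a Python model of the idea only, not MARS code: the word-splitting mapper, the helper `exclusive_prefix_sum` and the driver `run_map_phase` are hypothetical names chosen for illustration, and on a GPU each record would be handled by its own thread.

```python
from itertools import accumulate

def mapper_count(record):
    """Pass 1: report how many output pairs this record will emit."""
    return len(record.split())

def exclusive_prefix_sum(counts):
    """Return each record's starting offset and the total output size."""
    running = [0] + list(accumulate(counts))
    return running[:-1], running[-1]

def mapper(record, out, offset):
    """Pass 2: write (key, value) pairs into the slots reserved for this record."""
    for i, word in enumerate(record.split()):
        out[offset + i] = (word, 1)

def run_map_phase(records):
    counts = [mapper_count(r) for r in records]       # mapperCount
    offsets, total = exclusive_prefix_sum(counts)     # prefixSum
    out = [None] * total                              # single allocation up front
    for record, offset in zip(records, offsets):      # mapper
        mapper(record, out, offset)
    return out
```

Because every record knows its offset before writing, no two records ever write to the same slot, which is what allows the output buffer to be allocated once before the mapper runs.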
Data accesses and flows across the phases are in the form of key/value pairs. Structure of the key/value pair is decided by the user and it varies from application to application e.g. in matrix multiplication, key is the pointer to 
Overview of application
We have analyzed and experimented with eight different applications on different architectures. One of them, image smoothing, was implemented with a view to getting a feel for the ease of using MARS for new applications. Image smoothing is an application taken from the image processing domain. Each element of the input array represents a pixel of an image. The corresponding pixel in the output array is computed as the average of nine pixels: the center pixel together with its eight neighboring pixels. This results in a smoother image in the output array. The map function takes a pointer to the input array as the key and the pixel coordinates as the value. The output of the map function is the average value of the pixels in the neighborhood, together with the corresponding coordinates.
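A minimal sequential sketch of the smoothing map function is given below in Python. It is an illustration under stated assumptions rather than the implementation used in the experiments: the names `smooth_map` and `smooth` are hypothetical, and the treatment of border pixels (averaging only the neighbors that exist) is an assumption, since the text does not specify it.

```python
def smooth_map(image, row, col):
    """Map step for one pixel: key = the image, value = (row, col).

    Returns the coordinates together with the neighborhood average.
    """
    rows, cols = len(image), len(image[0])
    total = 0
    count = 0
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            # Assumption: neighbors outside the image are simply skipped.
            if 0 <= r < rows and 0 <= c < cols:
                total += image[r][c]
                count += 1
    return (row, col), total / count

def smooth(image):
    """Apply the map step to every pixel (one GPU thread per pixel in practice)."""
    out = [[0.0] * len(image[0]) for _ in image]
    for row in range(len(image)):
        for col in range(len(image[0])):
            (r, c), average = smooth_map(image, row, col)
            out[r][c] = average
    return out
```

Each output pixel depends only on the input array, so all map instances are independent and no reduce step is needed for this application.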
Experiments and analysis
We have performed different sets of experiments to compare the architectures, explore MARS and study the impact of the Fermi architecture. An NVIDIA 9800GT is taken as a representative of conventional GPUs, whereas the Quadro600 and GTX590 are taken as typical members of the Fermi family. The results of eight different applications are shown in this section. The x-axis of each graph represents the size of the data used in the application. The maximum size was constrained by the size of the global memory on the GPU card. We have performed the following three sets of experiments.
onlyCUDA execution
To find out the benefit of using MARS, we have implemented four applications without MARS. This implementation is referred to in this paper as onlyCUDA. The onlyCUDA method uses CUDA C as the implementation language, which is the most common way to port an application to NVIDIA GPUs for general-purpose computing. We have ported the string matching, image smoothing, matrix multiplication and page view count applications to different GPUs. Figure 2 shows the comparison graphs between the 9800GT and GTX590. The results show that the GTX590 is 1.4x to 6x faster than the 9800GT.
MARS execution
In this experiment, the applications are ported to the 9800GT, Quadro600 and GTX590 GPUs with the help of the MARS MR framework. Figure 3 shows the execution times for eight different applications. The Quadro600 is on average 1.8x slower than the 9800GT. On the other hand, the GTX590 is on average 5x faster than the Quadro600 and on average 2.5x faster than the 9800GT. Variations in the architecture have both positive and negative impacts on performance. These variations are explained in the following sub-sections.
Number of SMs
The Quadro600 GPU has only two SMs, while the 9800GT and GTX590 have fourteen and sixteen SMs respectively. The impact of these resources can be demonstrated by analyzing the results of one of the applications. Consider the MM application: the maximum number of blocks launched is 16384. Table 2 shows that, with respect to the 9800GT, the degree of parallelism is reduced by a factor of seven on the Quadro600 and marginally increased on the GTX590. Speed degraded by an average of only 1.8x on the Quadro600, whereas the speedup gain was an average of 2.4x on the GTX590, both with respect to the 9800GT.
Figure 5. Effect of L1 cache re-configuration
Memory Bandwidth
The GTX590 has more than six times the memory bandwidth of the Quadro600, while the 9800GT has only two times. Memory bus width and memory clock rates vary across the architectures. The MARS MR framework has many kernels that are highly memory sensitive; it therefore gives better results on the GTX590 Fermi architecture. In the group phase, MARS uses the bitonic sort algorithm as its parallel sort. Bitonic sort has a parallel butterfly structure and sequential stages. Memory-level and thread-level parallelism can hide most of the memory latencies [15], but this latency hiding is limited by the barrier at the end of each stage. In other words, the barrier leads to poor overlap between computation time and memory access time: processors cannot start executing the next stage until the current stage has finished. In this situation the architecture with higher memory bandwidth wins the performance race. Figure 4 shows that the group phase has a better speedup than the overall application for all applications except WC. In WC, group time is 92% of the overall time, so the group speedup is nearly the same as the overall speedup.
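The stage structure of bitonic sort, and the barrier that separates consecutive sweeps, can be seen in the following sequential Python model. It is a textbook bitonic sorting network and not the MARS kernel; the function name `bitonic_sort` and the returned barrier count are introduced here purely for illustration. On a GPU, the inner loop over `i` would be executed by parallel threads.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence.

    Returns (sorted list, number of sweeps that each end in a barrier).
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    barriers = 0
    k = 2
    while k <= n:            # stages run one after another
        j = k // 2
        while j >= 1:        # sweeps within a stage
            # All compare-exchanges in this sweep touch disjoint pairs,
            # so they are independent of each other.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            barriers += 1    # every thread must finish before the next sweep
            j //= 2
        k *= 2
    return data, barriers
```

For n elements there are log2(n)(log2(n)+1)/2 sweeps, and each sweep reads and writes the whole array, which is consistent with the observation above that memory bandwidth dominates this phase.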
L1 re-configuration effect
The L1 cache of Fermi GPUs can be configured in three different ways. L1 can be switched OFF at compile time and/or the size of L1 can be set to 16KB or 48KB at run time. The default L1 size is 16KB. The run-time setting is made before a kernel is launched and stays in effect until the kernel exits. The setting also applies to subsequent kernel launches until a new setting is imposed. The effect of L1 configuration is shown in Figure 5. Performance varies by 10% to 200% across configurations and applications. L1 with 16KB gives the worst performance, while the 48KB and OFF configurations give nearly equal performance boosts. One possible reason is that conflict misses are reduced significantly for 48KB with respect to 16KB. Table 3 shows the results from computeprof for two sizes of L1 cache in MM. With increasing data size, the hit ratio falls from 95% to 85% for 48KB, whereas it falls from 79% to 39% for 16KB. We refer to the L1 configuration with 16KB size as the with L1 case and to bypassing L1 as the OFF L1 case.

matrix size    16KB miss    16KB hit    48KB miss    48KB hit
128                 2538        9750          523       11765
256                27260       71044         6010       92294
512               376911      421809        49864      711992
1024             3657754     2756582       524669     5766787
2048            30764840    19763416      7193975    43137673

The OFF L1 case also boosts performance: it performs 10% to 25% better than the 16KB L1 configuration. The service latency increases at each level from the core to DRAM. The approximate latencies (hit cycles) measured here are 50, 250 and 1000 cycles at L1, L2 and DRAM respectively. We have used the clock() function to measure these latencies. Expressions 1 and 2 are used to calculate the average memory access time of the multi-level memory hierarchy.
Figure 6. Effect of miss rate at L1 and L2 on average memory access cycles
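The hit ratios quoted for MM can be recomputed directly from the raw miss and hit counts of Table 3. The short Python sketch below does only that arithmetic; the dictionary names and the helper `hit_ratios` are hypothetical, and the counts are copied from the table.

```python
# Raw L1 counts for MM copied from Table 3: matrix size -> (miss, hit).
L1_16KB = {
    128: (2538, 9750),
    256: (27260, 71044),
    512: (376911, 421809),
    1024: (3657754, 2756582),
    2048: (30764840, 19763416),
}
L1_48KB = {
    128: (523, 11765),
    256: (6010, 92294),
    512: (49864, 711992),
    1024: (524669, 5766787),
    2048: (7193975, 43137673),
}

def hit_ratio(miss, hit):
    """Fraction of accesses served by L1."""
    return hit / (miss + hit)

def hit_ratios(table):
    """Hit ratio for every matrix size in a Table 3 column pair."""
    return {size: hit_ratio(miss, hit) for size, (miss, hit) in table.items()}
```

Evaluating `hit_ratios` on the two dictionaries gives roughly 0.79 falling to 0.39 for 16KB and roughly 0.96 falling to 0.86 for 48KB as the matrix size grows from 128 to 2048, in line with the percentages stated in the text.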
The x-axis in Figure 6 represents the L1 miss ratio (in %) for the with L1 case, keeping the L2 miss ratio constant at 50%. The same axis represents the L2 miss ratio (in %) for the OFF L1 case. The 50% L2 miss ratio is the average miss ratio observed from computeprof for different applications coded in onlyCUDA. The choice between the with L1 and OFF L1 cases depends on the application. Applications like MM, PVC and IS have a higher miss ratio at L1, so they fall above the cross-over point. These applications perform better in the OFF L1 case than in the with L1 case.
For an application like SM, the results are as expected: the 48KB case gives the best performance and the OFF L1 case the worst. In this application, the data size accessed in each memory request is very small. Each thread processes a chunk of data, and the SM kernel scans the chunk character by character. This means that a larger number of memory requests is generated per thread, and a very high hit ratio (>95%) is achieved even for the 16KB configuration. This application falls below the cross-over point of the two graphs in Figure 6. The map function is executed twice in order to compute the size of the output arrays, which further increases the memory traffic. Therefore L1 with sizes 16KB and 48KB performs better than bypassing L1. We can thus say that the size of each memory access and the density of memory accesses are the key deciding factors in selecting the proper configuration of the L1 cache. Our performance metrics and results help in selecting the best configuration for an implementation.
Enhancement in MARS
We have identified and evaluated three techniques for performance enhancement in the MARS MR framework. The objective of these techniques is to benefit from new GPU technologies wherever possible. We have identified the group phase as the biggest contributor to execution time. In the other techniques we have tried to refine algorithms to make them more cache sensitive. We discuss each of these techniques in the following sections.
Code restructuring
The applications are launched without any adaptation to the architectural features of new GPUs. As a consequence, cache-unaware data accesses generate a large number of cache misses [16]. We can expect much better performance if the applications and MR frameworks are tuned according to the principle of locality. For example, MM can be configured in two ways: row-wise access and column-wise access. In row-wise access, a row of the first matrix is multiplied with a row of the second matrix. This technique was used by MARS to get coalesced memory access on traditional GPUs. On GPUs with a cache, however, 128 bytes (the line size of L1) from a row are fetched on every L1 miss. Only one element of that line is useful to an instruction (instrx) of a thread (threadx), because the GPU executes instructions in SIMD manner, and the other threads of the warp may generate cache misses.
Figure 8. MARS enhanced by optimizing group phase
The inner filled rectangle in Figure 7(a) shows the data of the cache line fetched on a cache miss of instr0 of thread0. It also illustrates that the data for the other instructions of this thread are available, but the data for the same instruction of the other threads are not. The other elements of the line are also useful, but they will be used only after the current instruction, instrx, has finished for all threads of the warp.
In column-wise access, a row of the first array is multiplied with a column of the second array, as shown in Figure 7(b). On every cache miss, the fetched data is useful to the other threads of the same warp for the same instruction. In this way we have reduced cache misses by 32%; the counts are given in Table 4. These counts are taken from an in-house cache simulator and address trace generator. This cache simulator [16] has already been verified against DineroIV.
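The difference between the two access patterns can be made concrete by counting how many distinct 128-byte lines one warp touches for a single load instruction. The Python sketch below is an address model only, under assumptions that are not stated in the text: 4-byte elements, a line-aligned matrix starting at address 0, and consecutive threads of a warp computing consecutive output columns. All names are hypothetical.

```python
LINE_BYTES = 128   # L1 line size given in the text
ELEM_BYTES = 4     # assumption: single-precision elements
WARP = 32          # threads per warp

def lines_touched(addresses):
    """Number of distinct cache lines covered by a set of byte addresses."""
    return len({address // LINE_BYTES for address in addresses})

def row_wise_addresses(n, k, first_col):
    """Thread t reads element k of row (first_col + t) of the second matrix."""
    return [((first_col + t) * n + k) * ELEM_BYTES for t in range(WARP)]

def column_wise_addresses(n, k, first_col):
    """Thread t reads element (first_col + t) of row k of the second matrix."""
    return [(k * n + first_col + t) * ELEM_BYTES for t in range(WARP)]
```

For a 1024 x 1024 matrix, `lines_touched(row_wise_addresses(1024, 0, 0))` is 32 while `lines_touched(column_wise_addresses(1024, 0, 0))` is 1: in the column-wise layout a single fetched line serves the whole warp for that instruction.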
Group phase enhancement
The profiled data show that the group phase makes a major contribution (43% to 90%) to the overall execution time of the application. Fine-grained profiling shows that, for string comparison, the comparison part of bitonic sort takes more than 90% of the group time. Out of the eight benchmarks of the MARS framework, five applications execute the group phase. Three applications, Inverted Index (II), Word Count (WC) and Page View Count (PVC), use string comparison, while the other two, Similarity Score (SS) and Page View Rank (PVR), use number comparison. In our refinement, we have reduced the number of comparisons per thread as well as the number of memory accesses per thread. In the original string comparison algorithm of MARS, two strings are compared character by character. The total number of comparisons is therefore equal to the number of characters in the smaller string, and the same number of memory requests is generated.
In our methodology we compare two strings in packed units of four bytes at a time. We read the data as a chunk of four bytes and assign it to an integer variable, so that four characters of the string form a single integer value. This saves approximately 75% of the comparisons. The results of this methodology are shown in the following graphs. Each graph in Figure 8 shows two sets of two lines. The x-axis represents the size of the data in bytes and the y-axis represents time in milliseconds. The first set compares the group phase with and without optimization (blue and black lines respectively), and the second set compares the overall time with and without optimization (green and red respectively). The II application has smaller strings (around 8 to 12 bytes), and therefore the contribution of its group phase to the overall time is less than in the other two applications (WC and PVC). The string size in PVC and WC is around 110 bytes, and therefore the comparison contributes a larger fraction of the overall time. The overall speedup graph (in Figure 8) shows that a maximum speedup of 2x and an average speedup of 1.5x are achieved.
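The saving from packed comparison can be illustrated with the following Python sketch, which counts comparisons for both schemes. It is a model of the idea rather than the code used in MARS: the function names are hypothetical, strings are assumed to contain no NUL bytes, short tails are zero-padded, and the four bytes are packed in big-endian order so that integer order matches lexicographic order (the text does not state how the packed value is formed).

```python
def compare_bytewise(a, b):
    """Character-wise comparison. Returns (order, comparisons), order in {-1, 0, 1}."""
    comparisons = 0
    for x, y in zip(a, b):
        comparisons += 1
        if x != y:
            return (-1 if x < y else 1), comparisons
    # Common prefix is equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b)), comparisons

def compare_packed(a, b, width=4):
    """Compare `width` bytes at a time as one integer. Returns (order, comparisons)."""
    comparisons = 0
    for i in range(0, max(len(a), len(b)), width):
        # Zero-pad a short tail, mimicking a NUL terminator (assumption).
        wa = int.from_bytes(a[i:i + width].ljust(width, b"\0"), "big")
        wb = int.from_bytes(b[i:i + width].ljust(width, b"\0"), "big")
        comparisons += 1
        if wa != wb:
            return (-1 if wa < wb else 1), comparisons
    return 0, comparisons
```

For two equal 110-byte strings, the character-wise version performs 110 comparisons and the packed version 28, a reduction of about 75%, which matches the saving reported above.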
The auxiliary functions enhancement
In this section we discuss the technique for enhancing the mapperCount and reducerCount functions. These functions are treated as auxiliary functions because they are not part of the application code, but they play an important role in the map and reduce phases of MARS execution. Each function has three consecutive read and write instructions, that is, a total of six instructions in which reads and writes alternate. When a read instruction is issued by a warp, one miss can satisfy the read requirements of all other threads of the warp for this read instruction. The next instruction, however, is a write, which makes the whole line dirty, and the remaining threads of the warp then generate cache misses. Instead of alternating reads and writes, we have used delayed writing for these functions. First we read all the information into shared memory and perform the operations on shared memory. Finally we write the result to global memory.
With this mechanism, we have reduced the number of cache misses per thread from 96 to 3. We use synchronization barriers to ensure that all read operations are completed before the write operations start. Figure 9 shows the time taken by the mapperCount function in its unoptimized and optimized forms. This results in a performance improvement of 10% to 25% for these functions.
Figure 9. mapperCount function enhanced by cache sensitive coding
Conclusion
Our results clearly demonstrate that MR can be used universally for different kinds of applications with a minimal performance penalty. Engineers and developers can concentrate on improving the algorithmic parts of the application, while the deployment of algorithms on distributed architectures is taken care of by the MR framework. We have significantly reduced the performance gap between the onlyCUDA and MARS implementations of different applications. For example, the PVC application took 3200ms in the onlyCUDA implementation, while the original MARS implementation of this application took around 10,000ms. We have reduced this performance gap to 5000ms with our enhancement techniques.
