Abstract. We have addressed in this paper the implementation of redblack multigrid smoothers on high-end microprocessors. Most of the previous work about this topic has been focused on cache memory issues due to its tremendous impact on performance. In this paper, we have extended these studies taking Simultaneous Multithreading (SMT ) into account. With the introduction of SMT, new possibilities arise, which makes highly advisable a revision of the different alternatives. A new strategy is proposed that focused on inter-thread sharing to tolerate the increasing penalties caused by memory accesses. Performance results on an IBM's Power5 based system reveal that our alternative scheme can compete with and even improve sophisticated schemes based on tailored loop fusion and tiling transformations aimed at improving temporal locality.
Introduction
Multigrid methods are regarded as being the fastest iterative methods for the solution of the linear systems associated with elliptic partial differential equations, and as amongst the fastest methods for other types of integral and partial differential equations [16] . Fastest refers to the ability of Multigrid methods to attain the solution in a computational work which is a small multiple of the operation counts associated with discretizing the system. Such efficiency is known as textbook multigrid efficiency (TME) [15] and has made multigrid one of the most popular solvers on the niche of large-scale problems, where performance is critical.
Nowadays, however, the number of executed operations is only one of the factors that influences the actual performance of a given method. With the advent of parallel computers and superscalar microprocessors, other factors such as inherent parallelism or data locality (i.e. the memory access behavior of the algorithm) have also become relevant. In fact, recent evolution of hardware has exacerbated this trend since:
-The disparity between processor and memory speeds continues to grow despite the integration of large caches. -Parallelism is becoming the key of performance even on high-end microprocessors, where multiple cores and multiple threads per core are becoming mainstream due to clock frequency and power limitations.
In the multigrid context, these trends have prompted the the development of specialized multigrid-like methods [1, 2, 10, 5] , and the adoption of new schemes that try to bridge the processor/memory gap by improving locality [14, 18, 7, 4, 8] . Our focus in this paper is the extension of this cache-aware schemes to Simultaneous Multithreading (SMT) processors.
As its name suggests, SMT architectures allows several independent threads to issue instructions simultaneously in a single cycle [17] . Its main goal is to yield better use of the processor's resources, hiding the inefficiencies caused by long operational latencies such as memory accesses. At first glance, these processors can be seen as a set of logical processors that share some resources. CWith HT, the Intel Pentium 4 behaves as two logical processors sharing some resources (Functional Units, Memory Hierarchy, etc). The exploitation of this additional level of parallelism has been performed in this work by means of OpenMP directives, which are directly supported by the Intel ICC compileronsequently, one may think that optimizations targeted for Symmetric Multiprocessors (SMP ) systems are also good candidates for SMT. However, unlike SMP systems, SMT provides and benefits from fine-grained sharing of processor and memory resources. On the other hand, unlike conventional superscalar architectures, SMT exposes and benefits from thread level parallelism when hiding latencies. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate or less effective for SMT [9] .
Unfortunately, SMT potentials are not yet fully exploited in most applications due to the relative underdevelopment of compilers, which despite many improvements still lag far behind. Due to this gap between compiler and processor technology, applications cannot benefit from SMT hardware unless they are explicitly aware of thread interactions. In this paper, we have revisited the implementation of multigrid smoothers in this light. The popularity of multigrid makes this study of great practical interest. In addition, it also provides certain insights about the potential benefits of this relatively new capability and how to take advantage of it, which could ideally help to develop more efficient compiler schemes.
The organization of this paper is as follows. We begin in Sections 2 and Section 3 by briefly introducing multigrid methods and describing the main characteristics of our target computing platform respectively. In Section 4 we describe the baseline codes used in our study for validation and assessment. They are based on the DIME project (DIME stands for Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations) [3, 14, 18, 7, 4, 8] , which is one of the most outstanding and systematic studies about the optimization of multigrid smoothers. Afterwards, in Section 5, we discuss our SMT -aware implementation. Performance results are discussed in Section 6. Finally, the paper ends with some conclusions and hints for future research.
Multigrid Introduction
This section provides a brief introduction about multigrid, defining basic terms and describing the most relevant aspects of these methods so that we have a basis on which to discuss some of the performance issues.
The fundamental idea behind Multigrid methods [16] is to capture errors by utilizing multiple length scales (multiple grids). They consist of the following complementary components:
-Relaxation. The relaxation procedure, also called smoother in multigrid lingo, is basically a simple (and inexpensive) iterative method like Gauß-Seidel, damped Jacobi or block Jacobi. Its election depends on the target problem, but if well chosen, it is able to reduce the high-frequency or oscillatory components of the error in relatively few steps. -Coarse-Grid Correction. Smoothers are ineffectual in attenuating low-frequency content of the error, but since the error after relaxation should lack the oscillatory components, it can be well-approximated using a coarser grid. On that grid, errors appear more oscillatory and thus the smoother can be applied effectively. New values are transferred afterwards to the target grid to update the solution.
The Coarse-Grid Correction can be applied recursively in different ways, constructing different cycling strategies. Algorithm 1 shows the pseudo-code of one of the most popular choices, known as V-cycle due to its pattern. This algorithm telescopes down to a given coarsest grid, and then works its way back to the target finest grid. The transfer operators I 2h h and I h 2h connect the grids levels: I 2h h is known as the restriction operator and transfers values from a finer to a coarser level, whereas I h 2h is known as the prolongation operator and maps from a coarser to a finer level.
The most time-consuming part of a multigrid method is the smoother and hence is the primary parameter in optimizing the solve time for a particular problem. In this initial study we have focused on point-wise smoothers. Block smoothers [12, 11] are more efficient in certain problems but are beyond the scope of this paper and will not be addressed at this time. The discussion about the implementation of point-wise smoothers is taken up in Section 4.
Experimental Platform
Our experimental platform consists in an IBM's Power5 processor running under Linux, the main features of which are summarized in Table 1 . This processor has introduced SMT to the IBM's Power family [6] . With this design, each core of this dual-core processor appears to software as two logical CPUs, usually denoted as threads, that share some resources such as functional units or the memory hierarchy.
Apart from SMT, we should also highlight the impressive memory subsystem of the Power5. The memory controller is moved on chip and the main memory is directly connected to the processor via three buses: the address/command bus, the unidirectional write data bus, and the unidirectional read data bus. The 36-MB off-chip L3 has been removed from the path between the processor to the memory controller and operates as a victim cache for the L2. This means that data is transferred to the L3 only when it is replaced from the L2.
Finally, it is worth to mention that the exploitation of SMT has been performed in this work by means of OpenMP directives, which are directly supported by the IBM's FORTRAN compiler. Single thread performance have been measured using a sequential code and enabling the Power5 single-threaded mode of the Power5, which gives all of the chip resources to one of the logical CPUs.
Cache-aware Red-Black Smoothers
Gauß-Seidel has long been the smoother of choice within multigrid on both structured and unstructured grids [1] . Although, it is inherently sequential in its natural form (the lexicographic ordering), it is possible to expose parallelism by applying multi-coloring, i.e. splitting grid nodes into disjoint sets, with each set having a different color, and updating simultaneously all nodes of the same color.
The best known example of this approach is the red-black Gauß-Seidel for the 5-point Laplace stencil, which is schematically illustrated in Figure 1 . For the 9-point Laplacian, a red-black ordering may lead to a race condition (depending on the implementation), and at least a four color ordering of the grid space is needed to decouple the grid nodes completely. Apart from exposing parallelism, multi-coloring also impacts on the convergence rate, but unlike other techniques such as block Gauß-Seidel (i.e. applying Gauß-Seidel locally on every processor), the overall multigrid convergence rates remain satisfactory.
Unfortunately, multi-coloring deteriorate the memory access and may lead to poor performance. Algorithm 4 shows the pseudo-code of a red-black smoother, denoted as rb1 by the DIME project. This naïve implementation performs a complete sweep through the grid for updating all of the red nodes, and then another complete sweep for updating all of the black nodes. Therefore, rb1 exhibits lower spatial locality than a lexicographic ordering. Furthermore, if the target grid is large enough, temporal locality is not fully exploited.
Algorithm 2 Red-Black Gauß-Seidel naïve implementation
for it=1,nIter do // red nodes: for i = 1; n-1 do for j = 1+(i+1)%2; n-1; j=j+2 do Relax point( i,j ) end for end for // black nodes: for i = 1, n-1 do for j = 1+i%2; n-1; j=j+2 do Relax point( i,j ) end for end for end for Alternatively, some authors have successfully improved cache reuse (locality) using loop reordering and data layout transformations that were able to improve both temporal and spatial data locality [13, 18] .
Following these previous studies, in this paper we have used as baseline codes the different red-black smoothers developed within the framework of the DIME project. To simplify matters, these codes are restricted to 5-point as well as 9-point discretization of the Laplacian operator. Figures 2-4 illustrate some of them, which are based on the following observations:
-The black nodes of a given row i − 1 can be updated once the red nodes of the i row has been updated. This is the idea behind the DIME's rb2 (see figure 2 ) and rb3 schemes, which improve both temporal and spatial locality fusing the red and black sweeps. -If several successive relaxation have to be performed, additional improvements can be achieved transforming the iteration traversal so that the operations are performed on small 1D or 2D tiles of the whole array. Data within a tile is used as many times as possible before moving to the next tile. DIME's rb4-rb9 schemes perform different 1D and 2D tiling transformations. Figures  3 and 4 illustrate the rb5 and rb9 schemes respectively. Tables 2 and 3 show the MFlops achieved by DIME's rb1-9 codes on our target platform. The speedup of the best transformation range from 1.2 to 1.8 for the 5-point stencil, and from 1.15 to 1.25 for the 9-point version. Our first insight is that these gains are lower than on other architectures. For instance, the improvements on a DEC PWS 500au reported on DIME's website reach a factor of 4 [3] . Furthermore, the sophisticated two-dimensional blocking transformation DIME's rb9 does not provide additional improvements, being DIME's rb7 and sometimes DIME's rb2 the most effective transformations. Fig. 2 . DIME's rb2. The update of red and black nodes is fused to improve temporal locality. The main reason behind this difference in behavior is the relatively large amount of on-chip and off-chip caches included in the IBM's Power5, as well as their higher degree of associativity.
SMT-aware Red-Black Smoothers
The availability of SMT introduces a new scenario in which thread-level parallelism can also be applied to hide memory accesses. As mentioned above, SMT processors can be seen as a set of logical processors that share execution units, systems buses and the memory hierarchy. This logical view suggests the application of the general principles of data partitioning to get the multithreaded versions of the different DIME variants of the red-black Gauß-Seidel smoother. This strategy, which can be easily expressed with OpenMP directives, is suitable for shared memory multiprocessor. However, in a SMT microprocessor, the similarities amongst the different threads (they execute the same code with just a different input dataset) may cause contention since they have to compete for the same resources.
Alternatively, we have employed a dynamic partitioning where computations are broken down into different tasks with are assigned to the pool of available threads. Intuitively, the smoothing of the different colors is interleaved by assigning the relaxation of each color to a different thread. This interleaving is controlled by a scheduler, which avoids race conditions and guarantees a deterministic ordering.
Algorithm 3 shows a pseudo-code of this approach for red-black smoothing. Our actual implementation is based on the OpenMP's parallel and critical directives. The critical sections introduce some overhead but are necessary to avoid race-conditions. However, the interleaving prompted by the scheduling allows the black thread to take advantage of some sort of data prefetching since it processes grid nodes that have just been processed by the red thread, i.e. the red thread acts as a helper thread that performs data prefetching for the black one. This interleaved approach can also be combined with DIME's rb4-8 variants. If two successive iterations have to be performed, the intuitive idea is that one thread performs the first relaxation step whereas the other performs the second one. The scheduler guarantees again a deterministic ordering.
In the next section we compare the performance of this novel approach over traditional block-outer, cyclic-outer and cyclic-inner distributions of the relaxation nested loop. For the naïve implementation, all of them are straightforward. However, for DIME's rb2-8 variants, the cyclic-outer version is nondeterministic, whereas the block-outer requires the processing of block boundaries in advance.
We have omitted a parallel version of DIME's rb9 since even in the sequential setting, that version does not provide superior performance over DIME's rb5-8. We have also omitted block-outer and cyclic-outer distributions of the red-black smoother for the 9-point stencil, since they are also non-deterministic. Note, however, that both the cyclic-inner and our interleaved approach avoid raceconditions. Figure 5 shows the speedup achieved by the different parallel strategies over the baseline code (with the best DIME's transformation) for the the 5-point stencil. Fig. 5 . Speedup achieved by different parallel implementations of a red-black Gauß-Seidel smoother for a 5-point Laplace stencil. Sched denotes our strategy, whereas DP and Cyclic denote the best block and a cyclic distribution of the smoother's outer loop respectively. MELT is the number of successive relaxations that have been applied.
As can be noticed, the election of the most suitable strategy depends on the grid size:
-For small and medium grid sizes block and cyclic distributions outperforms our approach, although for the smallest sizes none of them is able to improve performance. This is the expected behavior given that for small and medium working sets, memory bandwidth and data cache exploitation are not a key issue and traditional strategies beats our approach on performance due to the overheads introduced by the dynamic task scheduling. -For large sizes we observe the opposite behavior given that the overheads involved in task scheduling become negligible, whereas the competition for memory resources becomes a bottleneck in the other versions. In fact, we should highlight that the block and cyclic distributions become clearly inefficient for large grids.
-The break-even point between the static distributions and our interleaved approach is a relative large grid due to the impressive L3 cache (36 MB) of the Power5. Figure 6 confirms some of these observations for the the 9-point stencil. Furthermore, the improvements over DIME's variants are higher in this case, since this is a more demanding problem. Fig. 6 . Speedup achieved by different parallel implementations of a red-black Gauß-Seidel smoother for a 9-point Laplace stencil. Sched denotes our strategy, whereas Cyclic denotes the best cyclic distribution of the smoother's inner loop. MELT is the number of successive relaxations that have been applied.
Conclusions
In this paper, we have introduced a new implementation of red-black Gauß-Seidel Smoothers, which on SMT processors fits better than other traditional strategies. From the results presented above, we can draw the following conclusions: -Our alternative strategy, which implicitly introduce some sort of tiling amongst threads, provide noticeable speed-ups that match or even outperform the results obtained with the different DIME's rb2-9 variants for large grid sizes. Notice that instead of improving intra-thread locality, our strategy improves locality taking advantage of fine-grain thread sharing.
-For large grid sizes, competition amongst threads for memory bandwidth and data cache works against traditional block distributions. Our interleaved approach performs better in this case, but suffers important penalties for small grids, since its scheduling overheads does not compensate its better exploitation of the temporal locality. Given that multigrid solvers process multiple scales, we advocate hybrid approaches.
We are encouraged by these results, and based on what we have learned in this initial study we are proceeding with:
-Analyzing more elaborated multigrid solvers.
-Combining interleaving with grid partitioning distributions to scale beyond two threads. The idea is to use grid partitioning to distribute data amongst a large scale system, and interleaving to exploit thread level parallelism inside their cores.
