In this paper, we investigate compiler algorithms that support compiled communication in multiprocessor environments and study the benefits of compiled communication, assuming that the underlying network is an all-optical Time-Division Multiplexing (TDM) network. We present an experimental compiler, E-SUIF, that supports compiled communication for High Performance Fortran (HPF) like programs on all-optical TDM networks, and we describe and evaluate the compiler algorithms used in E-SUIF. We further demonstrate the effectiveness of compiled communication on all-optical TDM networks by comparing the performance of compiled communication with that of a traditional communication method using a number of application programs.
Introduction
In this paper, we describe and evaluate the compiler algorithms used in E-SUIF. We further demonstrate the benefits of compiled communication on all-optical TDM networks by comparing the performance of compiled communication with that of a traditional communication method using a number of application programs. Note that there are many issues in optimizing parallel applications, such as the mapping of processing and data into physical processes. In this paper, however, we focus on optimizing communication through compiled communication and assume that those issues are addressed by other parts of the compiler.
The rest of the paper is organized as follows. Section 2 describes the related work. Sec- 
Related work
Many projects have focused on reducing the communication overheads in the software messaging layer [8, 9, 18]. While this approach is beneficial for all types of communication, it does not expose architecture-dependent optimization opportunities to the compiler.
Many parallel compiler projects have tried to improve communication performance by generating efficient communication code for distributed memory machines [1, 2, 7, 10, 12, 21]. To simplify the compilation, these compilers use the dynamic communication model and do not exploit the potential of compiled communication. Communication analysis and optimization have been applied in parallel compilers. Early approaches optimize communications in a single loop nest using data dependence information [1, 12]. Later, data flow analysis techniques were developed to obtain information for global communication optimizations [7, 10, 15, 24]. However, the analysis only obtains logical communication information, which is insufficient for compiled communication. The interaction between the communication sub-system and other components, such as program partitioning and data mapping, has also been investigated [22], which offers another dimension for communication optimization.
Research on compiled communication has drawn the attention of a number of research groups [4, 6, 11, 16]. In [4], the compiler applies the compiled communication technique specifically to the stencil communication pattern. In [16], a special purpose machine is designed to support compiled communication. Compiled communication was proposed for a general purpose machine using a multistage interconnection network in [6]. However, since a multistage interconnection network without multiplexing has very limited capacity to support connections, compiled communication results in excessive synchronization overhead.
The work in [11] proposed to amortize the startup overhead using long-lived connections and to perform architecture-dependent communication optimizations. However, the compiler in 
Programming Model
We consider structured HPF-like programs, which contain conditionals and nested loops, but no arbitrary goto statements. The array subscripts are assumed to be of the form α * i + β, where α and β are invariants and i is a loop index variable. The programmer explicitly specifies the data alignments and distributions. To simplify the discussion, we assume in this paper that all arrays are aligned to a single virtual processor grid template, and the data distribution is specified through the distribution of the template. For example, in the program in Figure 2, shared arrays x and y are aligned to VPROCS. E-SUIF itself handles multiple virtual processor grids.
Arrays are aligned to the virtual processor grid by simple affine functions. The alignments allowed are scaling, axis alignment and offset alignment. The mapping from a point d in the data space to the corresponding point v in the virtual processor grid is specified by an alignment matrix M and an alignment offset vector α, that is, v = M d + α. The distribution of the virtual processor grid can be cyclic, block or block-cyclic. Assuming that there are p processors in a dimension, and the block size of that dimension is b, the virtual processor v is in physical processor
where N is the size of the dimension. We will use the notation block-cyclic(b, p) to denote the block-cyclic distribution with block size b over p processors for a specific dimension of a distributed array.
will discuss these two steps.
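As an illustration of the alignment and distribution steps, the following minimal sketch maps a data-space point to a virtual processor with v = M d + α and then to a physical processor. The ownership rule used here, (v div b) mod p, is the usual HPF block-cyclic convention and is an assumption of this sketch, as are the treatment of block distribution as block-cyclic with b = ceil(N / p), of cyclic distribution as block-cyclic with b = 1, and all function names.

```python
from math import ceil

def align(M, d, alpha):
    """Map data-space point d to virtual-processor point v = M d + alpha."""
    return [sum(M[r][c] * d[c] for c in range(len(d))) + alpha[r]
            for r in range(len(M))]

def owner_block_cyclic(v, b, p):
    """Physical processor holding virtual processor v under block-cyclic(b, p).

    Assumed convention: blocks of b consecutive virtual processors are
    dealt round-robin to the p physical processors of this dimension.
    """
    return (v // b) % p

def owner_cyclic(v, p):
    """Cyclic distribution, treated here as block-cyclic with block size 1."""
    return owner_block_cyclic(v, 1, p)

def owner_block(v, N, p):
    """Block distribution, treated here as block-cyclic with b = ceil(N / p)."""
    return owner_block_cyclic(v, ceil(N / p), p)
```

For instance, with block-cyclic(2, 3), virtual processors 0 through 5 fall on physical processors 0, 0, 1, 1, 2, 2, and virtual processor 6 wraps around to physical processor 0.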
Communication analysis

Logical communication analysis
E-SUIF uses a demand-driven communication analyzer [24] to analyze the logical communication requirements of a program. The analyzer performs message vectorization, global redundant communication elimination and global message scheduling [7] optimizations and represents logical communications using Section Communication Descriptors (SCDs). In the rest of this subsection, we describe the SCD and how SCDs are used to represent logical communications. Details about the communication optimizations and the analyzer can be found in [24].
Physical communication analysis
This subsection describes the algorithms to compute physical communications from SCDs.
We assume that the physical processor grid has the same number of dimensions as the logical processor grid. Processor grid will denote both physical and logical processor grids. Notice that this is not a restriction because a dimension in the physical processor grid can always be collapsed by assigning a single physical processor to that dimension. Notice also that calculating physical communication does not take the network topology into consideration.
One-Dimensional arrays and one-dimensional processor grids
Let us consider the case where the distributed array and the processor grid are one-dimensional. The logical destination processor can be computed by first solving the equation M_A * n + v_A = α * i + β for i
and then replacing the value of i in dst to obtain the logical destination processor γ * (M_A * n + v_A − β)/α + δ. Thus, the physical destination processor is given by
The array region D may need to be expanded using the communication qualifier Q or using the range for a loop index variable when the communication is inside a loop. The physical
Lemma: Assume that the template is distributed using the block-cyclic(b, p) distribution.
has the same source and destination as the communication for
The implication of the lemma is that the physical communication pattern for an SCD can be determined by examining the communications for at most p^2 b^2 elements. In addition, when the upper bound of D is unknown, the communication pattern can be approximated by considering all elements up to the repetition point. Figure 3 shows the algorithm to compute the physical communication pattern for a one-dimensional array and a one-dimensional virtual processor grid. Given D = l : u : s and
if (the form of CM cannot be processed) then
    return all-to-all connections
end if
pattern = {(*, *, ..., *) → (*, *, ..., *)}
for each dimension i in array A do
    Let sd be the corresponding dimension in source processor grids.
    Let dd be the corresponding dimension in destination processor grids.
    , ⊥)
    end if
    pattern = cross product(pattern, 1dpattern)
end for
pattern = source processor constants(pattern)
for each element i in the mapping qualifier do
    Let dd be the corresponding destination processor dimension.
algorithm can be designed.
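The one-dimensional case can be illustrated with a minimal sketch. It is not the algorithm of Figure 3; it only enumerates elements of D = l : u : s and relies on the lemma to stop after p^2 b^2 elements. The sketch assumes that element n of the array is aligned to virtual processor M_A * n + v_A, that the mapping relation has src = α * i + β and dst = γ * i + δ, that ownership follows the (v div b) mod p convention, that elements for which i is not an integer are skipped, and that pairs with identical source and destination need no connection. The function name and parameter order are hypothetical.

```python
def pattern_1d(l, u, s, M_A, v_A, alpha, beta, gamma, delta, b, p):
    """Collect (source, destination) physical processor pairs for D = l:u:s."""
    limit = p * p * b * b          # bound on elements to examine (lemma)
    pairs = set()
    examined = 0
    n = l
    while n <= u and examined < limit:
        v_src = M_A * n + v_A      # virtual processor holding element n
        num = v_src - beta
        if num % alpha == 0:       # assumption: skip non-integer solutions for i
            i = num // alpha       # solves alpha * i + beta = v_src
            v_dst = gamma * i + delta
            src = (v_src // b) % p # assumed block-cyclic(b, p) ownership
            dst = (v_dst // b) % p
            if src != dst:         # assumption: local copies need no connection
                pairs.add((src, dst))
        n += s
        examined += 1
    return pairs
```

For a unit shift (src = i, dst = i + 1) on an identity-aligned array with block-cyclic(2, 2), the sketch yields the two connections 0 → 1 and 1 → 0, and it examines at most 16 elements regardless of the upper bound of D.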
Multi-dimensional arrays and multi-dimensional processor grids
The algorithm to compute physical communications for multi-dimensional arrays and multi-dimensional processor grids is given in Figure 4. In the algorithm, we use the notation ⊥ to represent a "don't care" parameter. In an n-dimensional processor grid, a processor is represented by an n-dimensional coordinate (p_1, p_2, ..., p_n). The algorithm determines all pairs of source and destination processors that require communication by reducing the problem to a set of one-dimensional communication sub-problems.
The first step in the algorithm is to check whether the mapping relation can be processed.
If one loop induction variable occurs in two or more dimensions in CM.src or CM.dst, the algorithm cannot find the correlation between dimensions in source and destination processors, and the communication pattern for the SCD is approximated by all-to-all connections.
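The mapping relation test and the combination of per-dimension results can be sketched as follows. This is a simplified illustration rather than the algorithm of Figure 4: it assumes that CM.src and CM.dst are given as lists holding, for each dimension, the name of the loop induction variable used there (or None for a constant), that dimension k of the source grid corresponds to dimension k of the destination grid, and that each one-dimensional pattern is a set of (source, destination) coordinates that includes pairs with no movement in that dimension. All names are hypothetical.

```python
from collections import Counter
from itertools import product

def can_process(cm_src, cm_dst):
    """False if an induction variable occurs in two or more dimensions
    of CM.src or of CM.dst."""
    for side in (cm_src, cm_dst):
        counts = Counter(v for v in side if v is not None)
        if any(c > 1 for c in counts.values()):
            return False
    return True

def all_to_all(shape):
    """Conservative approximation: every processor talks to every other."""
    procs = list(product(*(range(k) for k in shape)))
    return {(s, d) for s in procs for d in procs if s != d}

def cross_product(patterns_1d):
    """Combine per-dimension (src, dst) pairs into n-dimensional pairs."""
    result = set()
    for combo in product(*patterns_1d):
        src = tuple(s for s, _ in combo)
        dst = tuple(d for _, d in combo)
        result.add((src, dst))
    return result

def physical_pattern(cm_src, cm_dst, patterns_1d, shape):
    if not can_process(cm_src, cm_dst):
        return all_to_all(shape)
    return cross_product(patterns_1d)
```

On a 2 × 2 grid, a swap in the first dimension combined with no movement in the second gives four connections, whereas a relation such as src = (i, i) fails the test and falls back to all twelve all-to-all connections.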
If the SCD passes the mapping relation test, the algorithm initializes the communication 
Performance of the compiler algorithms
This section evaluates the performance of the compiler algorithms for compiled communication in E-SUIF. We evaluated both the efficiency and the effectiveness of the algorithms.
We use the greedy scheduling algorithm [23] in the evaluation and assume that the underlying network is an 8 × 8 torus with a maximum multiplexing degree of 10. Note that the network topology, the network size and the multiplexing degree affect the performance of
We use benchmarks, listed in Table 1, from the HPF benchmark suite [13] at Syracuse University in the evaluation. In the table, the data distributions of the major arrays are obtained from the original benchmark programs. Table 2
E-SUIF conservatively estimates the set of physical connections in each phase and uses a resource scheduling algorithm to determine the multiplexing degree needed for each phase.
protocol [25] to reserve a path for each connection before the transmission of data takes place. The forward reservation protocol works as follows. When the source node wants to send data, it first sends a reservation packet toward the destination to establish a lightpath. The reservation fails if the wavelength in some link along the path is not available. Otherwise, the lightpath is established between the source and the destination, and the destination notifies the source that the lightpath has been established. After that, the source can start sending data.
We made the following general assumptions in the performance study. The network topology is an 8 × 8 torus with variable multiplexing degrees. XY routing is used to establish the connections in the torus. We assume that the physical identifiers of the processors follow row-major numbering and that the mapping function between logical processor numbers and physical identifiers is the identity function. Other assumptions that are specific to an experiment are stated with that experiment.
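The routing and reservation assumptions can be made concrete with a minimal sketch. It assumes that X is the column index and Y the row index, that each dimension is traversed in the shorter direction around the ring with ties going in the increasing direction, that links are directed, and that a reservation either succeeds on every link of the path or changes nothing. The real forward reservation protocol exchanges reservation and acknowledgement packets hop by hop; the atomic check below stands in for that exchange. The names xy_route, reserve and busy are hypothetical.

```python
W = H = 8  # 8 x 8 torus, processors numbered in row-major order

def coords(pid):
    """Row-major identifier -> (row, column)."""
    return divmod(pid, W)

def step_toward(a, b, size):
    """One hop from a toward b on a ring of `size` nodes, shorter way round."""
    forward = (b - a) % size
    return (a + 1) % size if forward <= size - forward else (a - 1) % size

def xy_route(src, dst):
    """Directed links used when routing first along X, then along Y."""
    r, c = coords(src)
    dr, dc = coords(dst)
    links = []
    while c != dc:                       # X phase: move within the row
        nc = step_toward(c, dc, W)
        links.append((r * W + c, r * W + nc))
        c = nc
    while r != dr:                       # Y phase: move within the column
        nr = step_toward(r, dr, H)
        links.append((r * W + c, nr * W + c))
        r = nr
    return links

def reserve(busy, src, dst, channel):
    """Try to reserve `channel` on every link of the XY path.

    `busy` maps a directed link to the set of channels already in use.
    Returns False, leaving `busy` unchanged, if any link lacks the channel.
    """
    path = xy_route(src, dst)
    if any(channel in busy.get(link, set()) for link in path):
        return False
    for link in path:
        busy.setdefault(link, set()).add(channel)
    return True
```

With an empty `busy` table, reserving channel 0 from processor 0 to processor 9 succeeds and occupies links 0 → 1 and 1 → 9; a second request for channel 0 from 0 to 1 then fails, while the same request on channel 1 succeeds.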
Two sets of experiments are performed in this study. The first set of experiments uses hand-coded application programs whose communications are highly optimized for the Cray- 
Hand-coded parallel application programs
This set includes three programs: GS, TSCF and P3M. These programs are well-designed parallel applications whose communication performance is highly tuned for the Cray-T3E. The GS benchmark uses Gauss-Seidel iterations to solve the Laplace equation on a discretized unit square with Dirichlet boundary conditions. The TSCF program simulates the evolution of a self-gravitating system using a self-consistent field approach. P3M performs particle-particle particle-mesh simulation.
HPF benchmarks
Conclusion
In this paper, we present the E-SUIF compiler that supports compiled communication for
Networks. Rajiv is a member of ACM and a senior member of IEEE.
