Abstract
Introduction
Recently, researchers have shown that the communication performance of dense matrix computation applications can be greatly improved by allowing the compiler to manage network resources using the compiled communication technique [2, 5, 7]. In compiled communication, the compiler determines the communication requirements of a program and manages network resources statically to support efficient communication for the program. A number of compiler issues must be addressed in order to apply the compiled communication technique. First, traditional communication analysis techniques [4, 6, 8] represent communications in logical forms (logical communications), such as the Available Section Descriptor (ASD) [4], the Section Communication Descriptor (SCD) [8] and a linear algebra framework [6]. While these descriptors con- ... Deriving physical communications from logical communications is a nontrivial task and usually requires approximations because of variables whose values are unknown at compile time. Second, because network resources are limited, the compiler must partition a program into phases such that each phase contains fixed, pre-determined communication patterns that the underlying network can support. Since the network must be reconfigured at phase boundaries, the compiler should, to obtain high performance, incorporate as many communications as possible in each phase without exceeding the network capacity, thereby reducing the number of reconfigurations at runtime.
We have addressed these issues in the E-SUIF compiler, an extension of the Stanford SUIF compiler [1]. E-SUIF supports compiled communication for HPF-like programs on optical Time-Division-Multiplexing (TDM) networks. While E-SUIF targets optical TDM networks, most of the techniques for supporting compiled communication can be applied to other types of networks. Figure 1 shows the major components of E-SUIF. The first phase is a traditional communication analyzer that analyzes the logical communication requirements of a program and performs a number of high-level communication optimizations. The second phase, logical to physical processor mapping, derives physical communications from logical communications. The resource scheduling component determines whether a set of physical communications can be supported by the underlying network and, if so, how the network resources are scheduled to support them. The third phase, communication phase analysis, uses the resource scheduling algorithms to partition the program into phases such that the communications in each phase can be supported by the underlying network and the number of phases at runtime is minimized. The E-SUIF compiler outputs a program with physical communications, phases and the resource schedule for each phase. In this paper, we discuss the techniques used in E-SUIF.
The rest of the paper is organized as follows. Section 2 briefly introduces the background. Section 3 describes algorithms to derive physical communications from SCDs. Section 4 presents the communication phase analysis algorithm. Section 5 evaluates the performance of the algorithms and Section 6 concludes the paper.
Background
We have previously developed a traditional communication analyzer based on a demand-driven dataflow analysis technique [8]. The work in this paper is built on top of that analyzer. In this section, we describe the programming model and the dataflow communication descriptor used in the analyzer, the Section Communication Descriptor (SCD).
Programming model
We consider structured HPF-like programs, which contain conditionals and nested loops but no arbitrary goto statements. Array subscripts are assumed to be of the form $\alpha i + \beta$, where $\alpha$ and $\beta$ are invariants and i is a loop index variable. The programmer explicitly specifies the data alignments and distributions. To simplify the discussion, we assume in this paper that all arrays are aligned to a single virtual processor grid template and that the data distribution is specified through the distribution of the template. Our compiler, however, handles multiple virtual processor grids. Arrays are aligned to the virtual processor grid by simple affine functions. The alignments allowed are scaling, axis alignment and offset alignment. The mapping from a point d in data space to the corresponding point ṽ on the virtual ...

[Figure: example program fragment with alignment directives and the communication descriptor C1]
    ALIGN (i,j) with VPROCS(2*j, i+2, 1) : x
    ALIGN (i) with VPROCS(1, i+1, 2) : y
    DO 100 i = 1, 5
    DO 100 j = 1, 5
      C1: <y, (i), (1, i+1, 2) -> (2*j, i+2, 1), ?, ?>
      x(i, j) = y(i)...

... where N is the size of the dimension. We use the notation block-cyclic(b, p) to denote the block-cyclic distribution with a block size of b over p processors for a specific dimension of a distributed array.
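To make the block-cyclic(b, p) notation concrete, the following minimal sketch computes which processor owns a given template position. It assumes 0-based positions (HPF itself is 1-based) and an affine alignment of the form scale*i + offset. The function names `block_cyclic_owner` and `virtual_position` are hypothetical and are not part of E-SUIF.

```python
def virtual_position(i, scale, offset):
    """Affine alignment of array index i to a template position.

    Assumed form: scale * i + offset (scaling plus offset alignment).
    """
    return scale * i + offset


def block_cyclic_owner(position, b, p):
    """Processor that owns a 0-based template position under block-cyclic(b, p).

    Blocks of b consecutive positions are dealt to the p processors
    in round-robin order.
    """
    return (position // b) % p


# Owners of template positions 0..7 under block-cyclic(2, 3).
owners = [block_cyclic_owner(k, b=2, p=3) for k in range(8)]
print(owners)
```

With b = 2 and p = 3, positions 0 and 1 fall on processor 0, positions 2 and 3 on processor 1, and the assignment wraps around to processor 0 again at position 6.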
Section Communication Descriptor

Logical to physical processor mapping
The SCD represents communication in a logical form and does not provide sufficient information for compiled communication, which requires knowledge of the detailed connection requirements of a program. This section describes algorithms to compute physical communications from SCDs. We assume that the physical processor grid has the same number of dimensions as the logical processor grid and use processor grid to denote both physical and logical processor grids. The implication of the lemma is that the algorithm to determine the communication pattern for an SCD can stop when the repetition point occurs. Figure 3 shows the algorithm. The algorithm first checks whether the SCD can be processed. If the SCD does not contain sufficient information, the physical communication is approximated by All-to-All communication. Otherwise, the algorithm considers each element in D until the repetition point is found or all elements in D have been considered. The algorithm has a time complexity of $O(p^2 b^2)$ and can easily be extended to handle the case in which the source array has a different distribution from the destination array.
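The idea of stopping at a repetition point can be sketched for the one-dimensional case as follows. This is an illustrative sketch under stated assumptions, not the algorithm of Figure 3: it assumes 0-based positions, identical block-cyclic(b, p) distributions for source and destination, affine source and destination positions given as hypothetical `(alpha, beta)` pairs, and it detects repetition by tracking the pair of positions modulo b*p. A coefficient of `None` stands for a value unknown at compile time and triggers the All-to-All approximation.

```python
def owner(position, b, p):
    """Processor owning a 0-based position under block-cyclic(b, p)."""
    return (position // b) % p


def physical_pattern_1d(lo, hi, step, src, dst, b, p):
    """Return the set of (source processor, destination processor) pairs.

    src and dst are (alpha, beta) pairs describing positions alpha*i + beta
    for i in lo..hi. All names and the state encoding are assumptions.
    """
    if None in (*src, *dst):
        # Insufficient information: approximate by All-to-All.
        return {(s, d) for s in range(p) for d in range(p) if s != d}

    period = b * p  # ownership depends only on position modulo b*p
    seen = set()
    connections = set()
    for i in range(lo, hi + 1, step):
        s_pos = src[0] * i + src[1]
        d_pos = dst[0] * i + dst[1]
        state = (s_pos % period, d_pos % period)
        if state in seen:
            break  # repetition point: later elements add nothing new
        seen.add(state)
        s_proc, d_proc = owner(s_pos, b, p), owner(d_pos, b, p)
        if s_proc != d_proc:
            connections.add((s_proc, d_proc))
    return connections


# Shift by one element over 100 elements, block-cyclic(2, 4).
shift = physical_pattern_1d(0, 99, 1, src=(1, 0), dst=(1, 1), b=2, p=4)
print(sorted(shift))
```

In this sketch the loop visits at most (b*p)^2 distinct states before a repetition must occur, so the example above stops after 8 of the 100 elements and reports a ring of four connections.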
One-dimensional arrays and one-dimensional processor grids
Multi-dimensional arrays and multi-dimensional processor grids
The algorithm to compute physical communications for multi-dimensional arrays and multi-dimensional processor grids is given in Figure 4. In an n-dimensional processor grid, a processor is represented by an n-dimensional coordinate $(p_1, p_2, \ldots, p_n)$. The algorithm determines all pairs of source and destination processors that require connections.
The algorithm first checks whether the mapping relation can be processed. If one loop induction variable occurs in two or more dimensions in CM.src or CM.dst, the algorithm cannot find the correlation between dimensions in source and destination processors, and the communication 
Communication Phase analysis
The communication phase analysis is carried out recursively on the high-level SUIF representation of a program [1]. SUIF represents a program in a hierarchical manner. A SUIF representation of a program contains a list of nodes, which may in turn contain sub-lists. The nodes that contain sub-lists are called composite nodes. The communication phase analysis algorithm associates two variables with each composite node: pattern, which represents the communication pattern that is exposed from the sub-lists, and killphase, which indicates whether the sub-lists contain phases. The algorithm examines all these annotations in each node from back to front, accumulating communications. When the accumulated communications exceed the network capacity, a phase is generated. A phase is also created when a killphase annotation is encountered, which indicates that there are phases in the sub-lists.
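The back-to-front accumulation can be sketched on a single flat list of nodes as follows. This is a simplified illustration, not the recursive E-SUIF algorithm: each node is modeled as a hypothetical `(pattern, killphase)` pair, the capacity check is passed in as a predicate `fits`, and the sketch assumes that after a killphase node the accumulation restarts from that node's exposed pattern.

```python
def partition_into_phases(nodes, fits):
    """Group communications into phases, scanning nodes from back to front.

    nodes: list of (pattern, killphase) pairs, where pattern is a set of
           connections and killphase is a bool (both names are assumptions).
    fits:  predicate telling whether the network can support a set of
           connections simultaneously.
    Returns the phases in program order, each as a set of connections.
    """
    phases = []
    current = set()
    for pattern, killphase in reversed(nodes):
        if killphase:
            # The sub-lists already contain phases: close the current one.
            if current:
                phases.append(current)
            current = set(pattern)
            continue
        merged = current | set(pattern)
        if fits(merged):
            current = merged
        else:
            # Capacity exceeded: emit a phase and start a new one.
            if current:
                phases.append(current)
            current = set(pattern)
    if current:
        phases.append(current)
    phases.reverse()
    return phases


# Toy capacity model (an assumption): at most three connections per phase.
nodes = [({(0, 1)}, False), ({(1, 2)}, False),
         ({(2, 3), (3, 0)}, False), ({(0, 2)}, False)]
phases = partition_into_phases(nodes, lambda conns: len(conns) <= 3)
print(phases)
```

In practice the predicate would be backed by a resource scheduling algorithm rather than a simple count. The count is used here only to keep the example self-contained.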
Performance evaluation
This section evaluates the performance of the compiler algorithms. In the evaluation, we assume that the underlying network is an 8 × 8 torus with a maximum multiplexing degree of 10. That is, each link in the network can support up to 10 channels.
Programs from the HPF benchmark suite at Syracuse University [3] are used to evaluate the algorithms. The benchmarks and their descriptions are listed in Table 1.

Table 2. Communication phase analysis time

... programs. However, for small programs such as the benchmarks used, the analysis time is not significant. E-SUIF estimates the set of physical connections in each phase and uses a resource scheduling algorithm to determine the multiplexing degree needed in the network to support all connections in the phase simultaneously. Table 3 shows the precision of the analysis. It compares the average number of connections and the average multiplexing degree in each phase obtained from the compiler with those in actual executions. The number of connections and the multiplexing degree in each phase during execution are obtained by accumulating the connections within each phase at runtime. For most programs, the analysis results match the actual program executions. For the programs where approximations occur, the approximation of the multiplexing degree is more precise than the approximation of the number of connections, as shown by benchmark 0022. Since the multiplexing degree determines the communication performance of a communication pattern, the approximation of the multiplexing degree is the more important one for compiled communication on optical TDM networks.
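As an illustration of how a multiplexing degree can be derived from a set of connections, the sketch below routes each connection on an n × n torus and assigns it greedily to the first time slot in which none of its links is already used. The routing policy (X dimension first, then Y, taking the shorter way around) and the first-fit heuristic are assumptions made for this example. The resource scheduling algorithm actually used in E-SUIF is not specified here and may differ.

```python
def torus_route(src, dst, n=8):
    """Directed links used from src to dst, X first and then Y.

    Processors are (x, y) coordinates on an n x n torus (an assumption).
    """
    links = []
    pos = list(src)
    for axis in (0, 1):
        forward = (dst[axis] - pos[axis]) % n
        step = 1 if forward <= n - forward else -1
        while pos[axis] != dst[axis]:
            before = tuple(pos)
            pos[axis] = (pos[axis] + step) % n
            links.append((before, tuple(pos)))
    return links


def multiplexing_degree(connections, n=8):
    """Number of time slots used by a first-fit assignment of connections."""
    slots = []  # each entry is the set of links busy in that slot
    for src, dst in connections:
        links = set(torus_route(src, dst, n))
        for busy in slots:
            if busy.isdisjoint(links):
                busy |= links
                break
        else:
            slots.append(links)
    return len(slots)


# The first two connections share the link (1,0)->(2,0); the third is disjoint.
conns = [((0, 0), (2, 0)), ((1, 0), (3, 0)), ((0, 1), (0, 3))]
degree = multiplexing_degree(conns)
print(degree)
```

Under these assumptions, a set of connections would fit in one phase of the evaluated network when the computed degree does not exceed the maximum multiplexing degree of 10.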
Conclusion
In this paper, we have presented the compiler analysis techniques used in the E-SUIF compiler to support compiled communication. Specifically, we described algorithms that compute physical communications from the Section Communication Descriptors (SCDs), which represent logical 
