A Framework for Integrated Communication and I/O Placement by Bordawekar, Rajesh et al.
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science College of Engineering and Computer Science 
1996 
A Framework for Integrated Communication and I/O Placement 
Rajesh Bordawekar 
Syracuse University 
Alok Choudhary 
Syracuse University, Electrical and Computer Engineering Department 
J Ramanujam 
Louisiana State University, Electrical and Computer Engineering Department 
Recommended Citation 
Bordawekar, Rajesh; Choudhary, Alok; and Ramanujam, J, "A Framework for Integrated Communication 
and I/O Placement" (1996). Electrical Engineering and Computer Science. 157. 
https://surface.syr.edu/eecs/157 
A Framework for Integrated Communication and I/O Placement

Rajesh Bordawekar^1, Alok Choudhary^1 and J. Ramanujam^2

^1 ECE Dept., 121, Link Hall, Syracuse University, Syracuse, NY 13244
^2 ECE Dept., Louisiana State University, Baton Rouge, LA 70803

Abstract. This paper describes a framework for analyzing dataflow within an out-of-core parallel program. Dataflow properties of the FORALL statement are analyzed and a unified I/O and communication placement framework is presented. This placement framework can be applied to many problems, including the elimination of redundant I/O incurred in communication. The framework is validated by applying it to optimize I/O and communication in out-of-core stencil problems. Experimental performance results on an Intel Paragon show significant reduction in I/O and communication overhead.

1 Introduction

It is widely acknowledged in high-performance computing circles that parallel input/output requires substantial improvement in order to make scalable computers truly usable. A parallel application may perform input/output for several reasons. These include real-time I/O, initial/final read-write, checkpointing and out-of-core computations [Bor96].

We focus on the problem of supporting out-of-core computations. Out-of-core computations are those computations whose primary data sets are stored in files in secondary memory. Specifically, we concentrate on compiling out-of-core programs developed using High Performance Fortran (HPF) [Hig93].^3 HPF is a data parallel language which provides explicit language directives to partition data over processors in certain pre-defined decomposition patterns like BLOCK and CYCLIC. This data distribution results in each processor storing a local array associated with each array distributed in the HPF program. HPF also provides data-parallel program constructs like FORALL [Hig93].

In this paper, we describe a dataflow framework for optimizing communication in out-of-core problems.
We focus on communication optimization within a single out-of-core FORALL construct. Unlike the available dataflow frameworks for optimizing inter-processor communication [KN94, KS95, GSS95], our framework takes a unified approach for placing I/O and communication calls while preserving the characteristics of these calls. All the current frameworks focus on improving communication performance by vectorizing messages, eliminating redundant communication and overlapping communication with computation. However, these frameworks do not directly extend to out-of-core problems. Another limitation of these frameworks is that they do not make efficient use of the copy-in-copy-out semantics of the HPF FORALL construct. We illustrate these points by applying two communication placement frameworks [KN94, KS95] to an out-of-core problem performing stencil computations (also called a regular problem). We then compare the results with an integrated I/O and communication placement framework which achieves substantial performance improvement by simultaneously reordering I/O and communication calls.

^3 Although the techniques are discussed with respect to HPF, they can be applied to compilation of data parallel programs in general.

The paper is organized as follows: Section 2 introduces various dataflow definitions that will be used throughout the paper. In Section 3, we present an out-of-core regular problem and analyze its communication and I/O pattern. This problem is used as a running example throughout the paper. Section 4 presents an integrated I/O and communication framework and describes its application in eliminating extra file I/O from communication. Section 5 presents experimental performance results of optimizing out-of-core communication in stencil problems using our framework. Finally, we conclude in Section 6.

2 Background

Our program representation is based on [KS95]. Let $G = (N, E)$ be the interval flow graph representing an HPF program, with $N$ nodes and $E$ edges. Let $s$ and $e$ be the unique start and end nodes of $G$. Every edge in $E$ can be classified as an entry, forward or backward edge. Let a Tarjan interval $T(h)$ represent a set of program flow nodes that correspond to a loop in the program text. $T(h)$ has a unique header $h$, where $h \notin T(h)$. For every node $n$ of the interval flow graph $G$, we define $Succ(n)$ and $Pred(n)$ as the sets of successor and predecessor nodes of $n$. The edges induce the following traversal order over $G$. Given a forward edge $(m, n)$, a Forward order visits $m$ before $n$ and a Backward order visits $m$ after $n$. Let Header denote the header node of the interval $T(n)$.
[Bor96] describes the properties of the interval flow graph.

To analyze dataflow properties of the FORALL statement, we use the classical dataflow definitions, i.e., USE, DEF, KILL. A variable is said to be USEd if it is referenced in an expression. A variable is said to be DEFed if it is initialized in an expression. The variable is said to be LIVE until it is defined again (in other words, KILLed). We can extend these definitions to objects such as arrays. An array is said to be INJURED if some elements of the array are overwritten; otherwise the array can be considered LIVE. An array is said to be ACTIVE if some of its elements are either USEd or DEFed, and these elements constitute the Active set of the array.

Recall that the FORALL statement has copy-in-copy-out semantics [Hig93]. Consequently, during the execution of a FORALL statement, old as well as new values of an array can be LIVE. In other words, the FORALL statement satisfies the DELAYED KILL property [Bor96]. We use the variable DKILL to represent an array which satisfies the DELAYED KILL property.
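The copy-in-copy-out behaviour can be illustrated with a small sketch. The Python below is an illustration only and is not taken from the paper; the function names `forall_relax` and `sequential_relax` and the sample data are hypothetical. It contrasts a FORALL-style three-point averaging update, where every right-hand side is evaluated from the old array before any element is stored (so old and new values are live at the same time), with an ordinary in-place loop.

```python
def forall_relax(a):
    """FORALL-style update: all right-hand sides read the old values."""
    old = list(a)                      # copy-in: snapshot of the old array
    new = list(a)                      # new values are built separately
    for i in range(1, len(a) - 1):     # interior points only
        new[i] = (old[i - 1] + old[i + 1]) / 2
    return new                         # copy-out: new values replace old ones


def sequential_relax(a):
    """In-place loop: a[i-1] has already been overwritten when a[i] is computed."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = (a[i - 1] + a[i + 1]) / 2
    return a


data = [0.0, 4.0, 8.0, 4.0, 0.0]
forall_result = forall_relax(data)
sequential_result = sequential_relax(data)
print(forall_result)       # [0.0, 4.0, 4.0, 4.0, 0.0]
print(sequential_result)   # [0.0, 4.0, 4.0, 2.0, 0.0]
```

The two results differ at the last interior point, which is why an out-of-core implementation has to keep the old values available (in a temporary or a separate file) until every dependent element has been computed.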
We now define some dataflow variables that will be used for analyzing communication and I/O access patterns in out-of-core programs. Let $Active^p_n$ denote the set of elements that will be used in computation in processor $p$ at a node $n$ in the interval flow graph. Similarly, let $Incore^p_n$ denote the set of elements read by processor $p$ at node $n$. $Active^p_n$ and $Incore^p_n$ are used to compute the send and receive sets for each processor, $Send^p_n$ and $Recv^p_n$. Using $Send^p_n$ and $Recv^p_n$, we can compute the set of elements communicated at a node $n$, $Comm_n$, as $\bigcup_i \{Send^i_n + Recv^i_n\}$. Similarly, we compute the set of in-core elements at node $n$, $Incore_n$, as $\bigcup_i Incore^i_n$. For every node $n$, every processor $p$ and $Send^p_n$, we define $Eio^p_n$ as the set of elements which will be sent by $p$ but are not members of $Incore^p_n$. Formally, $Eio^p_n = Send^p_n - (Incore^p_n \cap Send^p_n)$.

For any data set $d \in$ Incore or Send or Recv, the following predicates are defined. Bit vectors are used to represent individual data sets.

- $Used(n, d) \stackrel{df}{=}$ True iff a subset of $d$ is referenced at node $n$.
- $Kill(n, d) \stackrel{df}{=}$ True iff a subset of $d$ is modified at $n$.
- $Incore(n, d) \stackrel{df}{=}$ True iff a subset of $d$ is in-core at $n$.

3 I/O and Communication Optimization: An Example

Figure 1.2 presents an HPF example in which an out-of-core array a is distributed over 4 processors in BLOCK fashion. This example will be used as a running example throughout the paper. Our running example performs one-dimensional relaxation using 3-point stencil computations. The interior points of the array a are updated using a FORALL construct. To preserve the FORALL semantics, it is necessary to use temporaries to store initial and intermediate data. Since the primary data sets are stored in files, it is necessary to use two different files, the source local array file (LAF) for reading initial data and a temporary LAF to store the updated intermediate data.
After the computation is over, the temporary LAF can be renamed as the source LAF.^4

^4 A more detailed description is provided in [Bor96].

Figure 1.3 shows the pseudo-code for the stripmined program (assuming an available memory of 10 per processor). There are two stripmined iterations; each iteration reads the initial data from the source file into an in-core local array (ICLA) temp and writes the intermediate results from an ICLA temp1 to the temporary file. Each iteration, after reading the ICLA, performs communication (if required). For example, in the first iteration, processors 0, 1 and 2 send elements a(16), a(32) and a(48) to processors 1, 2, and 3 respectively. In the second iteration, processors 1, 2, and 3 send elements a(17), a(33) and a(49) to processors 0, 1, and 2. Note that this is an example of the Receiver-driven In-core communication method [Bor96].

Figure 1.4 shows the initial communication and input/output placement. The communication and input/output sets for each processor are given in global name space while the bounds for the in-core computation are given in the local stripmined space (i.e., lb=1 and ub=8). For example, Read $1{:}9_0$ means that processor 0 is reading elements a(1) to a(9), Comm $0 \stackrel{16}{\rightarrow} 1$ represents communication of element a(16) from processor 0 to processor 1 and Write $1{:}8_0$ means that processor 0 is writing elements a(1) to a(8).

[Figure 1, recovered from the extracted text; the diagrams of panels (1) and (4) are not reproduced.]

(1) Distribution of array a(1:64) over processors 0 to 3.

(2) HPF program:

          REAL A(64)
    !HPF$ PROCESSORS P(4)
    !HPF$ DISTRIBUTE A(BLOCK) ONTO P
          FORALL(i=2:63)
            a(i)=(a(i-1)+a(i+1))/2
          ENDFORALL

(3) Stripmined pseudo-code:

          do i= 1,2
            Read In-core Array Section (temp)
            Perform Communication
            do j= lb, ub
              temp1(j)= (temp(j-1)+temp(j+1))/2
            enddo
            Write In-core Array Section (temp1)
          enddo

(4) Initial placement for processors 0, 1, 2, 3:

    i=1: Read 1:9, 17:25, 33:41, 49:57; Comm 16 (0 to 1), 32 (1 to 2), 48 (2 to 3); Write 1:8, 17:24, 33:40, 49:56
    i=2: Read 8:16, 24:32, 40:48, 56:64; Comm 17 (1 to 0), 33 (2 to 1), 49 (3 to 2); Write 9:16, 25:32, 41:48, 57:64

Fig. 1. Example program.

From the computation pattern, it is easy to determine the communication pattern for each stripmined iteration [Bor96]. For example, in the first iteration, processor 0 needs to send element a(16) to processor 1. However, since processor 0 does not have element a(16) in memory, it needs to read it from the LAF and send it to processor 1. Similarly, processors 1 and 2 need to read elements a(32) and a(48) from their LAFs and send them to their respective destinations. These file reads are termed extra since the elements read are not required for computation by the owner processor. In the second iteration, processors 1, 2, and 3 also perform extra file accesses to read elements a(17), a(33) and a(49) respectively. To prevent violation of FORALL semantics, old values of elements a(17), a(33), and a(49) are read from the source LAF and communicated to the appropriate processors. It should be observed that elements a(17), a(33) and
a(49) are brought into memory in the first iteration and could be communicated before or after they are overwritten, thus minimizing extra file accesses. The example also performs redundant reads of some elements. For example, in the first iteration, processor 0 reads elements a(1) to a(9), but writes modified values of elements a(1) to a(8) while retaining the old set of elements, a(1) to a(9), in the form of the temporaries.^5 In the second iteration, processor 0 again reads the old values of elements a(8) and a(9). Therefore, these two reads are partially redundant. These partially redundant reads can be eliminated if it is possible to determine which elements can be reused across iterations.

As observed before, for our running example, communication requires both inter-processor communication (i.e., communication of in-core data) and file I/O. To reduce the communication cost, it is very important to minimize the file I/O cost (or the number of file accesses). The file accesses generated by the program can be classified into: (1) Compulsory: these accesses are required to read and write in-core data, and (2) Extra: these accesses are required for communicating off-processor out-of-core elements. The file I/O cost can be reduced by (1) eliminating partially redundant compulsory file accesses and (2) minimizing extra file accesses by communicating in-core data whenever possible. The second optimization requires reordering computation and placing the communication calls so that only in-core data is communicated [Bor96]. In an out-of-core application, the computation order is decided by the data access pattern, that is, by placement of the read/write calls.
Therefore, to minimize overhead due to file I/O in communication, it is important that both communication and I/O calls are placed at appropriate positions.

^5 Note that temporaries are marked LIVE during the FORALL computation.

4 A Framework for Integrated I/O and Communication Placement

In Section 3, we described the compilation of an out-of-core FORALL statement. We observed that the implementation of an out-of-core FORALL requires extra file accesses during communication and that a naive implementation results in reading redundant data. In this section, we propose an integrated I/O and communication placement framework that exploits the DELAYED KILL property of the FORALL construct and applies the array access information for improving the overall performance. Note that the indeterminacy in the FORALL execution order allows our framework to freely reorder in-core computations. Specifically, our framework reorders in-core computation such that communication involves only inter-processor communication. Consequently, all extra file accesses are eliminated.

4.1 The Correctness Criteria

Our integrated framework imposes the following correctness requirements:

- Safety: All data either communicated or read is used immediately.
- Sufficiency: Every in-core computation is preceded by an appropriate Read call and each non-local reference is preceded by appropriate communication.
- Balance: For every Send, there is exactly one matching Recv. Note that this condition does not apply for Read.^6

In the presence of the DELAYED KILL type of computation, the definition of Safety is considerably weakened. Hence, it is more appropriate to term it Weak Safety. Note that Weak Safety and Sufficiency are applicable to both file access and communication calls, while Balance is applicable only to communication calls. Therefore, our framework is able to take a unified approach for placing file access and communication calls while honoring their individual characteristics.

4.2 Eliminating Extra File Accesses in Communication

It should be observed that extra file accesses are generated because an array section^7 is used several times in the stripmined FORALL iterations: once by the processor that owns the section and, in the remaining cases, by other processors. If it is possible for the processors to perform computation on the common array section in the same iteration, the communication will involve only inter-processor data transfer and extra file accesses could be eliminated. To satisfy this condition, we add the following constraint to the correctness criteria.

Strict Safety Constraint

- Strict Safety: Everything that is read or communicated (i.e., sent and received) will be used only once.

Criteria Safety and Strict Safety require that the data read by processor $i$ at node $n$, $Incore^i_n$, should be used immediately and should not be used anywhere else in the computation. Computation in any processor $j$ at node $n'$ which requires elements of $Incore^i_n$ (in other words, $Recv^j_{n'} \subseteq Incore^i_n$) should, therefore, be placed at node $n$. Then, processor $i$ needs to send only the in-core data ($Send^i_n \subseteq Incore^i_n$).
Applying this condition to every processor, we can observe that if node $n$ satisfies Strict Safety, $Comm_n$ is subsumed by $Incore_n$ and therefore, the set $Eio$ is empty and all extra I/O is eliminated.

Processors $i, j$ satisfying the above requirements exhibit one or both of the following inclusion properties:

- $Recv^j_{n'} \subseteq Send^i_n \rightarrow Recv^j_{n'} \subseteq Incore^i_n$
- $Recv^i_n \subseteq Send^j_{n'} \rightarrow Recv^i_n \subseteq Incore^j_{n'}$

where $n$ and $n'$ are nodes of the interval flow graph denoting the initial placement of the computation (in other words, placement of Read calls). To find $i, j$ and $n, n'$, it is necessary to perform both Forward and Backward flow analysis.

^6 We currently use synchronous I/O calls.
^7 An element can be considered as a special case of a section.

Let us now define a predicate $Incl_{ij}(n, n')$ as follows:

- $Incl_{ij}(n, n') \stackrel{df}{=}$ True if $Recv^j_{n'} \subseteq Incore^i_n$ or $Recv^i_n \subseteq Incore^j_{n'}$

For a processor $i$, the solution of $Incl_{ij}(n, n')$, for any processor $j$ ($j \neq i$), gives the node pair $(n, n')$ satisfying the inclusion properties. The inclusion property is then verified for every Incore and Recv set in the program. If all the Incore and Recv sets satisfy the inclusion property, then the computation is said to be balanced. For balanced computation, one can eliminate extra I/O by reordering computations.

We illustrate this optimization by using our running example (Figure 1). Table 1 gives the values of various dataflow variables corresponding to the stripmined iterations (Figure 1). There are two stripmined iterations; for each iteration, Incore gives the set of elements that are brought into memory by each processor (ICLA). The corresponding Active, Send and Recv sets are also shown.

Table 1. Dataflow variables for the running example.

Iter.  Processor  Node  Incore  Active  Sent  Recv
1      0          2     1:9     1:9     16    -
1      1          2     17:25   16:25   32    16
1      2          2     33:41   32:41   48    32
1      3          2     49:57   48:57   -     48
2      0          6     8:16    8:17    -     17
2      1          6     24:32   24:33   17    33
2      2          6     40:48   40:49   33    49
2      3          6     56:64   56:64   49    -

Table 2 presents the solutions for the Incl predicate for all processors in the form of the Inclusion matrix. An entry $(n, n')$ in a position $[i, j]$ denotes the pair of nodes of the interval flow graph satisfying the inclusion equations for the processors $i$ and $j$. This entry is called a solution entry. In other words, it defines the Incore sections of processors $i$ and $j$ that satisfy the inclusion property. For example, consider the solution at position [2,1]. The solution tuple (2,6) denotes that $Recv^2_2 \subseteq Incore^1_6$, i.e., the data required by the ICLA of processor 2 at node 2 (first stripmined iteration) is part of the ICLA of processor 1 at node 6 (second stripmined iteration).
The entries in the positions [0,0] and [3,3] denote that processors 0 and 3 do not perform communication at nodes 2 and 6 respectively (in other words, in the first and second stripmined iteration). Such entries are called non-solution entries. The number of solution entries in the $i$th row or $j$th column denotes the number of times a processor $i$ or $j$ performs communication.

Table 2. Inclusion matrix for the running example.

Processor   0       1       2       3
0           (2,2)   (2,6)   -       -
1           (2,6)   -       (6,2)   -
2           -       (2,6)   -       (6,2)
3           -       -       (2,6)   (6,6)

The information provided by Table 2 can be used to reorder the computation. This reordering is an iterative procedure; every iteration tries to schedule computation such that the inclusion equations are satisfied. The iterations stop when all ICLAs represented by the solution tuples are scheduled. Let us understand the reordering procedure using our running example and its inclusion matrix.

1. In the first step, choose a random processor $i$. For our problem, let us choose processor 2. For this processor, select a solution entry from row 2, e.g., entry [2,1], which corresponds to the solution tuple (2,6). It states that $Recv^2_2 \subseteq Incore^1_6$. Therefore, the sections of the local arrays of processors 2 and 1 corresponding to nodes 2 and 6 (in the interval flow graph) should be brought into memory.

2. In the second step, using the inclusion matrix, determine if the ICLA of processor 1 requires any off-processor data. This can be determined by checking row 1 of the inclusion matrix for solution entries containing node 6. The entry [1,2] corresponds to the solution tuple (6,2), which indicates that $Recv^1_6 \subseteq Incore^2_2$. Note that the array section of processor 2 corresponding to node 2 is already in memory. Therefore, the communication between processors 1 and 2 will involve only inter-processor communication.

3. The first two steps have scheduled ICLAs of processors 1 and 2. The third step tries to schedule ICLAs of the remaining processors so that there are no extra I/O accesses. Consider processor 0. In row 0, the only solution entry involves processor 1 at node 2. Since the ICLA of processor 1 at node 2 is already scheduled, this entry cannot be used. In this case, the non-solution entry, i.e., the entry (2,2) at position [0,0], should be used. This non-solution entry suggests that the ICLA of processor 0 at node 2 does not require communication and can therefore be scheduled along with the ICLAs of processors 1 and 2. Applying the same principle to processor 3, we can see that the ICLA of processor 3 at node 6 does not require communication. Hence, this ICLA can be scheduled along with the ICLAs of processors 0, 1 and 2. For this ICLA schedule, the only communication required will be inter-processor communication between processors 1 and 2.

4. Applying the same procedure, the remaining four ICLAs can be scheduled. This ICLA schedule will involve inter-processor communication between processors 0 and 1, and between processors 2 and 3. Therefore, the overall computation involves only inter-processor communication and the extra I/O accesses are eliminated. Figure 2.A illustrates the final placement of I/O and communication calls. Figure 2.B illustrates an alternative placement. This placement is obtained using a different choice of initial processor.
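The effect of the reordering can be checked with a small sketch. The Python below is illustrative only and is not the authors' implementation: the Incore and Recv sets are copied from Table 1, while the schedule representation (a list of phases, each mapping a processor to the node whose ICLA it holds), the `owner` computation for the BLOCK distribution and the helper names `rng` and `extra_io` are assumptions made for this example. For every element a processor must receive, it tests whether the owner holds that element in core in the same phase; elements that fail the test correspond to the extra file reads ($Eio$).

```python
BLOCK = 16  # assumption: a(1:64) split BLOCK-wise over 4 processors, 16 each


def rng(lo, hi):
    """Inclusive global index range lo:hi as a set."""
    return set(range(lo, hi + 1))


# Incore and Recv sets from Table 1, keyed by (processor, node).
INCORE = {
    (0, 2): rng(1, 9),  (1, 2): rng(17, 25), (2, 2): rng(33, 41), (3, 2): rng(49, 57),
    (0, 6): rng(8, 16), (1, 6): rng(24, 32), (2, 6): rng(40, 48), (3, 6): rng(56, 64),
}
RECV = {
    (0, 2): set(), (1, 2): {16}, (2, 2): {32}, (3, 2): {48},
    (0, 6): {17},  (1, 6): {33}, (2, 6): {49}, (3, 6): set(),
}


def extra_io(schedule):
    """Elements an owner must read from its file only in order to send them."""
    extra = set()
    for phase in schedule:                     # phase: processor -> node
        for proc, node in phase.items():
            for elem in RECV[(proc, node)]:
                owner = (elem - 1) // BLOCK    # 1-based global index -> owner
                if elem not in INCORE[(owner, phase[owner])]:
                    extra.add(elem)            # owner does not hold it in core
    return extra


# Initial placement: every processor follows the stripmined order (node 2, then 6).
initial = [{0: 2, 1: 2, 2: 2, 3: 2}, {0: 6, 1: 6, 2: 6, 3: 6}]
# Placement obtained from steps 1 to 4 above, starting with processor 2.
reordered = [{0: 2, 1: 6, 2: 2, 3: 6}, {0: 6, 1: 2, 2: 6, 3: 2}]

print(sorted(extra_io(initial)))    # [16, 17, 32, 33, 48, 49]
print(sorted(extra_io(reordered)))  # []
```

The sketch only checks in-core availability of the communicated elements; it does not model the temporaries that hold the old values required by the FORALL semantics.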
[Figure 2 diagram not reproduced: panels (a) and (b) show, for the stripmined iterations i=1 and i=2, the Read, Send/Recv, in-core computation (do j = lb, ub) and Write calls of processors 0 to 3 after reordering.]
Fig. 2. Final placement of I/O and communication calls.

5 Applying Dataflow Framework to Stencil Problems

We now apply the communication and I/O placement framework to stencil problems. We illustrate using the 5- and 9-point stencils (Figure 3 (1) and (2)).
[Figure 3 diagrams not reproduced: panel (1) shows the 5-point stencil and panel (2) the 9-point stencil.]

Fig. 3. 5- and 9-point Stencils

This section presents performance results of hand-coded out-of-core examples that use 5- and 9-point stencils. The experiments were performed for square real arrays of size 8K*8K (aggregate file size 256 Mbytes), distributed in BLOCK-BLOCK fashion over processors logically arranged as a square mesh. These experiments were performed using 16 and 64 nodes of an Intel Paragon.

Table 3 presents performance results for column and square tiles. In each experiment, the amount of time required to read and write local data, LIO, and the time required for performing communication, COMM, were measured for unordered and ordered (after placing the I/O and communication calls) access patterns and the communication gain was computed. Each table presents LIO and COMM for 5- and 9-point stencils with different processor grids and different array sizes. Since the local computation time is negligible compared to LIO, we have not reported the computation cost. Each experiment was performed for a memory ratio of 1/4 (i.e., the ratio of the size of available memory to that of the out-of-core array). Note that for the unordered cases, COMM includes the cost of inter-processor communication and extra file I/O.

From Table 3, we can observe that by reordering communication and I/O calls, the communication cost COMM is significantly reduced. For example, for a 9-point stencil problem running on 64 processors using an 8K*8K array and column tiles, COMM without ordering is 2.06 seconds and with ordering is 0.05 seconds (therefore, the communication gain is 39). For the same problem, if square tiles are used, the communication gain is 35992. This increase in the gain is due to the additional I/O cost incurred in accessing square tiles.

6 Conclusions

In this paper, we described a framework for optimizing communication and I/O costs in out-of-core problems. We focused on communication and I/O optimization within a FORALL construct.
We showed that existing frameworks do not extend directly to out-of-core problems and cannot exploit the FORALL semantics. We presented a unified framework for the placement of I/O and communication calls and applied it to optimize communication for stencil applications. Using the experimental results, we demonstrated that correct placement of I/O and communication calls can completely eliminate extra file I/O from communication and, as a result, significant performance improvement can be obtained.
Table 3. Performance of the 5- and 9-point stencils. Time in seconds.

                    Unordered            Ordered          Comm Gain
Memory   Procs.   COMM      LIO       COMM     LIO
ratio             (a)       (b)       (c)      (d)       e = (a/c)

5-point Stencil, Column Tiles, 8K*8K Array
1/4      16         1.38      7.57    0.04       6.71       35.38
1/4      64         1.16     10.87    0.05       9.80       25.21

9-point Stencil, Column Tiles, 8K*8K Array
1/4      16         1.27      7.70    0.03       7.05       38.48
1/4      64         2.06     10.20    0.05      10.15       39.84

5-point Stencil, Square Tiles, 8K*8K Array
1/4      16       175.11    183.96    0.03     178.14     5506.60
1/4      64       192.2     197.57    0.005    196.40    36061.79

9-point Stencil, Square Tiles, 8K*8K Array
1/4      16       150.78    175.88    0.03     182.01     4569.09
1/4      64       192.2     197.57    0.005    196.40    35992.51

Acknowledgments

The work of R. Bordawekar and A. Choudhary was supported in part by NSF Young Investigator Award CCR-9357840, grants from Intel SSD and in part by the Scalable I/O Initiative, contract number DABT63-94-C-0049 from Advanced Research Projects Agency (ARPA) administered by US Army at Fort Huachuca. R. Bordawekar is also supported by a Syracuse University Graduate Fellowship. The work of J. Ramanujam was supported in part by an NSF Young Investigator Award CCR-9457768, an NSF grant CCR-9210422 and by the Louisiana Board of Regents through contract LEQSF(1991-94)-RD-A-09. This work was performed in part using the Intel Paragon System operated by Caltech on behalf of the Center for Advanced Computing Research (CACR). Access to this facility was provided by CRPC.

References

[Bor96] Rajesh Bordawekar. Techniques for Compiling I/O Intensive Parallel Programs. PhD thesis, Electrical and Computer Engineering Dept., Syracuse University, April 1996.
[GSS95] Manish Gupta, Edith Schonberg, and Harini Srinivasan. A Unified Framework for Optimizing Communication in Data-Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, 1995.
[Hig93] High Performance Fortran Forum. High Performance Fortran Language Specification. Scientific Programming, 2(1-2):1-170, 1993.
[KN94] Ken Kennedy and Nenad Nedeljkovic. Combining Dependence and Data-Flow Analyses to Optimize Communication. Technical Report CRPC-TR94484-S, CRPC, Rice University, September 1994.
[KS95] Ken Kennedy and Ajay Sethi. A Constraint Based Communication Placement Framework. Technical Report CRPC-TR95515-S, CRPC, Rice University, February 1995. Revised May 1995.
