Access to streams in multiprocessor systems by Valero Cortés, Mateo et al.
Access to Streams in Multiprocessor Systems 
Mateo Valero, Montse Peiron and Eduard Ayguadé
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya
Gran Capità s/n, Mòdul D4, 08034 - Barcelona (Spain)
Abstract 
When accessing streams in vector multiprocessor machines,
degradation in the interconnection network and conflicts in the
memory modules are the factors that reduce the efficiency of the
system. In this paper we present a synchronous access mechanism
that allows conflict-free access to streams in a SIMD vector
multiprocessor system. Each processor accesses the corresponding
elements out of order in such a way that in each cycle the requested
elements do not collide in the interconnection network. Moreover,
memory modules are accessed so that conflicts are avoided.
The use of the proposed mechanism in present architectures would
allow conflict-free access to streams with the most common strides
that appear in real applications. The additional hardware is
described and shown to be of similar complexity to that required
for access in order.
1. Introduction 
The access to streams in vector multiprocessors is one 
of the factors that most reduces the efficiency of the system 
(by the term stream we denote a finite length succession of 
elements whose addresses are equally spaced). 
Degradation in the interconnection network and conflicts in 
the access to memory modules lead to this loss of 
efficiency. The problem is hard to solve and in fact is one 
of the factors that limit the scalability of these systems. 
In order to increase the efficiency of the access, several
techniques have been presented in the literature, such as the
proposal of storage schemes other than interleaving
(skewing [1] and linear transformations [2]), the increase of
the number of memory modules, and the use of buffers in
the interconnection network and at the input and output of
each memory module [3,4].
The performance evaluation of the memory system is
difficult to do using mathematical models (some works in
this direction are [5,6,7]). Other authors use simulation
methods based on real or synthetic traces to evaluate the
system [8]. Measurements in real systems are done in [9].
1066-6192/92 $3.00 © 1992 IEEE
In [10] a technique to access streams in multiprocessor
systems is presented. The basic idea is that processors
access different slices of a stream in a synchronized way.
Data are unscrambled from memory to processors avoiding
conflicts in the interconnection network and memory
modules. In order to use this technique, all the addresses of
the vector elements have to be precalculated before starting
the access.
With the aim of improving the access to streams in
vector uniprocessors, an out-of-order access to the
elements of a stream was proposed in [11]. It considers a
bus-based system. The basic idea is to request stream
elements in an order such that memory conflicts are
avoided. This technique, applied to real systems, leads to
conflict-free access to streams with the usual values of
strides. In [12] the same idea is applied to access, in a
conflict-free way, streams with power-of-two strides.
In this paper we use the techniques presented in [11] to
efficiently solve the problem stated in [10]. The method
described here allows the same number of conflict-free
families but it does not need to precalculate the addresses
before starting the access; the access is performed in a
conflict-free way without any additional latency. In exchange,
we need two address generators and additional control. The
cost of the hardware is shown to be of similar complexity to
that required for access in order.
2. Model architecture 
Our work is based on a vector multiprocessor SIMD
architecture. Figure 1 shows the structure of the system: it
is composed of $P = 2^p$ vector processors and $M = 2^m$
memory modules grouped in $2^s$ sections, so there are $2^{m-s}$
modules per section. Processors are connected to sections
through a $2^p$-input, $2^s$-output multistage interconnection
network. The memory latency is $T = 2^t$ processor cycles. We
assume $s = p$ and an Omega interconnection network [13].
The memory system is matched, i.e., $M = P \cdot T$ (so $m = s + t$).
Figure 1: Structure of the SIMD multiprocessor system. Processors 0 to $2^p-1$ are connected through the interconnection network to sections 0 to $2^s-1$.
In a SIMD system, each processor requests one element
per cycle unless it has to wait due to collisions in the
interconnection network or in the memory modules. The
length of the vector registers of each processor is $L = 2^\lambda$;
we assume $\lambda \ge t$. The first element of the stream has address
$A_0$ and consecutive elements are separated by a constant
value $S$ (the stride) so that the $i$-th element has address
$A_0 + S \cdot (i-1)$. As done in [14], we classify the strides in
families defined by $x$ so that all strides $\sigma \cdot 2^x$ with $\sigma$ odd
belong to the same family. The length of the stream is $L_1$
with $L_1 = P \cdot L = 2^{\lambda+s} = 2^{\lambda_1}$ (since $\lambda \ge t$, $L_1 = k \cdot M$ for
some $k > 0$).
Because the memory is organized in several sections
and each section in several modules, an address mapping is
required which transforms the physical address $A$ with
binary representation $a_{n-1} a_{n-2} \ldots a_1 a_0$ into a tuple (section,
supermodule, displacement). The term supermodule refers
to the module number within a section. In this paper the
address mapping is done using the block-interleaved
storage scheme shown in figure 2. A field of $t$ bits located
at position $c_0$ specifies the supermodule that is accessed
within a section; a field of $s$ bits located at position $c_1$
($c_1 = c_0 + t$) specifies the section. The rest of the bits indicate
the displacement in the module. In the next section we explain
how we determine the values for $c_1$ and $c_0$.
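The address mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name `map_address` is hypothetical, and the way the remaining bits are packed into the displacement (high bits placed above the low $c_0$ bits) is an assumption, since the text only says that the rest of the bits form the displacement.

```python
def map_address(addr, c0, t, s):
    """Split a physical address into (section, supermodule, displacement)."""
    c1 = c0 + t                                   # section field sits right above the supermodule field
    supermodule = (addr >> c0) & ((1 << t) - 1)   # t bits located at position c0
    section = (addr >> c1) & ((1 << s) - 1)       # s bits located at position c1
    low = addr & ((1 << c0) - 1)                  # bits below the supermodule field
    high = addr >> (c1 + s)                       # bits above the section field
    displacement = (high << c0) | low             # assumed packing of the remaining bits
    return section, supermodule, displacement
```

With the parameters used later in the paper ($c_0 = 3$, $t = 2$, $s = 3$), `map_address(74, 3, 2, 3)` places address 74 in section 2, supermodule 1.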
Figure 2: Address mapping used in this paper (section field: $s$ bits at position $c_1$; supermodule field: $t$ bits at position $c_0$).
Figure 3 shows the mapping of addresses to memory
modules when $m=5$, $s=3$, $t=2$, $c_0=3$ and $c_1=5$. It shows the
mapping of the first 256 addresses. Each number $i$
represents a block of 8 elements (elements from $i$ to $i+7$).
We now define the spatial and temporal distributions of
the elements of a stream, since they determine if the access
can be done in a conflict-free way. The SPATIAL
distribution of a stream in the multi-module memory is the
M-tuple SD, where SD(i) is the number of stream elements
in module i. The TEMPORAL distribution of a stream is
the sequence of tuples $(m_1, \ldots, m_L)$ where $m_i$ is formed by
the P memory-module numbers accessed by the processors
in the i-th request. Note that the elements of a stream can be
requested in any order. The CANONICAL temporal
distribution is the temporal distribution when the i-th
element of the stream is accessed by processor ((i-1) mod
P) and each processor requests the elements in order.
section:          0    1    2    3    4    5    6    7
supermodule 0:    0   32   64   96  128  160  192  224
supermodule 1:    8   40   72  104  136  168  200  232
supermodule 2:   16   48   80  112  144  176  208  240
supermodule 3:   24   56   88  120  152  184  216  248
Figure 3: Mapping of the first 256 addresses to memory modules ($c_1=5$, $c_0=3$).
3. Conditions for a conflict-free access
In the synchronous model that we are considering, P
requests are sent to memory in each cycle; the following
conditions are required for a minimum-latency
(conflict-free) access:
1.- P simultaneous requests must not collide in the
interconnection network (if they do not collide,
they will go to different sections and hence to
different modules).
2.- Consecutive accesses to a memory module have
to be separated by T cycles.
To fulfil condition 1 we use the properties of the Omega
interconnection network. As proven in [13], a set of
input-output connections of the form
$((a \cdot x + b) \bmod N,\ (c \cdot x + d) \bmod N)$
does not collide in an $N \times N$ Omega network if
$\gcd(a, N) \le \gcd(c, N)$, for $0 \le x < \delta$ with $\delta \le N/\gcd(c, N)$.
The values $a = 1$, $b = 0$, $c$ odd and any $d$ lead to a
non-colliding connection pattern; this corresponds to a pattern
where the output desired by input $i$ is the output desired by
input $i-1$ plus an odd value modulo $2^s$. Such a pattern is the
one we will apply.
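The non-colliding property of this pattern can be checked with a small routing model. The sketch below assumes the usual destination-tag routing of an Omega network (a perfect shuffle followed by a column of 2x2 switches at each stage); the function names `omega_collides` and `linear_pattern` are hypothetical and not taken from [13].

```python
def omega_collides(pairs, n):
    """Return True if two (input, output) connections need the same line
    at some stage of an N x N Omega network, with N = 2**n."""
    mask = (1 << n) - 1
    for stage in range(1, n + 1):
        seen = set()
        for src, dst in pairs:
            # After `stage` shuffle-exchange steps, a path occupies the line whose
            # label is the low (n - stage) bits of src followed by the high
            # `stage` bits of dst (destination-tag routing).
            line = ((src << stage) | (dst >> (n - stage))) & mask
            if line in seen:
                return True
            seen.add(line)
    return False


def linear_pattern(n, c, d):
    """Connections (x, (c*x + d) mod N) for every input x, i.e. a = 1 and b = 0."""
    size = 1 << n
    return [(x, (c * x + d) % size) for x in range(size)]
```

For N = 8, `omega_collides(linear_pattern(3, 3, 0), 3)` returns `False`, while eight requests that target only two sections, as happens when the stream is accessed in order, do collide.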
Condition 2 is equivalent to stating that $P \cdot T$ consecutive
requests must visit $P \cdot T = M$ different memory modules.
Since $L_1 = k \cdot M$, a necessary condition for this is that all
memory modules contain the same number of stream elements;
we say that the spatial distribution of a stream is BALANCED
if it satisfies this condition. Because of this, to determine
conditions for a conflict-free temporal distribution, we first
determine conditions for a balanced stream and then consider
access orderings so that condition 2 is fulfilled.
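The spatial distribution and the balance test follow directly from these definitions. In the sketch below the module number is taken to be the $m$ address bits starting at position $c_0$ (section bits followed by supermodule bits); the function names are hypothetical.

```python
from collections import Counter


def spatial_distribution(a0, stride, length, c0, m):
    """SD(i): number of stream elements stored in module i,
    where the module number is bits c0+m-1..c0 of the address."""
    mask = (1 << m) - 1
    counts = Counter(((a0 + k * stride) >> c0) & mask for k in range(length))
    return [counts[i] for i in range(1 << m)]


def is_balanced(a0, stride, length, c0, m):
    """A stream is balanced when every module holds the same number of elements."""
    return len(set(spatial_distribution(a0, stride, length, c0, m))) == 1
```

For example, with $m = 5$ and $c_0 = 3$, a 256-element stream with stride 6 starting at address 74 places 8 elements in each of the 32 modules, whereas a stride of 16 leaves half of the modules empty.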
Note that the canonical temporal distribution for a
stream with $x = c_1$ fulfils condition 1. In this case, $2^s$
consecutive elements of the stream do not collide in the
interconnection network because they have different values
in bits $c_1+s-1..c_1$ and the difference in value of these bits
for two consecutive elements is $(\sigma \bmod 2^s)$. On the other
hand, only streams with a stride of the family $x = c_0$ fulfil
condition 2 when accessed with their canonical temporal
distribution. Since $c_1 = c_0 + t$, no family satisfies both
conditions when $t > 0$, so it is not possible to obtain a
minimum-latency access when a stream is accessed with its
canonical temporal distribution with the proposed address
mapping.
Lemma: For the address mapping considered, a stream
with any initial address $A_0$, length $L_1 = 2^{\lambda_1}$ and stride
$S = \sigma \cdot 2^x$ is balanced if and only if $x \le c_0$ and
$\lambda_1 \ge c_0 + m - x$.
Proof:
Necessary condition: to have a balanced stream it is
necessary that all memory modules are visited by the
stream. This requires that the addresses of the elements of
the stream have different combinations in bits $c_0+m-1..c_0$.
If $x > c_0$ this is not true since bits $x-1..c_0$ remain unchanged.
If $\lambda_1 < c_0 + m - x$, then there exists at least one instance of
a stream that is non-balanced (e.g., a stream with initial
address $A_0 = 0$ and stride $1 \cdot 2^x$ has its elements not mapped
in all the memory modules).
Sufficient condition: if $x \le c_0$ and $\lambda_1 = c_0 + m - x + \alpha$ (for
some $\alpha \ge 0$), the elements of the stream have addresses
$A_0 + k \cdot \sigma \cdot 2^x$, $k = 0, 1, \ldots, 2^{\lambda_1}-1$.
Since $\sigma$ is odd, the values $k \cdot \sigma \cdot 2^x$ make $2^{\lambda_1}$ different
combinations appear in the bits $\lambda_1+x-1..x$; therefore, every
combination of the bits $c_0+m-1..c_0$ appears $2^{c_0-x+\alpha}$ times; as
a consequence, the stream is balanced. □
From this property, making $c_0 \le \lambda_1 - m + x$ we obtain
balanced streams of length $2^{\lambda_1}$ for $x \le c_0$. It is desirable to
obtain conflict-free access to streams with strides of the
family $x = 0$ (odd strides), since they are the most frequent
in real programs; so, we will force $c_0 \le \lambda_1 - m$. Making
$c_0 = \lambda_1 - m$ (and therefore $c_1 = \lambda$) we obtain the largest set
of families that produce balanced streams.
Up to now we have shown the conditions for a stream to
be balanced. However, the access in order can lead to a
high latency because of an unsuitable canonical temporal
distribution. For instance, consider again the example
shown in figure 3 and an access with stride $S = 6$ (family $x = 1$
and $\sigma = 3$) and $A_0 = 74$. If the elements are accessed in order,
the following succession of supermodules and sections is
produced:
         P0  P1  P2  P3  P4  P5  P6  P7
sm.:      1   2   2   3   0   1   1   2
sec.:     2   2   2   2   3   3   3   3
sm.:      3   0   0   1   2   3   3   0
sec.:     3   4   4   4   4   4   4   5
...
Observe that conflicts are produced both at the 
interconnection network and at the memory modules. In the 
next section we show how to reorder the access of the 
stream elements so as to achieve a better temporal 
distribution. 
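The in-order example can be reproduced with a short script. It is a sketch of the canonical temporal distribution only: element numbers are 0-based, so element $e$ has address $A_0 + S \cdot e$ and is requested by processor $e \bmod P$; the function name `canonical_requests` is hypothetical.

```python
def canonical_requests(a0, stride, num_procs, cycles, c0, t, s):
    """For each cycle, the supermodules and sections requested by P0..P(P-1)
    when the stream is accessed in order (canonical temporal distribution)."""
    rows = []
    for cycle in range(cycles):
        # In cycle `cycle`, processor i requests element cycle*P + i.
        addrs = [a0 + (cycle * num_procs + i) * stride for i in range(num_procs)]
        sm = [(a >> c0) & ((1 << t) - 1) for a in addrs]          # supermodule field
        sec = [(a >> (c0 + t)) & ((1 << s) - 1) for a in addrs]   # section field
        rows.append((sm, sec))
    return rows
```

Calling `canonical_requests(74, 6, 8, 2, 3, 2, 3)` gives, for the first cycle, supermodules `[1, 2, 2, 3, 0, 1, 1, 2]` and sections `[2, 2, 2, 2, 3, 3, 3, 3]`: only two sections are targeted by eight processors, which is the network conflict observed above.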
4. Access method 
To avoid collisions in the interconnection network we
force that, in every cycle, if processor $P_i$ requests an
element with address $A$, processor $P_{i+1}$ will request an
element with address $A + \sigma \cdot 2^{c_1}$ for all $i = 0 \ldots 2^s-2$; in this
way, while processor $P_i$ is accessing section
$j = \lfloor A/2^{c_1} \rfloor \bmod 2^s$, processor $P_{i+1}$ is accessing section
$(j+\sigma) \bmod 2^s$ and condition 1 is fulfilled. To guarantee
condition 2, $T = 2^{m-s}$ consecutive requests of each processor
must be mapped into different modules, i.e., the bits
$c_0+t-1..c_0$ must be different.
The mechanism to achieve these properties is as
follows:
1) Split the stream in $2^x$ subvectors of $2^{\lambda_1-x}$ consecutive
elements each. The first element of subvector $i$ is
$v_i = i \cdot 2^{\lambda_1-x}$.
2) Partition each subvector in $2^s$ slices of $2^{c_1-x}$
consecutive elements each. Slices are assigned to processors
so that the $i$-th slice of each subvector goes to processor $P_i$.
Each processor will access its slices sequentially.
The addresses of the $q$-th elements of any consecutive
slices differ by $\sigma \cdot 2^{c_1}$ for all $q$; hence, if all processors
follow the same ordering to access the elements of their
slices, there will never be collisions in the
interconnection network.
3) The elements in a slice are accessed in the following
way. Elements separated by $2^{c_0-x}$ form a sequence of
T elements that are mapped into different modules
because their addresses differ by $\sigma \cdot 2^{c_0}$. They are
requested consecutively, so conflicts are avoided at the
memory modules. Sequence $j$ of a slice contains the
elements at offsets $j,\ j + 2^{c_0-x},\ \ldots,\ j + (2^t-1) \cdot 2^{c_0-x}$
from the start of the slice. There are $2^{c_0-x}$ sequences
and they are accessed sequentially.
Figure 4 shows the elements of some sequences.
Figure 4: Accessing method. In sequence $j$ of subvector $k$, processor $P_i$ requests in its $q$-th cycle ($q = 0, \ldots, 2^t-1$) the element $v_k + i \cdot 2^{c_1-x} + j + q \cdot 2^{c_0-x}$.
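The three steps above can be sketched as a function that lists, for one processor, the element numbers in request order. This is an illustrative reading of the method rather than the authors' implementation; elements are numbered from 0, `lam1` stands for $\lambda_1$, and the function name is hypothetical.

```python
def access_order(i, x, c0, t, lam1):
    """Element numbers (0-based) requested by processor P_i, in request order."""
    c1 = c0 + t
    order = []
    for k in range(1 << x):                       # step 1: subvectors
        # step 2: first element of the slice of subvector k assigned to P_i
        base = k * (1 << (lam1 - x)) + i * (1 << (c1 - x))
        for j in range(1 << (c0 - x)):            # step 3: sequences of the slice
            for q in range(1 << t):               # T elements, 2^(c0-x) apart
                order.append(base + j + q * (1 << (c0 - x)))
    return order
```

With $x = 1$, $c_0 = 3$, $t = 2$ and $\lambda_1 = 8$, processor 0 starts with elements 0, 4, 8, 12, then 1, 5, 9, 13, and moves to element 128 after its first 16 requests; taken together, the eight processors request each of the 256 elements exactly once.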
Next we consider again the example of figure 3. If we
suppose $\lambda = 5$, then $c_1 = 5$ and $c_0 = 3$ is the best choice, as
explained in the previous section. A conflict-free access is
possible for streams with any initial address, length 256
($\lambda_1 = 8$) and stride $S = \sigma \cdot 2^x$ with $x = 0, 1, 2$ or $3$.
Let $x = 1$. The stream is split into two subvectors
(elements 0 to 127 and 128 to 255). Each subvector is
split again into 8 slices of 16 consecutive elements each.
Within each slice there are 4 sequences of 4 elements each.
The sequences and the access ordering are the following:
P0                P1                P2                P3
0,4,8,12          16,20,24,28       32,36,40,44       48,52,56,60
1,5,9,13          17,21,25,29       33,37,41,45       49,53,57,61
2,6,10,14         18,22,26,30       34,38,42,46       50,54,58,62
3,7,11,15         19,23,27,31       35,39,43,47       51,55,59,63
128,132,136,140   ...               ...               ...

P4                P5                P6                P7
64,68,72,76       80,84,88,92       96,100,104,108    112,116,120,124
65,69,73,77       81,85,89,93       97,101,105,109    113,117,121,125
66,70,74,78       82,86,90,94       98,102,106,110    114,118,122,126
67,71,75,79       83,87,91,95       99,103,107,111    115,119,123,127
192,196,200,204   ...               ...               ...
A stream with initial address $A_0 = 74$ and $\sigma = 3$, for
instance, causes the following succession of supermodules
and sections:
         P0        P1        P2        P3
sm.:     1,0,3,2   1,0,3,2   1,0,3,2   1,0,3,2
sec.:    2,3,3,4   5,6,6,7   0,1,1,2   3,4,4,5
sm.:     2,1,0,3   2,1,0,3   2,1,0,3   2,1,0,3
sec.:    2,3,4,4   5,6,7,7   0,1,2,2   3,4,5,5
...

         P4        P5        P6        P7
sm.:     1,0,3,2   1,0,3,2   1,0,3,2   1,0,3,2
sec.:    6,7,7,0   1,2,2,3   4,5,5,6   7,0,0,1
sm.:     2,1,0,3   2,1,0,3   2,1,0,3   2,1,0,3
sec.:    6,7,0,0   1,2,3,3   4,5,6,6   7,0,1,1
...
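Condition 1 can be checked on this example by listing the sections requested in every cycle. The sketch below assumes the out-of-order schedule described in this section, with 0-based element numbers and `lam1` standing for $\lambda_1$; the function name is hypothetical.

```python
def sections_per_cycle(a0, sigma, x, c0, t, s, lam1):
    """For each cycle, the section requested by each of the 2^s processors."""
    c1, num_procs = c0 + t, 1 << s
    cycles = []
    for k in range(1 << x):                       # subvectors
        for j in range(1 << (c0 - x)):            # sequences
            for q in range(1 << t):               # requests of one sequence
                row = []
                for i in range(num_procs):
                    elem = (k << (lam1 - x)) + (i << (c1 - x)) + j + (q << (c0 - x))
                    addr = a0 + elem * sigma * (1 << x)
                    row.append((addr >> c1) & (num_procs - 1))   # section field
                cycles.append(row)
    return cycles
```

For $A_0 = 74$ and $\sigma = 3$ the first cycle yields sections `[2, 5, 0, 3, 6, 1, 4, 7]`, and every one of the 32 cycles is a permutation of the eight sections, so no two processors target the same section in the same cycle.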
The control to perform the requests in the order
proposed in each processor $P_i$ is as follows:
A_aux = A = A_0 + i*σ*2^c1                ; initial address
for K = 1 to 2^x                          ; subvectors
    for J = 1 to 2^(c0-x)                 ; sequences
        for I = 2 to 2^t                  ; access a sequence
            A = A + σ*2^c0
        end for
        A_aux = A = A_aux + σ*2^x         ; next sequence
    end for
    A_aux = A = A_aux + σ*(2^λ1 - 2^c0)   ; next subvector
end for
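The control loop can be transcribed into Python and compared against the closed-form element positions. This is a sketch under one stated assumption: the auxiliary address is advanced by $\sigma \cdot 2^x$ after every sequence, including the last one of a subvector, so that the final step of $\sigma \cdot (2^{\lambda_1} - 2^{c_0})$ lands exactly on the first element of the next subvector. The names `a_aux`, `lam1` and both function names are illustrative.

```python
def addresses_incremental(i, a0, sigma, x, c0, t, lam1):
    """Addresses requested by processor P_i, generated with additions only."""
    c1 = c0 + t
    a_aux = a = a0 + i * sigma * (1 << c1)           # first element of the first slice
    out = []
    for _k in range(1 << x):                          # subvectors
        for _j in range(1 << (c0 - x)):               # sequences
            out.append(a)
            for _ in range((1 << t) - 1):             # remaining T-1 requests
                a += sigma * (1 << c0)
                out.append(a)
            a_aux = a = a_aux + sigma * (1 << x)      # start of the next sequence
        a_aux = a = a_aux + sigma * ((1 << lam1) - (1 << c0))   # next subvector
    return out


def addresses_direct(i, a0, sigma, x, c0, t, lam1):
    """Same addresses, computed from the element position of each request."""
    c1 = c0 + t
    return [a0 + sigma * (1 << x) * ((k << (lam1 - x)) + (i << (c1 - x)) + j + (q << (c0 - x)))
            for k in range(1 << x)
            for j in range(1 << (c0 - x))
            for q in range(1 << t)]
```

Both functions agree for the running example ($A_0 = 74$, $\sigma = 3$, $x = 1$, $c_0 = 3$, $t = 2$, $\lambda_1 = 8$); the 17th request of processor 0 is address 842, that is, element 128.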
Figure 5 shows the hardware required to generate the 
addresses as explained in this section. 
To simplify the implementation we have considered that the
compiler generates instructions to load the values
$A_0 + i \cdot \sigma \cdot 2^{c_1}$ (initial address of the first slice accessed by
processor $i$), and $\sigma \cdot 2^x$, $\sigma \cdot 2^{c_0}$, $\sigma \cdot (2^{\lambda_1} - 2^{c_0})$ and $2^{c_0-x}$ as well.
Notice that in the special case of $L_1 = M$, each processor
accesses only one slice of T consecutive elements of the
stream, instead of accessing several discontinuous slices.
The address mapping proposed leads to $c_0 = 0$, and
therefore conflict-free access is obtained just for vectors
with stride belonging to the family $x = 0$. There is only one
sequence in each slice, containing the following elements:
$P_0$:          $0$,  $1$,  ...,  $2^{c_1} - 1$
$P_1$:          $2^{c_1-x} = 2^{c_1}$,  $1 + 2^{c_1}$,  ...,  $2^{c_1} - 1 + 2^{c_1}$
$P_{2^s-1}$:    $(2^s-1) \cdot 2^{c_1}$,  $1 + (2^s-1) \cdot 2^{c_1}$,  ...,  $2^{c_1} - 1 + (2^s-1) \cdot 2^{c_1}$
However, the same result can be obtained for any $x > 0$ by
forcing $c_0 = x$. The idea of using a dynamic storage scheme
presented in [14] for the uniprocessor system case could be
applied to obtain a conflict-free access to streams with a
stride belonging to any single family.
5. Additional reordering 
The previous example shows that although one
sequence can be accessed in a conflict-free manner, there
may be conflicts between different sequences. For instance,
in the 5th cycle $P_1$ requests an element located in the same
module as the one requested by $P_3$ in the 4th cycle, which is
still occupied.
Figure 5: Hardware for address calculation.
For a global conflict-free access, the supermodule
visited by sequence $k$ in its $q$-th cycle must be the same as
the one visited by sequence $k+1$ in its $q$-th cycle; so far, our
method does not guarantee the fulfillment of this
condition. In the example considered above, for instance,
conflicts would be avoided if the elements of each
sequence were sent in the supermodule order (1,0,3,2),
defined by the first sequence.
To achieve this, we propose to decouple the calculation
of the addresses from the actual requests. This is achieved
by precalculating the addresses of sequence $i+1$ while
accessing sequence $i$. Consequently, during the first $2^t$
cycles, it is necessary to calculate the addresses of the first
sequence (which are used immediately for memory access)
and of the second sequence (which are stored in a set of
latches for access as the next sequence). After that, for each
sequence, the addresses for access are obtained from the
latches and the addresses of the next sequence are
precalculated and stored. Consequently, as shown in figure 6,
two address generators are needed, although one of them is
only used in the first $2^t$ cycles. Moreover, it is necessary to
store the order of supermodules accessed by the first
sequence, which is used to control the order of the requests
of the following sequences.
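The effect of this reordering can be illustrated by simulating the request schedule and checking both conditions. The sketch below models only the order of the requests, not the latches or the two address generators; `schedule` and `conflict_free` are hypothetical names, and a module is assumed to stay busy for T cycles after each access.

```python
def schedule(a0, sigma, x, c0, t, s, lam1, reorder):
    """Per cycle, the (section, supermodule) pair requested by each processor."""
    c1, num_procs, T = c0 + t, 1 << s, 1 << t
    cycles, first_order = [], None
    for k in range(1 << x):
        for j in range(1 << (c0 - x)):
            seq = []                                  # seq[q][i]: request of P_i at step q
            for q in range(T):
                row = []
                for i in range(num_procs):
                    elem = (k << (lam1 - x)) + (i << (c1 - x)) + j + (q << (c0 - x))
                    addr = a0 + elem * sigma * (1 << x)
                    row.append(((addr >> c1) & (num_procs - 1), (addr >> c0) & (T - 1)))
                seq.append(row)
            if first_order is None:
                first_order = [row[0][1] for row in seq]    # supermodule order of sequence 1
            if reorder:
                by_sm = {row[0][1]: row for row in seq}     # all processors share the supermodule
                seq = [by_sm[sm] for sm in first_order]
            cycles.extend(seq)
    return cycles


def conflict_free(cycles, T):
    """True if no module is requested twice in a cycle or again within T cycles."""
    last_used = {}
    for cycle, row in enumerate(cycles):
        if len(set(row)) < len(row):
            return False
        for module in row:
            if module in last_used and cycle - last_used[module] < T:
                return False
            last_used[module] = cycle
    return True
```

For the running example ($A_0 = 74$, $\sigma = 3$, $x = 1$), the schedule without reordering is rejected because of the conflict between $P_1$ and $P_3$ described above, while the schedule that follows the supermodule order of the first sequence passes the check.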
Figure 6: Architecture model for out-of-order memory accesses.
6. Conclusions 
When accessing streams in vector multiprocessor
machines, degradation in the interconnection network and
conflicts in the memory modules are the factors that reduce
the efficiency of the system. In this paper we have
presented a synchronous access mechanism that allows
conflict-free access to streams in a SIMD system. The work
is a continuation of the method presented in [11], where an
out-of-order access to the elements of a stream in a
bus-based vector uniprocessor was proposed. The basic idea
was to request stream elements in such an order that
memory conflicts are avoided.
Each processor accesses the corresponding elements
out of order in such a way that in each cycle the requested
elements do not collide in the interconnection network.
Moreover, memory modules are accessed in a way that
avoids conflicts. The same problem was tackled in [10],
although the solution in that case needed to precalculate the
addresses before starting the access; our solution to the
problem avoids this precalculation so the access is
performed in a conflict-free way without any additional
latency. In exchange, we need two address generators
and additional control.
The use of the proposed mechanism in present
architectures would allow conflict-free access to streams
with the most common strides that appear in real
applications. The additional hardware has been described
and shown to be of similar complexity to that required for
access in order.
We plan to extend this work so that each processor
accesses a slice of L consecutive elements of the stream. With
the aim of increasing the number of conflict-free families
we also plan to extend this work to the unmatched-memory
case.
7. Acknowledgements 
This work has been supported by the ESPRIT Basic 
Research Action 6634 APPARC, the Ministry of Education 
of Spain under contract TIC-880/92, and by the CEPBA 
(European Center for Parallelism of Barcelona). 
8. References
1. P. Budnik and D. J. Kuck: “The Organization and Use of
Parallel Memories”, IEEE Trans. on Computers, vol. C-20,
no. 12, pp. 1566-1569, Dec. 1971.
2. J. Frailong, W. Jalby and J. Lenfant: “XOR-schemes: A
Flexible Data Organization in Parallel Memories”, Int. Conf.
on Parallel Processing, pp. 276-283, 1985.
3. M. Dubois et al.: “Memory Access Buffering in
Multiprocessors”, Int. Symp. on Computer Architecture,
pp. 422-434, 1988.
4. K. A. Robbins and S. Robbins: “Bus Conflicts for Logical
Memory Banks on a Cray Y-MP type Processor System”,
Int. Conf. on Parallel Processing, pp. 1.21-1.24, 1991.
5. W. Oed and O. Lange: “On the Effective Bandwidth of
Interleaved Memories in Vector Processing Systems”, IEEE
Trans. on Computers, vol. C-34, no. 10, pp. 949-957,
October 1985.
6. D. H. Bailey: “Vector Computer Memory Bank Contention”,
IEEE Trans. on Computers, vol. C-36, no. 3, pp. 293-298,
March 1987.
7. I. Y. Bucher and D. A. Calahan: “Models of Access Delays
in Multiprocessor Memories”, IEEE Trans. on Parallel and
Distributed Systems, vol. 3, no. 3, May 1992.
8. T. Cheung and J. E. Smith: “A Simulation Study of the Cray
X-MP Memory System”, IEEE Trans. on Computers, vol. C-35,
no. 7, pp. 613-622, July 1986.
9. J. E. Smith and W. R. Taylor: “Characterizing Memory
Performance in Vector Multiprocessors”, Int. Conf. on
Supercomputing, pp. 35-44, 1992.
10. A. Seznec and J. Lenfant: “Interleaved Parallel Schemes:
Improving Memory Throughput on Supercomputers”, Int.
Symp. on Computer Architecture, pp. 246-255, 1992.
11. M. Valero, T. Lang, J. M. Llaberia, M. Peiron, E. Ayguadé
and J. J. Navarro: “Increasing the Number of Strides for
Conflict-Free Vector Access”, Int. Symp. on Computer
Architecture, pp. 372-381, 1992.
12. M. Valero, T. Lang and E. Ayguadé: “Conflict-Free Access
of Vectors with Power-of-Two Strides”, Int. Conf. on
Supercomputing, 1992. To be published.
13. D. H. Lawrie: “Access and Alignment of Data in an Array
Processor”, IEEE Trans. on Computers, vol. C-24, no. 12,
pp. 1145-1155, Dec. 1975.
14. D. T. Harper III and D. A. Linebarger: “Conflict-Free Vector
Access Using a Dynamic Storage Scheme”, IEEE Trans. on
Computers, vol. C-40, no. 3, pp. 276-283, March 1991.
