Synchronized access to streams in multiprocessors by Peiron Guàrdia, Montse et al.
SYNCHRONIZED ACCESS TO STREAMS IN
MULTIPROCESSORS
Montse Peiron, Mateo Valero, Eduard Ayguadé and Tomás Lang∗
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
Gran Capità s/n, Mòdul D4, 08034 - Barcelona (Spain)
Phone: + 34 - 3 - 401 71 88, Fax: + 34 - 3 - 401 70 55
email: montse@ac.upc.es
∗Department of Electrical and Computer Engineering, University of California at Irvine
Abstract
The synchronized and simultaneous access to several
vectors that form a single stream occurs in SIMD vector
multiprocessors as well as in MIMD superscalar
multiprocessors with decoupled access. In this paper we
propose a block-interleaved storage scheme and an out-of-order
access mechanism that allow conflict-free access to
streams with an arbitrary initial address and constant stride
between elements. A maximal number of conflict-free
families including the most commonly used strides can be
obtained. We consider the use of a crossbar interconnection
network, although the method also applies to the case of a
multistage interconnection network.
Keywords
SIMD Vector multiprocessors, Multi-module memories,
Vectors with constant stride, Conflict-free access.
1.- Introduction
To provide sufficient memory bandwidth, the memory of
vector processors is organized as several modules that can
be accessed simultaneously. For a system with P ports,
achieving the maximum throughput of P accesses per
processor cycle requires at least P·T memory modules,
where T is the latency of each memory module (the ratio
between the memory cycle and the processor cycle). In this
context, a memory system is matched when it is composed
of exactly M = P·T memory modules and unmatched when
M > P·T.
To achieve this maximum throughput, the stream must be
accessed so that no memory conflicts occur. This problem
has been extensively studied for vector uniprocessor
systems with a single memory port (P = 1). As summarized
in [1], storage schemes have been proposed to produce
either conflict-free access to vectors with some strides or
minimum average latency for uniform distribution of
strides. In this latter case, buffers can be added to achieve
high throughput for long vectors. Of particular interest to
this paper is the scheme proposed in [2], which increases
the number of conflict-free strides by accessing the vector
elements out of order.
In this work we extend the previous results to the case in
which there are P ports (P processors with one port each,
or fewer processors with several ports per processor). We
consider the special case in which a single stream is
divided equally among the ports and the accesses of these
ports are synchronized, so that each port requests one
element per cycle. This mode of operation is reasonable in
SIMD vector multiprocessors or in MIMD systems with
decoupled access, in which the data can be accessed in this
regular and synchronized manner but then used differently
(for example, in scalar form). In [3] a storage scheme
called Interleaved Parallel Scheme (IPS) is proposed,
together with an out-of-order access. This allows
conflict-free access to streams with the most frequently
used strides [4] when the interconnection network is a
universal multistage network (a Benes network). Each
processor needs to precalculate the addresses of all its
vector elements, and then the processors send their
requests in a synchronized manner.
We present here an alternative solution to this problem,
based on the techniques developed in [2]. We consider the
use of a crossbar interconnection network, although the
method also applies to the case of a multistage
interconnection network [5]. This technique allows
conflict-free access to the same number of strides as in [3].
However, it requires neither the precalculation of the
addresses of the stream nor the use of a Benes
interconnection network. We present the matched-memory
case; the extension to the unmatched-memory case is
discussed in [5]. The technique is presented using a
block-interleaved storage scheme, but the same results are
obtained using either skewing or a linear transformation.
The method requires two address generators so that a few
addresses can be calculated in advance. We estimate that
the additional hardware has a cost that is only a small
fraction of the cost of the processor and the memory system.
2.- Architectural model and conditions for a
conflict-free access.
We study the behavior of memory accesses in a
multiprocessor system with the structure shown in Figure
1. It is composed of P = 2^s ports and M = 2^m memory
modules grouped in 2^s sections, so there are 2^(m-s)
modules per section; the latency of the memory modules is
T = 2^t. The memory system is matched, i.e., m = s+t.
Figure 1: Structure of the system.
Ports are connected to sections through a 2^s-input,
2^s-output crossbar interconnection network, and the
modules in a section are connected by a single bus. This
organization allows one access per section to be initiated
in each processor cycle, as long as the requested module is
not busy with a previous request. The maximum achievable
throughput is then P accesses per processor cycle after the
initial transient state.
The ports are controlled by vector load/store instructions
and interface with the processors through vector registers
of length L = 2^λ.
As shown in Figure 2, port i accesses vector Vi, composed
of L consecutive elements of the stream. The stride of the
stream is S and the initial address is A0. As done in [6], we
classify the strides into families, where the family defined
by x is the set of strides S = σ·2^x with σ odd.
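The classification into families amounts to factoring out the powers of two from the stride. A minimal sketch follows; the function name stride_family is hypothetical and not part of the paper.

```python
def stride_family(stride: int) -> tuple[int, int]:
    """Return (sigma, x) such that stride == sigma * 2**x with sigma odd."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    sigma, x = stride, 0
    while sigma % 2 == 0:      # strip one factor of two per iteration
        sigma //= 2
        x += 1
    return sigma, x


for S in (1, 4, 12, 24):
    sigma, x = stride_family(S)
    print(f"S = {S:2d}: sigma = {sigma}, family x = {x}")
```

For instance, the stride S = 4 used in the examples of this paper belongs to family x = 2 with σ = 1.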
Figure 2: Vectors in a stream.
(Figure 1 shows ports 0 to 2^s-1 connected through a 2^s x 2^s
interconnection network to sections 0 to 2^s-1, each with
modules 0 to T-1. In Figure 2, stream element 0 is at
address A0, and vector Vi starts at stream element i·2^λ,
at address A0 + i·2^λ·S.)
The elements of a stream are distributed among the
memory modules in a way that depends on the storage
scheme used; we say that the storage scheme defines the
spatial distribution of a stream [2]. The term temporal
distribution of a stream refers to the order in which its
elements are requested.
We now determine necessary conditions on the spatial
distribution to allow conflict-free access of the stream.
In the multiprocessor architectural model described above,
the following two types of conflicts prevent conflict-free
access:
a) memory module conflicts, which occur when a request
arrives at a module while it is busy;
b) section conflicts, which occur when two simultaneous
requests are for the same section.
A necessary condition to avoid memory module conflicts is
that no module contains more than L1/(P·T) = L/T
elements of the stream, where L1 = P·L is the total length
of the stream. This is because a conflict-free access to the
whole stream takes L1/(P·T) memory cycles, which is not
possible if a module holds additional elements.
Similarly, to avoid section conflicts, each section has to
contain the same number of stream elements.
When a spatial distribution satisfies these two conditions,
we say that it is balanced. Consequently, our approach is to
use a storage scheme that produces balanced spatial
distributions for a large number of strides, including the
most frequent. Then, we look for conflict-free temporal
distributions for these cases.
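The two balance conditions can be restated as element counts. The following worked restatement assumes the matched memory of Section 2 (M = P·T modules in P = 2^s sections); the symbols n_m and n_sec, the number of stream elements stored in module m and in section sec, are notation introduced here for illustration only.

```latex
% A stream of P vectors of L elements each is accessed at P elements per cycle.
\[
  \frac{P \cdot L}{P} = L \ \text{processor cycles} \;=\; \frac{L}{T} \ \text{memory cycles}
\]
% A module serves at most one request per memory cycle, hence:
\[
  n_m \le \frac{L}{T} \quad \text{for every module } m,
  \qquad
  n_{sec} = \frac{P \cdot L}{2^{s}} = L \quad \text{for every section.}
\]
% The P*T modules must hold all P*L elements, so the bound is met with equality:
\[
  \sum_{m} n_m = P \cdot L = (P \cdot T)\,\frac{L}{T}
  \;\Longrightarrow\; n_m = \frac{L}{T} \ \text{for every } m .
\]
% Example with P = T = 4 and L = 32: n_m = 8 and n_sec = 32.
```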
3.- Address mapping and balanced streams
Since the memory is organized in several sections and each
section in several modules, an address mapping is required
which transforms the physical address A with binary
representation a_(n-1),...,a_0 into a tuple (section,
supermodule, displacement); the term supermodule refers to
the module number within a section.
Figure 3: Storage scheme used in this paper.
We use a block-interleaved storage scheme as address
mapping, as illustrated in Figure 3. For mapping purposes,
the address is divided into three fields as follows:
- the S-field of s bits, specifying the section;
- the M-field of t bits, specifying the supermodule;
- the rest of the bits of the address specify the
displacement inside the module.
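(Figure 3 layout: within the address A = a_(n-1)...a_0, the
M-field (supermodule number, t bits) starts at bit c0 and
the S-field (section number, s bits) starts at bit c1 and
ends just below bit c1+s.)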
The two fields are located as shown in Figure 3. Note that
the M-field and the S-field should not intersect (if they do
intersect, some modules will never be visited), so
c1 ≥ c0+t. The period of this transformation is 2^(c1+s).
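The field extraction can be sketched with shifts and masks. This is only an illustration: the function name map_address is hypothetical, the bit positions assume that the M-field occupies bits c0 to c0+t-1 and the S-field bits c1 to c1+s-1, and the way the remaining bits are packed into the displacement is one possible choice (the text only says "the rest of the bits"). The values s = t = 2, c0 = 3, c1 = 5 in the usage lines are chosen so that the period 2^(c1+s) is 128 and address 8 falls in supermodule 1 of section 0, as in Figure 5(b).

```python
def map_address(addr, s, t, c0, c1):
    """Split a physical address into (section, supermodule, displacement)."""
    if c1 < c0 + t:
        raise ValueError("S-field and M-field must not intersect (c1 >= c0 + t)")
    section = (addr >> c1) & ((1 << s) - 1)        # S-field: s bits from bit c1
    supermodule = (addr >> c0) & ((1 << t) - 1)    # M-field: t bits from bit c0
    # Displacement: all remaining bits, packed together (assumed packing).
    low = addr & ((1 << c0) - 1)                             # bits below the M-field
    mid = (addr >> (c0 + t)) & ((1 << (c1 - c0 - t)) - 1)    # bits between the fields
    high = addr >> (c1 + s)                                  # bits above the S-field
    displacement = low | (mid << c0) | (high << (c1 - t))
    return section, supermodule, displacement


for a in (0, 8, 36, 127, 128):
    print(a, map_address(a, s=2, t=2, c0=3, c1=5))
```

The (section, supermodule) pair repeats every 2^(c1+s) addresses, while the displacement keeps addresses that share a module distinct.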
Lemma
For the address mapping considered, a stream with initial
address A0, length L1 = 2^λ1 and stride S = σ·2^x is
balanced iff x ≤ c0 and λ1 ≥ c1+s-x (see [5] for the proof).
Since a balanced spatial distribution is a necessary
condition for conflict-free access we want an address
mapping that satisfies this condition for the maximum
number of families of strides. Moreover, since stride 1 is
very frequent, we include this stride and use the address
mapping of Figure 4, which produces balanced streams for
strides of the families x = 0, 1, ..., c0.
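The statement that families x = 0, ..., c0 are balanced can be checked by brute force on a small configuration. The sketch below is an assumption-laden illustration, not the proof of [5]: the helper names are hypothetical, "balanced" is coded as "no module holds more than L/T elements and all sections hold the same number", and the parameters (s = t = 2, λ = 5, c0 = 3, c1 = 5) are chosen to match the 128-address period of Figure 5.

```python
from collections import Counter


def module_of(addr, s, t, c0, c1):
    """(section, supermodule) of an address under the block-interleaved mapping."""
    return (addr >> c1) & ((1 << s) - 1), (addr >> c0) & ((1 << t) - 1)


def is_balanced(a0, stride, s, t, lam, c0, c1):
    """Check both balance conditions for a stream of P vectors of L elements."""
    P, T, L = 1 << s, 1 << t, 1 << lam
    addrs = [a0 + n * stride for n in range(P * L)]
    per_module = Counter(module_of(a, s, t, c0, c1) for a in addrs)
    per_section = Counter(module_of(a, s, t, c0, c1)[0] for a in addrs)
    modules_ok = max(per_module.values()) <= L // T
    sections_ok = len(per_section) == P and len(set(per_section.values())) == 1
    return modules_ok and sections_ok


params = dict(s=2, t=2, lam=5, c0=3, c1=5)
for stride in (1, 2, 3, 4, 8, 12, 16, 32):
    balanced = all(is_balanced(a0, stride, **params) for a0 in range(64))
    print(f"stride {stride:2d}: balanced for every tested A0 = {balanced}")
```

With these parameters the strides of families x ≤ 3 come out balanced for every tested initial address, while strides 16 and 32 (x = 4 and x = 5) do not, in agreement with the condition x ≤ c0.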
Figure 4: Storage scheme for a conflict-free access to families
0, 1, ..., c0.
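(Figure 4 layout: the address fields of Figure 3, with the
S-field of s bits starting at bit c1 and the M-field of t bits
starting at bit c0; the λ low-order address bits are marked
below the fields.)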
4.- Conflict-free temporal distribution
For the balanced streams obtained, we now find conflict-
free temporal distributions. Two conditions have to be
satisfied:
C1.- The P simultaneous requests must not have section
conflicts.
C2.- Consecutive accesses to a memory module have to
be separated by T cycles.
One stream may have many conflict-free temporal
distributions; the access orderings that we propose satisfy
the following properties:
P1) Simultaneous accesses go to different sections (this is
necessary, and equivalent to C1).
P2) Simultaneous accesses go to the same supermodule.
P3) T consecutive accesses go to different supermodules.
P2 and P3 are a particular way of satisfying condition C2
and lead to simple hardware for address calculation [5].
We now determine an out-of-order accessing scheme that
satisfies these conditions for the balanced families of the
address mapping of Figure 4.
Figure 5: (a) Storage scheme. (b) Address mapping. (c) Mapping of the elements of a stream with A0 = 4 and S = 4.

(a) Storage scheme: address bit-field diagram, with the marks λ and x.

(b) Address mapping. Period of the storage scheme: 128.

section     |       0        |       1        |       2        |       3
supermodule |   0   1   2   3|   0   1   2   3|   0   1   2   3|   0   1   2   3
------------+----------------+----------------+----------------+----------------
            |   0   8  16  24|  32  40  48  56|  64  72  80  88|  96 104 112 120
            |   1   9 ...
            |  ...
            |   7  15  23  31|  39  47  55  63|  71  79  87  95| 103 111 119 127
            (end of one period)
            | 128 136 144 152| 160 168 176 184| 192 200 208 216| 224 232 240 248
            |  ...

(c) Mapping of the stream elements. V0 holds addresses 4 to 128, V1 holds 132 to 256, V2 holds 260 to 384 and V3 holds 388 to 512.

section     |       0        |       1        |       2        |       3
supermodule |   0   1   2   3|   0   1   2   3|   0   1   2   3|   0   1   2   3
------------+----------------+----------------+----------------+----------------
            |       8  16  24|  32  40  48  56|  64  72  80  88|  96 104 112 120
            |   4  12  20  28|  36  44  52  60|  68  76  84  92| 100 108 116 124
            | 128 136 144 152| 160 168 176 184| 192 200 208 216| 224 232 240 248
            | 132 140 148 156| 164 172 180 188| 196 204 212 220| 228 236 244 252
            | 256 264 272 280| 288 296 304 312| 320 328 336 344| 352 360 368 376
            | 260 268 276 284| 292 300 308 316| 324 332 340 348| 356 364 372 380
            | 384 392 400 408| 416 424 432 440| 448 456 464 472| 480 488 496 504
            | 388 396 404 412| 420 428 436 444| 452 460 468 476| 484 492 500 508
            | 512
Note that access in order produces conflict-free access only
for the particular case of λ = t. Consider, for instance,
s = t = 2 and λ = 5; this results in c1 = 5 and c0 = 3
(Figure 5.a), which is consistent with the period
2^(c1+s) = 128 of Figure 5.b, where one period of the
address mapping is shown. The elements of a stream with
A0 = 4 and S = 4 (σ = 1 and x = 2) are mapped as shown in
Figure 5.c; we show the elements belonging to each vector
Vi as well. Observe that, for instance, the elements k and
(k+1) with k odd of each vector are in the same memory
module, so they cannot be requested in successive cycles.
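The effect of in-order access on this example can be observed with a small check. The sketch below is not a timing simulation of the machine (stalls are not modelled); it only counts, for the ideal schedule in which every port requests one element per cycle in order, how many requests would hit a section already requested in the same cycle or a module used fewer than T cycles earlier. The function name and the parameter values (s = t = 2, λ = 5, with the M-field at bit 3 and the S-field at bit 5, matching Figure 5(b)) are assumptions made for illustration.

```python
def count_in_order_conflicts(a0, stride, s, t, lam, c0, c1):
    """Return (section_conflicts, module_conflicts) for in-order access."""
    P, T, L = 1 << s, 1 << t, 1 << lam
    last_use = {}                  # (section, supermodule) -> cycle of latest request
    section_conflicts = module_conflicts = 0
    for cycle in range(L):
        sections_seen = set()
        requested = set()
        for port in range(P):
            # Port `port` reads element `cycle` of its own vector.
            addr = a0 + (port * L + cycle) * stride
            section = (addr >> c1) & (P - 1)
            supermodule = (addr >> c0) & (T - 1)
            if section in sections_seen:
                section_conflicts += 1
            sections_seen.add(section)
            requested.add((section, supermodule))
        for module in requested:
            if module in last_use and cycle - last_use[module] < T:
                module_conflicts += 1          # module still busy
            last_use[module] = cycle
    return section_conflicts, module_conflicts


print(count_in_order_conflicts(a0=4, stride=4, s=2, t=2, lam=5, c0=3, c1=5))
```

For the stream of Figure 5 both counts are nonzero: the four vectors start 128 addresses apart, so they begin in the same section, and consecutive elements of a vector share a module.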
So, the goal is to reorder the access to the elements of the
stream to obtain a conflict-free temporal distribution. To
achieve this we do the following:
1. Divide each vector into sequences of T elements (the
   elements of a sequence are in general not consecutive
   vector elements). The elements of the stream are
   identified by the triple: vector number i (0 ≤ i ≤ P-1),
   sequence number j (0 ≤ j ≤ L/T-1), and element number
   k (0 ≤ k ≤ T-1).
2. The access is performed per sequence, i.e.,
   for j                     ; sequences
     for k                   ; elements of a sequence
       access element (j,k) for all i simultaneously
To achieve conflict-free access of each sequence we define
the mapping of elements of the vectors into sequences as
described by Figure 6. Note that the division depends on
the stride family, and that sequences are identified by the
values of ju and jd.
Because of the limited space we consider only the case in
which x ≥ s (see [5] for the case x < s).
Figure 6: Element k of sequence j of vector Vi.
Port Pi accesses vector Vi in the following order (ju starts
at i and is incremented modulo 2^x):
   for ju = i to (i + 2^x - 1) mod 2^x
     for jd = 0 to 2^(c0-x) - 1
       for k = 0 to T-1
         access element (j,k)
In the example of Figure 5, the sequences are built as
shown in Figure 7.
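The access order of each port can be sketched in code and the properties P1 to P3 checked by brute force for every sequence. The sketch uses the element address printed with Figure 6, A = A0 + (i·2^λ + k·2^(c0-x) + j)·σ·2^x with j = ju·2^(c1-x) + jd; the function names are hypothetical, and the parameter values (s = t = 2, λ = 5, c0 = 3, c1 = 5) are assumptions chosen to match the 128-address period of Figure 5. It covers only the case x ≥ s treated here.

```python
def port_schedule(i, a0, sigma, x, s, t, lam, c0, c1):
    """Addresses requested by port i, one per cycle, in the proposed order."""
    addrs = []
    for alpha in range(1 << x):
        ju = (i + alpha) % (1 << x)            # ju starts at i and wraps modulo 2^x
        for jd in range(1 << (c0 - x)):
            j = ju * (1 << (c1 - x)) + jd      # sequence number
            for k in range(1 << t):
                element = i * (1 << lam) + k * (1 << (c0 - x)) + j
                addrs.append(a0 + element * sigma * (1 << x))
    return addrs


def check_sequences(a0, sigma, x, s, t, lam, c0, c1):
    """True if every sequence satisfies P1, P2 and P3."""
    P, T = 1 << s, 1 << t
    sched = [port_schedule(i, a0, sigma, x, s, t, lam, c0, c1) for i in range(P)]
    for start in range(0, len(sched[0]), T):           # one sequence = T cycles
        order = []
        for cycle in range(start, start + T):
            sections = {(sched[i][cycle] >> c1) & (P - 1) for i in range(P)}
            supermodules = {(sched[i][cycle] >> c0) & (T - 1) for i in range(P)}
            if len(sections) != P:         # P1: all sections different
                return False
            if len(supermodules) != 1:     # P2: same supermodule for all ports
                return False
            order.append(supermodules.pop())
        if len(set(order)) != T:           # P3: T different supermodules
            return False
    return True


print(port_schedule(0, 4, 1, 2, 2, 2, 5, 3, 5)[:8])
print(check_sequences(a0=4, sigma=1, x=2, s=2, t=2, lam=5, c0=3, c1=5))
```

For the stream with A0 = 4 and S = 4, port 0 first requests addresses 4, 12, 20, 28 and then 8, 16, 24, 32, so each sequence touches every supermodule once.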
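As given in Figure 6, element k of sequence j of vector Vi
is at address

\[
A_{i,j,k} = A_0 + \left(i\,2^{\lambda} + k\,2^{c_0-x} + j\right)\sigma\,2^{x},
\qquad j = j_u\,2^{c_1-x} + j_d
\]

so its index within the stream, of λ1 = λ+s bits, is the
concatenation of the fields i (s bits), ju (x bits), k (t bits)
and jd (c0-x bits), from most to least significant.
In this way, the simultaneously requested elements have
addresses

\[
A_{ijk} = A_0 + \left(i\,2^{\lambda} + \big((i+\alpha) \bmod 2^{x}\big)\,2^{c_1-x} + k\,2^{c_0-x} + \beta\right)\sigma\,2^{x}
\]

for some α = 0...2^x-1 and β = 0...2^(c0-x)-1.
These P addresses are mapped in section

\[
\left\lfloor \frac{A_{ijk}}{2^{c_1}} \right\rfloor \bmod 2^{s}
= \left( \left\lfloor \frac{A_0 + k\sigma 2^{c_0} + \beta\sigma 2^{x}}{2^{c_1}} \right\rfloor
+ \big((i+\alpha) \bmod 2^{x}\big)\,\sigma \right) \bmod 2^{s}
\]

(the contribution of the field i is a multiple of 2^s because
λ ≥ c1 and x ≥ s). Since σ is odd and x ≥ s, this expression
takes different values for the P values of i, so property P1
is satisfied. Moreover, these addresses are mapped into the
supermodule ⌊Aijk / 2^c0⌋ mod 2^t.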
Figure 7: Sequences in a stream with A0=4 and S=4 with L=32, P=4 and T=4.

section     |       0        |       1        |       2        |       3
supermodule |   0   1   2   3|   0   1   2   3|   0   1   2   3|   0   1   2   3
------------+----------------+----------------+----------------+----------------
            |       8  16  24|  32  40  48  56|  64  72  80  88|  96 104 112 120
            |   4  12  20  28|  36  44  52  60|  68  76  84  92| 100 108 116 124
            | 128 136 144 152| 160 168 176 184| 192 200 208 216| 224 232 240 248
            | 132 140 148 156| 164 172 180 188| 196 204 212 220| 228 236 244 252
            | 256 264 272 280| 288 296 304 312| 320 328 336 344| 352 360 368 376
            | 260 268 276 284| 292 300 308 316| 324 332 340 348| 356 364 372 380
            | 384 392 400 408| 416 424 432 440| 448 456 464 472| 480 488 496 504
            | 388 396 404 412| 420 428 436 444| 452 460 468 476| 484 492 500 508
            | 512

(Rows are grouped by vector, V0 to V3. In the original figure, shading marks the first, second and third sequences of each vector.)
Expanding the supermodule number of these addresses gives

\[
\left\lfloor \frac{A_{ijk}}{2^{c_0}} \right\rfloor \bmod 2^{t}
= \left( \left\lfloor \frac{A_0 + \beta\sigma 2^{x}}{2^{c_0}} \right\rfloor + k\sigma \right) \bmod 2^{t}
\]

This expression is independent of i, so P2 is also
fulfilled. Accessing consecutive elements of a sequence
means giving T consecutive values to k in the above
expression; since σ is odd, the resulting values are
different, so P3 is also satisfied and access to one
sequence is conflict-free.
Although each sequence is accessed without conflicts, the
access of consecutive sequences might lead to conflicts in
the memory modules. In Figure 7, for instance,
supermodules are visited in the order <0,1,2,3> by the first
sequences and in the order <1,2,3,0> by the second ones;
all the requests made in the fifth cycle collide with those
made in the second cycle.
To solve these intersequence conflicts, we use two address
generators, as shown in Figure 8. During the first T cycles,
the addresses of the first sequence are calculated and used
for memory access; in addition, its supermodule order is
stored in a circular shift register. Meanwhile, the addresses
of the second sequence are calculated and stored in a set of
buffers. After that, for each sequence, the addresses for
access are obtained from the buffers under the control of
the shift register, and new addresses are calculated and
stored (so the second address generator works only during
the first T cycles).
The additional hardware required is one address generator,
a circular shift register and 2·T buffers.
Figure 8: Hardware required for sequence reordering.
(Figure 8 labels: initial address of Vi and stride S; address
generators 1 and 2, producing (section, supermodule,
displacement) tuples; 2T buffers; supermodule order;
arbiter; mux selecting between sequence 1 and the other
sequences; output to the memory system.)
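The effect of the reordering can be sketched behaviourally. The code below does not model the hardware of Figure 8; it only assumes that every sequence is issued in the supermodule order recorded for the first sequence, which is the role the text assigns to the circular shift register and the buffers. Function names and parameter values (s = t = 2, λ = 5, c0 = 3, c1 = 5, matching the 128-address period of Figure 5) are assumptions for illustration.

```python
def access_schedule(i, a0, sigma, x, s, t, lam, c0, c1, reorder=True):
    """Addresses issued by port i; optionally reorder every sequence so that
    it follows the supermodule order of the first sequence."""
    T = 1 << t
    schedule, reference_order = [], None
    for alpha in range(1 << x):
        ju = (i + alpha) % (1 << x)
        for jd in range(1 << (c0 - x)):
            j = ju * (1 << (c1 - x)) + jd
            seq = [a0 + (i * (1 << lam) + k * (1 << (c0 - x)) + j) * sigma * (1 << x)
                   for k in range(T)]
            if reference_order is None:
                # Supermodule order of the first sequence (shift register contents).
                reference_order = [(a >> c0) & (T - 1) for a in seq]
            if reorder:
                # Buffered addresses are read out in the recorded order.
                by_supermodule = {(a >> c0) & (T - 1): a for a in seq}
                seq = [by_supermodule[m] for m in reference_order]
            schedule.extend(seq)
    return schedule


def count_conflicts(schedules, s, t, c0, c1):
    """Count section conflicts plus requests to a module used < T cycles ago."""
    P, T = 1 << s, 1 << t
    last_use, conflicts = {}, 0
    for cycle in range(len(schedules[0])):
        modules = [((sch[cycle] >> c1) & (P - 1), (sch[cycle] >> c0) & (T - 1))
                   for sch in schedules]
        conflicts += P - len({sec for sec, _ in modules})     # section conflicts
        for m in set(modules):
            if m in last_use and cycle - last_use[m] < T:     # module still busy
                conflicts += 1
            last_use[m] = cycle
    return conflicts


args = dict(a0=4, sigma=1, x=2, s=2, t=2, lam=5, c0=3, c1=5)
plain = [access_schedule(i, reorder=False, **args) for i in range(4)]
fixed = [access_schedule(i, reorder=True, **args) for i in range(4)]
print("conflicts without reordering:", count_conflicts(plain, 2, 2, 3, 5))
print("conflicts with reordering:   ", count_conflicts(fixed, 2, 2, 3, 5))
```

With a fixed supermodule order, a given module can only be requested in cycles that are a multiple of T apart, which is why the reordered schedule shows no conflicts in this check.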
5.- Conclusions
We have presented a scheme to access in a conflict-free
manner a stream that is divided among P processor ports.
The basis of the scheme is to perform a synchronized and
out-of-order access. In this manner, conflict-free access is
achieved to a window of families of strides. Unlike a
previously proposed method, this scheme does not require
the precomputation of the addresses, but computes them
easily on-the-fly. The method has been presented for a
matched memory system and a crossbar interconnection
network. However, it has been extended to the unmatched
case and to the use of multistage networks.
This work has been supported by the Ministry of Education of
Spain under contract TIC-880/92, by the ESPRIT Basic Research
Action 6634 APPARC and by the CEPBA (European Center for
Parallelism of Barcelona).
References
1. D.T. Harper III, "Address Transformations to Increase
Memory Performance", Int. Conf. on Parallel
Processing, pp. 237-241, 1989.
2. M. Valero, T. Lang, J.M. Llaberia, M. Peiron, E.
Ayguade and J.J. Navarro, "Increasing the Number of
Strides for Conflict-Free Vector Access", Int. Symp. on
Computer Architecture, pp. 372-381, 1992.
3. A. Seznec and J. Lenfant, "Interleaved Parallel
Schemes: Improving Memory Throughput on
Supercomputers", Int. Symp. on Computer
Architecture, pp. 246-255, 1992.
4. H. Tamura, Y. Shinkai and F. Isobe, "The
Supercomputer FACOM VP System", Fujitsu Technical
Journal, 1985.
5. M. Peiron, M. Valero, E. Ayguadé and T. Lang,
"Synchronized Access to Streams in SIMD Vector
Multiprocessors", Research Report DAC 93/05, 1993.
6. D.T. Harper III and D. A. Linebarger, "Conflict-Free
Vector Access Using a Dynamic Storage Scheme",
IEEE Trans. on Computers, vol. 40, no. 3, pp. 276-283,
1991.
