In vector multiprocessor systems, collisions in the interconnection network and conflicts in the memory modules are the main causes of performance degradation. In this work we use a synchronized interconnection network and propose an interleaved storage scheme together with an out-of-order access to the elements of the stream that allow conflict-free access. The streams are generated by the different processors in an asynchronous manner. The mechanism works for the most common strides found in real programs. The required hardware is also described, and its complexity is shown to be equivalent to that required when the processor requests the elements in order.
Introduction
The structure and organization of the memory system is a fundamental issue in the design of high performance computers. This organization basically depends on the type of the machine and on the characteristics of the memory traffic generated by the processor requests.
In this paper we are concerned with memory access in vector multiprocessor systems. Each processor requests streams and scalar data from memory. A stream is a regular sequence of addresses identified by the address of its first element $A_0$, the length or number of elements $L$, and the stride or distance between each pair of consecutive elements, $S$. If the system has a cache memory to store scalar data, streams with stride one (cache lines) are generated on cache misses. Accesses to indexed vectors are treated as sequences of scalar accesses.
The main purpose in the design of the memory system is to achieve the minimum latency and maximum bandwidth. With this aim, the memory of vector uniprocessor and multiprocessor systems is organized in $M$ modules (usually a power of two, $2^m$) that can be accessed simultaneously. If the access latency of a memory module is $T=2^t$ processor cycles, the memory can provide a maximum bandwidth of $M/T$ accesses per cycle.
Therefore, in a system with $P$ processors and one port per processor, a minimum of $P \cdot T$ memory modules is required to achieve one access per processor per cycle. The memory system for vector multiprocessors is usually organized as shown in Figure 1, and it is usually a significant part of the cost of the system. In general, the $2^m$ memory modules are grouped into $2^s$ sections, and the $2^{m-s}$ memory modules in each section are connected by a single bus. Processors are connected to sections through a $2^s \times 2^s$ interconnection network. This multimodule memory organization implies the existence of a function that assigns to each address of the processor's linear address space a section number, a module number within the section (named supermodule number) and a displacement inside the module. In systems with vector processors, it is important to minimize the latency of stream accesses. In vector uniprocessors, conflicts in the memory modules increase the access time.
There is a conflict in a memory module whenever an access to it is attempted before a previous request finishes. The access to a whole stream is called conflict-free if consecutive references to each memory module are separated by at least $T$ cycles. As summarized in [1], several storage schemes have been proposed to produce either conflict-free access to vectors with some strides or minimum average latency for a uniform distribution of strides. In the latter case, buffers can be added to achieve high throughput for long vectors. In [2, 3] a technique was proposed to request the stream elements out of order, which increases the number of conflict-free families (family $x$ of strides is defined as the set of strides $S=\sigma \cdot 2^x$, where $\sigma$ is an odd number [4]).
This technique can be applied to any storage scheme and allows conflict-free access to the maximum number of families of strides. In vector multiprocessors, collisions in the interconnection network and conflicts in the memory modules are the causes of performance degradation. Several aspects of the processor requests, such as their asynchronous character, differences in the required traffic and randomness of the generated addresses, make it difficult to study the memory system performance analytically. Although there are some statistical or mathematical models that allow the prediction of the system behaviour under simple hypotheses and configurations [5, 6], most of the system evaluations in the literature have been performed by simulation, assuming either real or synthetic input traffic [7, 8, 9].
In order to increase the efficiency of stream accesses, several proposals have been made. Some of them [10, 11] assume that the processors access vectors belonging to the same stream and that the access is performed in a synchronous way (processors start the access to the vectors at the same time). Another proposal [12] synchronizes the interconnection network to allow asynchronous and conflict-free access to different streams.
More specifically, [10] proposes a storage scheme called Interleaved Parallel Scheme (IPS), as well as an out-of-order access mechanism. This allows conflict-free access for some strides when the interconnection network is a Benes multistage network. Each processor needs to precalculate the addresses of all its stream elements, and then the processors send their requests in a synchronized manner. In [11] an alternative solution to the previous problem is presented, based on the techniques developed in [2, 3] for the single-processor case. This solution allows conflict-free access to the same families of strides as in [10]. However, it requires neither the precalculation of the stream addresses nor a Benes interconnection network; an Omega network is sufficient. These techniques are well suited to special-purpose applications, but they can also be used for the execution of general-purpose programs. For the latter, [11] shows the applicability of the technique to some programs from the Perfect Club and SPEC benchmarks.
In [12] a technique to synchronize the interconnection network in order to allow conflict-free access to cache lines is presented. In this synchronous mode of operation, the memory module that can be accessed by each processor at each cycle is predefined. In fact, these cache lines are streams with stride $S=1$ and length $L=2^m$. The access to any stream can be initiated by each processor at any cycle in an asynchronous way. In addition, this technique decreases the cost of the interconnection network.
In this paper we assume a synchronous network, as in [12], and present a storage scheme and an access method based on the out-of-order technique introduced in [2, 3] that allow conflict-free access to other strides, including the most frequent ones in real programs. The technique is applicable to vector multiprocessors where streams are generated when vector accesses are performed or cache lines are transferred on cache misses. Conflict-free access is obtained for streams with the most common strides found in real programs and a length that is a multiple of the number of memory modules. Due to space limitations, the technique is presented in this paper for odd strides.
The organization of the paper is as follows. Section 2 describes the conditions for a conflict-free access. Sections 3 and 4 describe how the interconnection network is synchronized and how the access to the stream is performed out of order, following the synchronization predefined by the network. Section 5 presents the hardware and shows that its complexity is similar to that required for in-order access. Finally, Section 6 concludes the paper, outlines how some of the hypotheses in Section 4 can be relaxed and presents some future work.
Address Mapping and Balanced Streams
Assume the memory organization shown in Figure 1 and the address mapping of Figure 2. For each address generated by the processor, bits $a_{m-1:m-s}$ indicate the section and bits $a_{m-s-1:0}$ indicate the supermodule within the section.

Assume that one processor requests a stream with any initial address $A_0$, length $L=2^\lambda$ with $\lambda \ge m$, and stride $S=\sigma \cdot 2^0$. Since $\sigma$ is odd, the elements in each group of $2^m$ consecutive elements have different combinations in the lowest $m$ bits, which identify the memory module number. If a single processor is considered alone, a stream is called single-balanced when each memory module holds at most $L/T$ stream elements. This condition is necessary (but not sufficient) for a conflict-free access to the stream, assuming that the remaining processors do not access memory. For instance, if the storage scheme shown in Figure 3 is used, any stream with a stride belonging to family $x=0$ and length $L=2^\lambda$ ($\lambda \ge m+1$) is balanced. However, if the elements of the stream are requested in order, the access may not be conflict-free because consecutive accesses to the same memory module may be separated by fewer than $T$ cycles. If the stream is balanced, it is nevertheless always possible to change the access order and obtain a conflict-free access [2, 3]. For instance, for a memory system with eight memory modules ($m=3$) and latency $T=4$, and a stream with $A_0=0$, $L=16$ and $S=1$, the storage scheme in Figure 3 stores the elements of the stream in memory modules (0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7). Consecutive accesses to the same memory module are then separated by fewer than $T$ cycles, and therefore the in-order access to the stream is not conflict-free. However, the access can be performed out of order by first requesting all the even elements (0, 2, 4, 6, 8, 10, 12, 14) and then all the odd elements (1, 3, 5, 7, 9, 11, 13, 15). Now at least $T$ cycles separate consecutive accesses to the same memory module.
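The eight-module example can be checked with a short Python sketch. The mapping `module_of` is an assumption: it is simply the simplest function that reproduces the module list quoted in the text for Figure 3 (two consecutive addresses per module), and the helper names are illustrative only.

```python
T = 4   # module latency in cycles
M = 8   # number of memory modules (m = 3)


def module_of(address: int) -> int:
    # Assumed mapping: two consecutive addresses share a module,
    # which matches the list (0, 0, 1, 1, ..., 7, 7) given in the text.
    return (address // 2) % M


def is_conflict_free(modules, latency: int) -> bool:
    # modules[c] is the module requested in cycle c (one request per cycle).
    last_cycle = {}
    for cycle, module in enumerate(modules):
        if module in last_cycle and cycle - last_cycle[module] < latency:
            return False
        last_cycle[module] = cycle
    return True


A0, L, S = 0, 16, 1
in_order = list(range(L))
even_then_odd = list(range(0, L, 2)) + list(range(1, L, 2))

in_order_modules = [module_of(A0 + k * S) for k in in_order]
reordered_modules = [module_of(A0 + k * S) for k in even_then_odd]

print(is_conflict_free(in_order_modules, T))    # in-order request sequence
print(is_conflict_free(reordered_modules, T))   # even elements first, then odd
```

Under this assumed mapping the reordered sequence visits modules 0 to 7 twice in a row, so two requests to the same module are eight cycles apart.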
In a vector multiprocessor system and in the context of asynchronous access to streams, a stream is called multi-balanced if each memory module holds at most $L/(P \cdot T)$ stream elements and each section holds at most $L/2^s$ elements. With the storage scheme of Figure 2, streams with strides belonging to the families $x=0, 1, \ldots, m-(s+t)$ are balanced. Since each processor port accesses its stream asynchronously, it is seldom possible to obtain a conflict-free access, even when the distribution of the elements of each individual stream allows one. Collisions in the network and conflicts in the memory modules, caused by simultaneous requests, may prevent a conflict-free access. To have a conflict-free access the following conditions have to be satisfied:
C1: In each processor cycle, the requests generated by the processors (P at most) must not collide in the interconnection network.
C2: Consecutive accesses to any memory module have to be separated by T cycles at least.
In the next section, we synchronize the interconnection network in order to satisfy the two previous conditions. As we will see, each of the simultaneously requested streams has to be multi-balanced in order to have a conflict-free access.
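The multi-balanced condition can be tested by counting elements per module and per section. The sketch below is illustrative and rests on two assumptions that are not spelled out at this point in the text: a matched memory with $2^{s+t}$ modules, and a mapping in which the lowest $s+t$ address bits select the module and the upper $s$ of those bits select the section. The function name is hypothetical.

```python
from collections import Counter


def is_multi_balanced(a0: int, length: int, stride: int, s: int, t: int) -> bool:
    """Check the per-module and per-section limits for one stream.

    Assumptions: P = 2**s sections, T = 2**t, M = P*T modules, and the
    module number is the lowest s+t bits of the address.
    """
    num_modules = 1 << (s + t)
    per_module = Counter()
    per_section = Counter()
    for k in range(length):
        module = (a0 + k * stride) % num_modules
        per_module[module] += 1
        per_section[module >> t] += 1
    module_limit = length / num_modules     # L / (P*T)
    section_limit = length / (1 << s)       # L / 2**s
    return (max(per_module.values()) <= module_limit
            and max(per_section.values()) <= section_limit)


print(is_multi_balanced(a0=3, length=16, stride=7, s=2, t=2))   # odd stride
print(is_multi_balanced(a0=0, length=16, stride=2, s=2, t=2))   # even stride
```

With an odd stride every module receives exactly one of the 16 elements, whereas stride 2 places two elements in each even-numbered module and exceeds the limit.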
Network Synchronization
From now on we will focus on vector multiprocessors with a matched memory system. This means that the memory system is composed of $M=P \cdot T$ modules. This configuration has the minimum number of memory modules so that $P=2^s$ memory accesses can be performed per processor cycle.
As proposed in [12] , we will assume that the interconnection network works in a synchronous way. In this mode of operation, the memory module that can be accessed by each processor at each network cycle is predefined:
• Each processor accesses a given memory module every $P \cdot T$ cycles (corresponding to the period of the network).
• At each cycle, all processors send requests to different sections, and the resulting permutation of sections does not collide in the interconnection network.
• At least $T$ cycles separate two consecutive accesses to each memory module.
There are several ways to synchronize the network and fulfil the previous constraints. We will use the following one: at any cycle, all processors access the same supermodule but in different sections; in addition, during $T$ cycles each processor accesses the same section but different supermodules, and then it moves to the next section. Figure 4 shows the synchronization of the network for a system with $P=4$ and $T=4$. In the table, $(i, j)$ refers to an access to section $i$, supermodule $j$. We can see that the period of the network is 16 cycles in this case.
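A schedule of this kind can be sketched and checked against the three constraints. The per-processor starting section used here (`processor + cycle // T`) is an assumption, since the table of Figure 4 is not reproduced in the text; it is chosen because it agrees with the one slot quoted later (processor 2 in cycle 14 accesses section 1, supermodule 2). The function `slot` is a hypothetical helper.

```python
P, T = 4, 4          # processors (= sections) and module latency
PERIOD = P * T       # network period in cycles


def slot(processor: int, cycle: int):
    """Return the (section, supermodule) assigned to a processor in a cycle.

    Assumption: processor p starts in section p and advances one section
    every T cycles; all processors share the supermodule cycle % T.
    """
    c = cycle % PERIOD
    section = (processor + c // T) % P
    supermodule = c % T
    return section, supermodule


# Print the schedule as a table: one row per cycle, one column per processor.
for cycle in range(PERIOD):
    print(cycle, [slot(p, cycle) for p in range(P)])
```

Under this assumption the module visited by processor `p` in cycle `c` is `(p * T + c) % (P * T)`, so consecutive slots of a processor correspond to consecutive memory modules.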
Out-of-order Access
In this section we explain how a conflict-free access is achieved for the network synchronization proposed in Section 3 when processors request their streams in an asynchronous way. We consider the storage scheme shown in Figure 2 and the access to streams with any initial address $A_0$, length $L=2^\lambda$ such that $\lambda=s+t=m$, and stride $S=\sigma \cdot 2^0$.
Note that in this case each memory module holds just one element of the stream. The access to streams with other lengths and strides is outlined in Section 6. Assume that $P-1$ processors are currently accessing memory in a conflict-free way, and that the idle processor initiates a memory access for a stream with the previous characteristics. If the processor is forced to access section $i$ and supermodule $j$ in the cycle in which the access is initiated, the address generator should be able to send requests to memory starting with the element stored in $(i, j)$ and following the order predefined by the network synchronization (see footnote 1).
For instance, for a stream with initial address $A_0=3$, length $L=16$ and stride $S=7$, Figure 5.a shows the sequence of memory modules that would be accessed in order. If the access to this stream is initiated by processor 2 in cycle 14 of Figure 4 (so it has to access section 1, supermodule 2), the address generator must generate addresses following the synchronization defined by the interconnection network. It has to start with element 5 and proceed with the rest of the elements as shown in Figure 5.b. For the case we are considering, the address generator requests, in consecutive cycles, elements of the stream that are separated by a constant value equal to 7 (so after element $k$, element $(k+7) \bmod 16$ is requested).
Next we describe how the address generator computes the sequence of the stream elements assuming the synchronization of the interconnection network presented in Section 3 and shown in Figure 4.
Footnote 1: In this section we assume that the address generator takes zero cycles to compute the address of the first requested element. Section 5 considers the implementation of the address generator.

Let $\alpha$ be the value of bits $a_{s+t-1:0}$ of the initial address $A_0$. This bit field indicates that the first element of the stream is located in section $\lfloor \alpha/2^t \rfloor$ and supermodule $\alpha \bmod 2^t$. Let $\beta$ be the value of the lowest $s+t$ bits of the binary representation of the stride $S$, and let $\gamma = i \cdot 2^t + j$ be the lowest $s+t$ bits of the address of the first element that has to be sent to memory in the cycle when the access is started. This is equivalent to saying that the processor has to request first the element located in section $i$, supermodule $j$. Two problems have to be solved to perform the access to the whole stream following the network synchronization order:

i) Compute the element $\Delta_0$ of the stream that is stored in section $i$, supermodule $j$. To solve this problem we have to find a solution to the following expression:

\[ (\alpha + \Delta_0 \cdot \beta) \bmod 2^{s+t} = \gamma \tag{1} \]

ii) Find the sequence $\Delta_1, \Delta_2, \ldots, \Delta_{L-1}$ of the elements that are stored in consecutive sections and supermodules corresponding to consecutive slots assigned to the processor by the network synchronization. In the case of Figure 4, we see that consecutive slots correspond to consecutive memory modules. Assuming that element $\Delta_k$ of the stream has value $\omega$ in the lowest $s+t$ bits of the address, we have to find $\delta$ such that the element $\Delta_{k+1} = (\Delta_k + \delta) \bmod 2^{s+t}$ has value $(\omega+1) \bmod 2^{s+t}$ in the same bits. To solve this problem we have to find a solution to the expression

\[ (\omega + \delta \cdot \beta) \bmod 2^{s+t} = (\omega + 1) \bmod 2^{s+t} \]

or, equivalently,

\[ (\delta \cdot \beta) \bmod 2^{s+t} = 1 \tag{2} \]

with $0 \le \delta \le 2^{s+t}-1$. Since $\beta$ is odd, it is invertible modulo $2^{s+t}$, so this solution exists and is unique.

Note that the value of $\delta$ only depends on the value of $\beta$ (corresponding to the $s+t$ lowest bits of the binary representation of $S$), so any pair of elements separated by a constant distance $\delta$ are located in consecutive memory modules. If we solve equation (2) and obtain the value of $\delta$, the solution to equation (1) becomes simpler and can be formulated as

\[ \Delta_0 = \big(((\gamma - \alpha) \bmod 2^{s+t}) \cdot \delta\big) \bmod 2^{s+t} \tag{3} \]

For instance, let us consider again the stream shown in Figure 5 with initial address $A_0=3$ ($\alpha=3$), length $L=16$ and stride $S=7$ ($\beta=7$), and the access initiated by processor 2 in cycle 14 of the network. As shown in Figure 4, at this cycle processor 2 must access section 1 and supermodule 2 ($\gamma=6$). Solving the equation $(\delta \cdot 7) \bmod 16 = 1$ we obtain $\delta=7$. With this value of $\delta$, the value of $\Delta_0$ that solves equation (3) is $\Delta_0 = (((6-3) \bmod 16) \cdot 7) \bmod 16 = 5$. Using these values, we obtain the sequence 5, 12, 3, 10, ... shown in Figure 5.b.
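The worked example can be reproduced with a few lines of Python. This is a sketch under the assumptions stated in this section (odd stride, stream length $2^{s+t}$, module number given by the lowest $s+t$ address bits); the function names are illustrative. Here `delta` is the value satisfying `(delta * beta) % 2**(s+t) == 1`, found by exhaustive search in the spirit of a small look-up table, and the first element is `((gamma - alpha) % 2**(s+t)) * delta % 2**(s+t)`.

```python
def inverse_mod_pow2(beta: int, bits: int) -> int:
    """Return delta in [0, 2**bits) with (delta * beta) % 2**bits == 1."""
    modulus = 1 << bits
    if beta % 2 == 0:
        raise ValueError("beta must be odd to be invertible modulo a power of two")
    for delta in range(modulus):        # exhaustive search, like filling a small table
        if (delta * beta) % modulus == 1:
            return delta
    raise AssertionError("unreachable for odd beta")


def element_order(a0: int, stride: int, section: int, supermodule: int,
                  s: int, t: int):
    """Element numbers in the order imposed by consecutive network slots."""
    modulus = 1 << (s + t)
    alpha = a0 % modulus                       # module of element 0
    beta = stride % modulus                    # stride reduced to s+t bits
    gamma = (section << t) + supermodule       # module of the first slot
    delta = inverse_mod_pow2(beta, s + t)
    first = (((gamma - alpha) % modulus) * delta) % modulus
    return [(first + k * delta) % modulus for k in range(modulus)]


order = element_order(a0=3, stride=7, section=1, supermodule=2, s=2, t=2)
modules = [(3 + e * 7) % 16 for e in order]
print(order[:4])     # first requested elements
print(modules[:4])   # modules they live in
```

The element order starts 5, 12, 3, 10 as in the text, and the corresponding modules are 6, 7, 8, 9, that is, consecutive modules starting at the slot assigned by the network.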
The streams generated on cache misses considered in [12] are a particular case in which $\alpha=0$ and $\beta=1$. With these values we have $\Delta_0=\gamma$ and $\delta=1$. So the access starts with the element indicated by the cycle of the interconnection network and, from there, element $(k+1) \bmod L$ is requested after element $k$. For instance, for a cache line access initiated at cycle 14 by processor 2 in Figure 4, the elements would be accessed in the following order: (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5).
Hardware
In this section we describe the hardware required to generate the sequence of elements and addresses described above. First of all, the value of $\delta$ can be obtained from $\beta$ with a small look-up table of $2^{s+t}$ words of $s+t$ bits each. Once $\delta$ is known, the value of $\Delta_0$ can be obtained from (3). Note that $\alpha$, $\beta$, $\gamma$ and $\delta$ are small values (for instance, for stream lengths of 64 or 128 elements they would be 6 or 7 bits wide, respectively) and therefore the value of $\Delta_0$ can be computed with fast and simple hardware. Finally, the address of the element $\Delta_0$, which is the first one that has to be sent to memory, is computed as $A_0 + \Delta_0 \cdot S$. The number of cycles $\varepsilon$ needed to compute the address of $\Delta_0$ is quite similar to the number of cycles needed when the access is performed in order. The value of $\varepsilon$ depends on the hardware implementation and has to be used to calculate the section and supermodule where the first element $\Delta_0$ is located. In particular, if $(i, j)$ are the section and supermodule that must be accessed by the processor when the memory access instruction is issued, the value of $\gamma$ in expression (3) is $\gamma = (i \cdot 2^t + j + \varepsilon) \bmod 2^{s+t}$ (which corresponds to the network cycle when the access is actually started). After the first element, the rest of the elements have to be requested in the following order:

\[ \Delta_k = (\Delta_{k-1} + \delta) \bmod 2^{s+t} \]

with $1 \le k \le 2^\lambda - 1$. The addresses that correspond to these elements are

\[ A(\Delta_k) = A_0 + \Delta_k \cdot S \]

These addresses can be calculated as follows:

\[ A(\Delta_k) = \begin{cases} A(\Delta_{k-1}) + \delta \cdot S & \text{if } \Delta_{k-1} + \delta < 2^{s+t} \\ A(\Delta_{k-1}) + (\delta - 2^{s+t}) \cdot S & \text{otherwise} \end{cases} \tag{4} \]

Figure 6 shows a possible implementation of the hardware that generates the sequence of element numbers and addresses. The hardware needed to compute the initial values of $\delta$ and $\Delta_0$ is omitted. The circuit in Figure 6.a generates the sequence of the elements $\Delta_k$ starting from $\Delta_0$. The circuit in Figure 6.b generates the corresponding memory addresses starting from $A_0 + S \cdot \Delta_0$ and following the sequencing defined by expression (4); the carry-out signal from the adder in Figure 6.a is used to select between the two choices in expression (4). The down counter in Figure 6.c detects when the access to the stream is finished.
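The behaviour of the three circuits can be mimicked in software. The following Python sketch is a behavioural model only, not a description of the actual circuit of Figure 6: it assumes that the address adder selects between the two precomputed increments `delta * S` and `(delta - 2**(s+t)) * S` according to the carry out of the element adder, which is one way to realize the two choices of expression (4). All names are illustrative.

```python
def address_sequence(a0: int, stride: int, first: int, delta: int, bits: int):
    """Yield (element, address) pairs using one addition per step."""
    modulus = 1 << bits
    step_no_carry = delta * stride              # used when the element adder has no carry out
    step_carry = (delta - modulus) * stride     # used when the element adder wraps around
    element = first
    address = a0 + first * stride               # address of the first requested element
    remaining = modulus                         # down counter: one element per module
    while remaining > 0:
        yield element, address
        total = element + delta
        carry = total >= modulus                # carry out of the (s+t)-bit adder
        element = total - modulus if carry else total
        address += step_carry if carry else step_no_carry
        remaining -= 1


pairs = list(address_sequence(a0=3, stride=7, first=5, delta=7, bits=4))
print(pairs[:4])
```

For the running example ($A_0=3$, $S=7$, $\Delta_0=5$, $\delta=7$) every generated address equals $A_0 + \Delta_k \cdot S$, so the incremental update agrees with the direct formula.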
Conclusions
In this paper we have used a synchronized network and proposed a storage scheme and an out-of-order access mechanism that allow conflict-free access for streams with any initial address $A_0$, length $L=2^{s+t}$, and odd stride $S=\sigma \cdot 2^0$. The streams are generated in an asynchronous way by the processors when accessing vectors with constant stride or cache lines. The out-of-order method is based on the existence of an interconnection network that works in a synchronous way: the memory modules that can be accessed by each processor at each cycle are predefined. The network synchronization forces each processor to access, at each network cycle, a memory module that is not busy with a previous request. In addition, processors access sections in such a way that collisions in the interconnection network are avoided. Processors can access their streams in an asynchronous way and the network ensures that the access is performed in a conflict-free manner (that is, with the minimum latency). The hardware needed to perform the accesses is simple and fast, with a complexity similar to that of the hardware needed to access the elements in order.
In Section 4 we have assumed that the length of the stream is $L=2^{s+t}$. If the length is $L=2^{s+t+\varphi}$ with $\varphi>0$, there are $2^\varphi$ elements in each memory module. In this case, it is easy to have a conflict-free access if the stream is divided into $2^\varphi$ substreams of length $2^{s+t}$, and each substream is accessed as described before.
If the length of the stream is $L < 2^{s+t}$, the same mechanism can be used and the stream can be accessed as if its length were $2^{s+t}$. In this case, there are some cycles in which the processor cannot send a request to memory and, as a consequence, the access latency increases (it will never be greater than the latency needed to access $2^{s+t}$ elements). The same situation arises when the length of the stream is not a multiple of the number of memory modules, $L = c \cdot 2^{s+t} + c'$. In this case, the access is done with a latency not greater than the latency needed to access a stream of length $(c+1) \cdot 2^{s+t}$.
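The bound for arbitrary lengths amounts to rounding the stream length up to the next multiple of the number of memory modules. The helper below is a small illustrative sketch (the name `slots_needed` is hypothetical) that returns this rounded length, interpreted as an upper bound on the number of network slots the access occupies.

```python
def slots_needed(length: int, s: int, t: int) -> int:
    """Upper bound on the network slots used by a stream of `length` elements.

    The length is rounded up to the next multiple of 2**(s+t); slots for
    which the stream has no element are simply left unused.
    """
    modules = 1 << (s + t)
    passes = -(-length // modules)     # ceiling division
    return passes * modules


print(slots_needed(10, s=2, t=2))   # shorter than one full pass
print(slots_needed(37, s=2, t=2))   # 37 = 2 * 16 + 5
```

For 16 modules, a stream of 37 elements ($c=2$, $c'=5$) is bounded by the cost of a stream of $(c+1) \cdot 16 = 48$ elements.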
The storage scheme of Figure 2 allows conflict-free access to streams with strides of the family $x=0$ (odd strides), but it can be modified to increase the number of families for which conflict-free access is obtained [13]. To do that, a block-interleaved storage scheme, such as the one shown in Figure 7, and an out-of-order access method such as the one described in [2, 3] are used. In this case, it is possible to have conflict-free access for streams with strides belonging to the families $x=0, 1, \ldots, \lambda-(s+t)$. For instance, for a vector processor with $L=128$ ($\lambda=7$) and a configuration with 8 sections ($s=3$) and memory module latency $T=4$, the conflict-free families that are obtained are $x=0$, 1 and 2, which include the most frequent strides in current applications [11].

Figure 7: Block-interleaved storage scheme for a matched memory system that allows conflict-free access for families $x=0, 1, \ldots, \lambda-(s+t)$.
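The range of families can be computed directly from the stream length, the number of sections and the module latency. The sketch below only evaluates the bound $\lambda-(s+t)$ quoted above; it does not model the block-interleaved scheme itself, and the function names are illustrative. All three parameters are assumed to be powers of two.

```python
def stride_family(stride: int) -> int:
    """Return x such that stride = sigma * 2**x with sigma odd."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    x = 0
    while stride % 2 == 0:
        stride //= 2
        x += 1
    return x


def log2_exact(value: int, name: str) -> int:
    """Return log2(value), insisting that value is a power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name} must be a power of two")
    return value.bit_length() - 1


def conflict_free_families(stream_length: int, sections: int, latency: int):
    """Families x = 0 .. lambda-(s+t) for the given configuration."""
    lam = log2_exact(stream_length, "stream_length")
    s = log2_exact(sections, "sections")
    t = log2_exact(latency, "latency")
    return list(range(lam - (s + t) + 1))


families = conflict_free_families(stream_length=128, sections=8, latency=4)
print(families)                          # [0, 1, 2]
print(stride_family(12) in families)     # True: 12 = 3 * 2**2
print(stride_family(8) in families)      # False: 8 = 1 * 2**3
```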
