Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered; we show that the number of conflict-free strides is even larger for the latter case. The hardware for address calculations and access control is described and shown to be of similar complexity to that required for access in order.
Introduction
To have a sufficient memory bandwidth, the memory of fast processors is organized as several modules that can be accessed simultaneously.
To achieve a memory throughput of one access per processor cycle, the number of memory modules should be at least equal to the ratio between the memory cycle and the processor cycle (matched memory system). However, to obtain this throughput, the request sequence has to be such that there are no conflicts in the accesses. This is achieved, for example, with conventional memory interleaving for vectors with odd strides, but not for other strides or more unstructured patterns. This has motivated the proposal of other addressing schemes, the use of buffers, and an increase in the number of memory modules (unmatched memory system).
In this paper we consider ways of obtaining minimum latency for address streams which correspond to a single vector of constant stride, fixed length and any initial address. This addressing pattern is typical of vector processors (and scalar processors accessing vectors), but also appears in general scalar processing when the processor has decoupled memory access and execute. In both cases, to overlap memory access and execution, the processor is decomposed into two independent modules, as shown in Figure 1. The memory-access module performs loads and stores to/from a "register file" and the execute unit obtains the operands from that file. the instruction set. In the case of scalar processors, these regular patterns are convenient to achieve high effective memory bandwidth. These patterns could also be useful in a software-controlled cache in cases in which the spatial locality of the data corresponds to these blocks of "equally spaced" items.
We consider the case in which the processor accesses vectors of length equal to the number of elements of one of its vector registers, because this is the way the LOAD and STORE instructions operate. Since the size of the vectors is usually much larger than the size of vector registers, the above-mentioned mode of operation requires strip-mining by the compiler so that a very high fraction of the accesses are of vectors of length equal to that of the registers. In Section 5, we discuss the access of shorter vectors.
Moreover, we propose that the elements of the vector be requested out of order and that the whole vector be stored in the vector register before its use by the processor. This prevents the chaining of LOAD/STOREs with other operations; however, this is reasonable because the complex timing of memory accesses makes this chaining difficult anyhow. Even though this decoupled operation would be the default, we discuss in Section 5 that the proposed out-of-order access can support chained access/execute in cases where ordered access would make this mode of operation impractical.
The two main address transformation schemes proposed in the literature to achieve conflict-free access to vectors with strides that produce conflicts with conventional interleaving are skewing and linear transformations. These schemes were initially proposed for array processors [1, 2, 3, 4] and later for vector processors [5, 6, 7, 8]. A larger number of families of conflict-free strides can be achieved by increasing the number of memory modules (unmatched memory system). If $M = 2^m$ is the number of memory modules and $T = 2^t$ is the ratio between the memory cycle and the processor cycle, then at most $m-t+1$ families are conflict free, assuming that the elements are requested in order [6].
Although outside the scope of this paper, it is worth mentioning that techniques have also been proposed to improve efficiency for the cases in which conflict-free access is not achieved. For the skewing and linear schemes mentioned above, peak memory throughput can be obtained for x' < x for long vectors by the use of buffers [5]. Moreover, schemes based on linear transformations have been proposed to distribute randomly the modules corresponding to consecutive addresses, so that the various strides do not produce clustering in memory modules [8, 10, 12]. Recently a proposal has been made [13] for an analytical model that can be used to make comparisons among these linear transformations. For both schemes, most of the evaluations performed consider long vectors, so that the initial transient is not significant and the throughput is determined for the steady state. This throughput is evaluated as a function of several parameters, such as structure of the transformation, number of buffers, and number of memory modules. Although in [8, 10, 11] some measurements are given for short vectors, the effect of length is not discussed nor is the transformation determined with a vector length in mind.
To introduce the out-of-order accessing we use a matched memory system and a linear transformation of the addresses, although the same results can be obtained with interleaving (using an internal field of the address as module number) or with skewing. We show that this mode of accessing vectors of fixed size results in conflict-free accesses for a larger number of families of strides than ordered access. The case of unmatched memory system is also studied. The hardware required for address calculations is presented and the efficiency of the scheme is evaluated.
The results in this paper should be extended to the cases in which several vectors are accessed simultaneously by a single processor or by several processors in a multiprocessor system.
Model and Condition for Conflict-Free Access
We now describe the model of the memory subsystem, present some definitions and give the condition for conflict-free access. Figure 2 shows the general structure of the system, which has been used by previous authors. The memory is composed of $M = 2^m$ modules and the module latency is $T = 2^t$ processor cycles. Each memory module has q input and q' output buffers. The processor requests one element per (processor) cycle unless it has to wait because the associated input memory buffer is full. The latency of the vector access is defined as the number of processor cycles from the time the processor sends the first address until the last element is received. We assume that the interconnection network is a single bus with a delay of one cycle. Therefore, the latency of a con-
We consider the vector length L equal to $2^\lambda$, with $\lambda \ge m$. The first element of the vector has address $A_1$ and consecutive elements are separated by a constant value S (the stride), so that the i-th element has address $A_1 + S \cdot (i-1)$. Note that the vector can have any initial address. As done in [11], we classify the strides into families defined by x, so that all strides $\sigma \cdot 2^x$ with $\sigma$ odd belong to the same family.
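The classification of strides into families can be illustrated with a short sketch. The function names `stride_family` and `element_address` are hypothetical helpers introduced only for this illustration; they follow the definitions above (stride equal to an odd factor times a power of two, and 1-based element numbering).

```python
def stride_family(stride: int) -> tuple[int, int]:
    """Return (sigma, x) such that stride == sigma * 2**x with sigma odd."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    x = 0
    while stride % 2 == 0:
        stride //= 2
        x += 1
    return stride, x


def element_address(a1: int, stride: int, i: int) -> int:
    """Address of the i-th element (1-based): A1 + S * (i - 1)."""
    return a1 + stride * (i - 1)


if __name__ == "__main__":
    # Stride 12 = 3 * 2**2 belongs to the family x = 2.
    print(stride_family(12))
```

All odd strides, including stride one, fall into the family x = 0.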
Since the memory is organized in several modules, an address mapping is required which transforms the address A (one-dimensional) with binary representation 4,. where mi is the module corresponding to the i-th processor request. Note that the elements can be requested in any order.
Other authors use similar terms related to the spatial distribution, such as short-term and long-term equidistributed sequences [12], and to the temporal distribution, such as return numbers [14] and variability [13].
Definition:
A temporal distribution is CONFLICT FREE when every element can be accessed as soon as it is requested (the corresponding memory module is not busy with a previous request). This is equivalent to stating that a temporal distribution is conflict free if any subset of T consecutively requested elements are located in T different memory modules. This is the condition we will require.
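The condition in this definition can be checked mechanically with a sliding window. The following is a minimal sketch; the function name `is_conflict_free` and the representation of a temporal distribution as a list of module numbers in request order are assumptions made for illustration.

```python
def is_conflict_free(modules, T):
    """True if every window of T consecutive requests hits distinct modules.

    `modules` lists the module number of each request, in request order.
    """
    last_start = max(1, len(modules) - T + 1)
    for start in range(last_start):
        window = modules[start:start + T]
        if len(set(window)) != len(window):
            return False
    return True


if __name__ == "__main__":
    print(is_conflict_free([0, 1, 2, 3, 0, 1, 2, 3], T=4))  # all windows distinct
    print(is_conflict_free([0, 1, 2, 0, 3, 1, 2, 3], T=4))  # module 0 repeats too soon
```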
Moreover, from the last definition it follows directly that a necessary condition for a conflict-free temporal distribution is that the vector be T-matched. Because of this, to determine the conditions for a conflict-free temporal distribution, we first determine conditions for a T-matched vector and then consider access orders such that any T consecutive accesses are to different memory modules.
Since the spatial distribution is independent from the temporal distribution, to determine conditions for a T-matched vector we use the canonical temporal distribution, defined below, even though it might not be conflict free.
The CANONICAL TEMPORAL DISTRIBUTION of a vector is the temporal distribution when the elements are requested in order.
For linear mappings (and for skewing) the canonical temporal distribution is periodic. Consequently, we can define the canonical temporal distribution in one period and state Lemma 1, below.
The period $P_x$ of an address mapping for a vector with stride $\sigma \cdot 2^x$ is the period of its canonical temporal distribution.
We call $CTP_x$ the canonical temporal distribution in one period for a vector with stride $\sigma \cdot 2^x$.
Definition: $CTP_x$ is T-matched if the spatial distribution in the period is T-matched.
Lemma 1: If $CTP_x$ is T-matched and $L = k \cdot P_x$ with k > 0, then the vector is T-matched.
Proof: This is evident, since if each period is T-matched and the vector length is a multiple of the period, the vector has to be T-matched. □
In summary, we consider vectors with $L = k \cdot P_x$, determine the conditions for having a T-matched $CTP_x$ and then find a temporal distribution that has the property that any T consecutive requests go to different modules.
Matched Memory
We now discuss the matched-memory case (M=T) and generalize to the unmatched case in the next section.
For the matched-memory case, we choose F as the linear transformation in which the module number $b = (b_{t-1}, \ldots, b_0)$ is obtained as $b_i = a_i \oplus a_{s+i}$, for $0 \le i \le t-1$.
For the rest of this section we assume this mapping. It has the property that when the elements of the vector are requested in order the access is conflict free for vectors with strides of the family with x = s, of any length and with any initial address.
for the definition of N: this can be rewritten $x \ge s+t-\lambda$, or $L = k \cdot P_x$ for some k > 0. Therefore the vector is T-matched. □
Note that for $x < s-N$ when $\lambda-t < s$ it is possible for a vector with length $L = 2^\lambda$ to be T-matched, but this depends on its initial address. Since we consider vectors with any initial address, these cases are not of interest.
Corollary: For fixed $\lambda$ and t, the value of s defines a window of families of strides that produce T-matched vectors.
Up to now we have shown the conditions for a vector to be T-matched. However, the access in order can lead to a high latency because of an unsuitable canonical temporal distribution. In the example of Figure 1, the vectors of length 64 are T-matched for $0 \le x \le 3$. Consider the access of a vector with stride 12 and whose first element is in position 16. Since x = 2 the period is $P_x = 16$ and the $CTP_x$ is 2, 7, 5, 2, 0, 5, 3, 0, 6, 3, 1, 6, 4, 1, 7, 4, and this sequence is repeated for each of the four periods of the vector. The access is not conflict free. In fact, only the family with x = 3 produces a conflict-free canonical temporal distribution.
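The canonical temporal distribution quoted for this example can be reproduced with a few lines of code. This is a sketch under an explicit assumption: the module number is taken to be the bitwise XOR of address bits t-1..0 with address bits s+t-1..s, with s = 3 and t = 3. That mapping is an assumption of the sketch (it is chosen because it reproduces the sequence quoted above), and the names `module_of` and `canonical_distribution` are hypothetical.

```python
def module_of(address, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with bits s+t-1..s."""
    mask = (1 << t) - 1
    return (address & mask) ^ ((address >> s) & mask)


def canonical_distribution(a1, stride, length, s, t):
    """Module numbers when the elements are requested in order."""
    return [module_of(a1 + stride * i, s, t) for i in range(length)]


if __name__ == "__main__":
    # Stride 12 (sigma = 3, x = 2), first element at address 16, one period of 16.
    ctp = canonical_distribution(a1=16, stride=12, length=16, s=3, t=3)
    print(ctp)  # [2, 7, 5, 2, 0, 5, 3, 0, 6, 3, 1, 6, 4, 1, 7, 4]
```

Module 2 is requested again only three cycles after its first use, which is why the ordered access is not conflict free for T = 8.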
Reordering
We now show how to reorder the access of the vector elements so as to achieve a better temporal distribution.
Theorem 2: The elements of a T-matched vector with length $L = k \cdot P_x$ for some k > 0 can be grouped in subsequences such that the temporal distribution of each subsequence is conflict free.
Proof: Since $L = k \cdot P_x$, the vector can be divided into k subvectors of length $P_x$. In each of these subvectors we use the $2^{s-x}$ subsequences defined in Lemma 2. Since the elements of each subsequence are mapped to different modules, the temporal distribution of each subsequence is conflict free. □
For the calculation of the addresses of the consecutive elements in a subsequence it is necessary to increment by $\sigma \cdot 2^s$, instead of $\sigma \cdot 2^x$ for the canonical order. The order of the subsequences is not important. One possibility is to request all subsequences in one period and then go to the following period, and so on. In such a case, the first elements of consecutive subsequences in one period are separated by $\sigma \cdot 2^x$, which is also the separation between the last element of one period and the first of the next. The control to perform the requests in this order is shown in Figure 4.
The hardware required to generate the addresses is shown in Figure 5. To simplify the implementation it is convenient that the compiler issues instructions to load the values $\sigma \cdot 2^x$, $\sigma \cdot 2^s$ and $2^{s-x}$. If this is done, the complexity is practically the same as that for the case in which requests are in order.
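The request order described above (all subsequences of one period, then the next period) can be sketched as three nested loops that only use additions. This is an illustrative software model, not the control of Figure 4; the function name `out_of_order_addresses` and its parameters are hypothetical, and it assumes x is at most s, subsequences of $2^t$ elements, and a vector length that is a multiple of the period.

```python
def out_of_order_addresses(a1, sigma, x, s, t, length):
    """Addresses in the order: period by period, subsequence by subsequence."""
    inner_step = sigma << s            # sigma * 2**s, between elements of a subsequence
    outer_step = sigma << x            # sigma * 2**x, between starts of subsequences
    subseq_per_period = 1 << (s - x)   # 2**(s-x) subsequences in each period
    subseq_len = 1 << t                # 2**t elements per subsequence
    period = subseq_per_period * subseq_len
    if length % period != 0:
        raise ValueError("length must be a multiple of the period")

    addresses = []
    start = a1
    for _ in range(length // period):
        for _ in range(subseq_per_period):
            addr = start
            for _ in range(subseq_len):
                addresses.append(addr)
                addr += inner_step
            start += outer_step
        # The next period starts sigma * 2**x after the last element requested.
        start = addresses[-1] + outer_step
    return addresses


if __name__ == "__main__":
    addrs = out_of_order_addresses(a1=16, sigma=3, x=2, s=3, t=3, length=64)
    print(addrs[:8])    # first subsequence: 16, 40, 64, ...
    print(addrs[8:16])  # second subsequence: 28, 52, 76, ...
```

The generated list is a permutation of the in-order addresses, so every element of the vector is requested exactly once.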
For the previous example, for the first period we obtain two subsequences that contain the vector elements (0, 2, 4, 6, 8, 10, 12, 14) and (1, 3, 5, 7, 9, 11, 13, 15), respectively. These are located in modules (2, 5, 0, 3, 6, 1, 4, 7) and (7, 2, 5, 0, 3, 6, 1, 4). Note that even if each subsequence is conflict free, the access to the whole vector is not, because the temporal distributions of the different subsequences are not the same. Consequently, in the next section we give an additional reordering that provides this conflict-free access.
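The two subsequences of this example can be recomputed as follows. The sketch assumes the module number is the XOR of address bits t-1..0 with bits s+t-1..s (an assumption that reproduces the module lists quoted above); `module_of` and `subsequences_in_period` are hypothetical names, and element indices are 0-based as in the example.

```python
def module_of(address, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with bits s+t-1..s."""
    mask = (1 << t) - 1
    return (address & mask) ^ ((address >> s) & mask)


def subsequences_in_period(a1, sigma, x, s, t):
    """Return [(element_indices, modules), ...] for the first period."""
    stride = sigma << x
    n_sub = 1 << (s - x)
    result = []
    for k in range(n_sub):
        # Consecutive elements of a subsequence are 2**(s-x) positions apart.
        elems = [k + n_sub * j for j in range(1 << t)]
        mods = [module_of(a1 + stride * e, s, t) for e in elems]
        result.append((elems, mods))
    return result


if __name__ == "__main__":
    for elems, mods in subsequences_in_period(a1=16, sigma=3, x=2, s=3, t=3):
        print(elems, mods)
```

Each subsequence visits all eight modules once, but in a different order, which is what causes conflicts at the boundary between subsequences.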
However, it is worth noticing that with two buffers at the input of each module and one buffer at the output, the above-mentioned ordering produces a latency which is not greater than 2T+L cycles; that is, the increase in latency due to the non-conflict-free access is at most T-1 cycles [15].
Conflict-free Ordering
Even though the additional latency associated with the ordering described in the previous section is low in practice (since L>>T), we now propose a scheme that eliminates this additional latency and permits the access of the whole vector in a conflict-free manner.
To achieve this, it is necessary to incorporate a second reordering so that the temporal distribution of all subsequences is the same. However, this poses a problem with the calculation of the addresses inside the subsequences, since to have a simple incremental calculation (adding $\sigma \cdot 2^s$) it is necessary to do this in the order described in the previous section. The solution is to decouple the calculation of the addresses from the actual requests. This is achieved by calculating the addresses of subsequence i+1 while accessing subsequence i.
Consequently, during the first $2^t$ cycles, it is necessary to calculate the addresses of the first subsequence (which are used immediately for memory access) and of the second subsequence (which are stored in a set of latches for access as the next subsequence). After that, for each subsequence, the addresses for access are obtained from the latches and a new address is calculated to store. Consequently, as shown in Figure 6, two address generators are needed, although one of them is only used in the first $2^t$ cycles. Moreover, it is necessary to store the temporal distribution of the first subsequence, which is used to control the order of the requests of the following subsequences. In addition to the latches in the processor, no buffers are needed in the memory modules.
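The effect of the second reordering can be modeled in software. The sketch below is not the hardware of Figure 6: it models the latches as a dictionary keyed by module number and assumes the module number is the XOR of address bits t-1..0 with bits s+t-1..s. The names `module_of`, `conflict_free_order` and `is_conflict_free` are hypothetical.

```python
def module_of(address, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with bits s+t-1..s."""
    mask = (1 << t) - 1
    return (address & mask) ^ ((address >> s) & mask)


def is_conflict_free(modules, T):
    """True if every window of T consecutive requests hits distinct modules."""
    return all(len(set(modules[i:i + T])) == T
               for i in range(len(modules) - T + 1))


def conflict_free_order(a1, sigma, x, s, t, length):
    """Return (element, address, module) triples in the reordered request order."""
    T = 1 << t
    stride = sigma << x
    n_sub = 1 << (s - x)
    period = n_sub * T
    requests, reference = [], None
    for p in range(length // period):
        for k in range(n_sub):
            elems = [p * period + k + n_sub * j for j in range(T)]
            subseq = [(e, a1 + stride * e, module_of(a1 + stride * e, s, t))
                      for e in elems]
            if reference is None:
                # Module order of the first subsequence is remembered.
                reference = [m for _, _, m in subseq]
            else:
                # Latches: look up each later request by its module number.
                latches = {m: (e, a, m) for e, a, m in subseq}
                subseq = [latches[m] for m in reference]
            requests.extend(subseq)
    return requests


if __name__ == "__main__":
    reqs = conflict_free_order(a1=16, sigma=3, x=2, s=3, t=3, length=64)
    print(is_conflict_free([m for _, _, m in reqs], T=8))
```

Because every subsequence is issued in the module order of the first one, the module sequence of the whole vector repeats with period T, so any T consecutive requests go to different modules. The dictionary lookup raises `KeyError` if a subsequence does not cover the same modules as the first, which would indicate a stride outside the conflict-free window.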
Choice of s
As shown previously, the proposed scheme achieves conflict-free access to the families of strides $\sigma \cdot 2^x$ such that $s-N \le x \le s$, and the choice of s determines the window of conflict-free strides. Since the family for x = 0 includes all the odd strides (and in particular stride one), it is certainly convenient to include this family by making $s \le \lambda-t$; the largest window occurs when $s = \lambda-t$. In such a case, the conflict-free strides belong to the families with $0 \le x \le \lambda-t$.
For example, for L = 128 and m = t = 3 we choose s = 4 and obtain conflict-free access for the families defined by x = 0, 1, ..., 4.
. s+m-t
In this case, the vectors with strides belonging to the families x = s, s+1, ..., s+m-t are accessed in order and the rest are accessed out of order using the results of Section 3. In particular, if $s = \lambda-t$, the conflict-free families have $0 \le x \le \lambda+m-2t$.
A Better Mapping
We now consider a way of increasing the number of conflict-free strides for vectors of length L and out-of-order access. To simplify our discussion we consider the special case where m = 2t. For this case we use F as the following mapping:
For the rest of this section we assume this mapping. An example for t = 2, m = 4, s = 3 and y = 7 is given in Figure 7. This mapping corresponds to a division of the modules into T sections of T modules and of the address space into blocks of $2^y$ locations; each block is mapped into one section, using the mapping defined by the lower t bits of b.
The period for a stride of the family $\sigma \cdot 2^x$ is $P_x = \lceil 2^{y+t-x} \rceil$. To correctly partition the set of families of strides, we will assume from now on that $y-R \ge s+1$; this implies $R = \lambda-t$ and $y \ge \lambda-t$. Therefore there are two groups of families of strides, those with x in [s-N, s] and those with x in [y-R, y]. As a consequence of the previous lemmas and theorem, depending on the value of x, one of the following subsequences is used:
i) for $s-N \le x \le s$, we define subsequences as stated in Lemma 2. We know that their elements are mapped into different modules because they have different combinations in the bits $a_{s+t-1..s}$; therefore the temporal distribution is conflict free. For some values of $\sigma$ and $A_1$, the bits $a_{y+t-1..y}$ of the elements in one subsequence may also vary; in this case several sections are visited in one subsequence, which also leads to a conflict-free temporal distribution of the subsequence.
ii) for $y-R \le x \le y$, we define subsequences as stated in Lemma 4. Their elements are mapped into different sections, so the temporal distribution is conflict free.
For example, consider the mapping shown in Figure 7 and let x = 4, $\sigma$ = 1, $A_1$ = 6 and L = 32. The elements of the period $P_x$ are marked in italic in Figure 7. There are eight subsequences in one period, corresponding to the elements of the vector (0, 8, 16, 24), (1, 9, 17, 25), (2, 10, 18, 26), ..., (7, 15, 23, 31). These elements are located in the modules (2, 6, 10, 14), (0, 4, 8, 12), (2, 6, 10, 14), ..., (0, 4, 8, 12).
For the calculation of the addresses we could use the same algorithm as in Section 3.1, where the increment of address in the inner loop is either $\sigma \cdot 2^s$ or $\sigma \cdot 2^y$.
As in Section 3, we have obtained subsequences whose temporal distribution is conflict free; however, this might not be the case for the whole vector. For example, consider x = 6, $\sigma$ = 3 and $A_1$ = 0. In this case, $P_x$ = 8, so there are two subsequences in one $CTP_x$, corresponding to the vector elements (0, 2, 4, 6) and (1, 3, 5, 7). These elements are located in the modules (0, 12, 8, 4) and (4, 0, 12, 8), respectively. Again, next we give an additional reordering that provides conflict-free access to the whole vector.
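The module numbers in the two examples above can be recomputed with a small sketch. It rests on an assumed form of the mapping for m = 2t: the lower t bits of the module number are the XOR of address bits t-1..0 with bits s+t-1..s, and the upper t bits (the section) are address bits y+t-1..y. This form is an assumption that reproduces the quoted module lists for t = 2, s = 3, y = 7; the function name `module_of_unmatched` is hypothetical.

```python
def module_of_unmatched(address, s, y, t):
    """Assumed mapping for m = 2t: section from bits y+t-1..y, XOR inside it."""
    mask = (1 << t) - 1
    within_section = (address & mask) ^ ((address >> s) & mask)
    section = (address >> y) & mask
    return (section << t) | within_section


def modules_of_elements(a1, stride, elements, s, y, t):
    """Module numbers of the given (0-based) vector elements."""
    return [module_of_unmatched(a1 + stride * e, s, y, t) for e in elements]


if __name__ == "__main__":
    # x = 4, sigma = 1, A1 = 6: stride 16.
    print(modules_of_elements(6, 16, (0, 8, 16, 24), s=3, y=7, t=2))
    # x = 6, sigma = 3, A1 = 0: stride 192.
    print(modules_of_elements(0, 192, (0, 2, 4, 6), s=3, y=7, t=2))
    print(modules_of_elements(0, 192, (1, 3, 5, 7), s=3, y=7, t=2))
```

In the second example both subsequences touch the same four modules but start in different sections, so module 4 would be requested twice in a row at the subsequence boundary.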
4.2. Conflict-free reordering for unmatched memory
In the case of matched memory, all subsequences contain all modules, so it is sufficient to remember the order in which the first one is requested and use it to access the remaining subsequences. This is not the case for unmatched memory, since different subsequences may contain different modules.
To achieve the conflict-free access to the whole vector we will apply the same strategy as in Section 3.2 with some modifications depending on the value of x, as explained below.
The following definition is useful:
A section is composed of $2^t$ modules labeled 0 to $2^t-1$. The SUPERMODULE i consists of the i-th module of each section. That is, the supermodule number of an address is determined by the bits $a_{s+t-1..s}$.
Two cases have to be considered, as follows:
i) $s-N \le x \le s$: for this case, as stated before, the subsequences used are those defined in Lemma 2. Since the $2^t$ elements in these subsequences have different combinations in the bits $a_{s+t-1..s}$, all supermodules appear in each subsequence. Therefore we can apply the same strategy as in Section 3.2 but at the supermodule level, i.e., the supermodule numbers of the first subsequence must be remembered and the elements of the remaining subsequences should be requested in the same "supermodule order" as the first one. Note that this implies that two latches are needed per supermodule, not per module. This results in $2 \cdot 2^t$ latches (not $2 \cdot 2^m$).
ii) $y-R \le x \le y$: now the subsequences are defined by Lemma 4. Since their elements are in different sections, it is sufficient to request each subsequence in the section order of the first subsequence.
In summary,
for $x \le s$, the supermodule order of the first subsequence is stored (bits $b_{t-1..0}$) and the latches are labeled by supermodule.
for $x > s$, the section order of the first subsequence is stored (bits $b_{2t-1..t}$) and the latches are labeled by section.
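The two rules in this summary can be sketched as a single reordering routine that keys the latches either by supermodule (lower t bits of the module number) or by section (upper t bits). This is an illustrative model only; the function name `reorder_like_first` and the representation of each subsequence as a list of module numbers are assumptions.

```python
def reorder_like_first(subsequences, t, x, s):
    """Reorder every subsequence to follow the order of the first one.

    For x <= s the order is tracked by supermodule (module bits t-1..0);
    for x > s it is tracked by section (module bits 2t-1..t).
    """
    mask = (1 << t) - 1
    if x <= s:
        key = lambda module: module & mask   # supermodule number
    else:
        key = lambda module: module >> t     # section number
    reference = [key(m) for m in subsequences[0]]
    reordered = [list(subsequences[0])]
    for sub in subsequences[1:]:
        latches = {key(m): m for m in sub}   # one latch per supermodule or section
        reordered.append([latches[k] for k in reference])
    return reordered


if __name__ == "__main__":
    # Module lists of the example with x = 6 > s = 3 and t = 2.
    print(reorder_like_first([[0, 12, 8, 4], [4, 0, 12, 8]], t=2, x=6, s=3))
```

For the example with x = 6, the second subsequence is issued as (0, 12, 8, 4), so the concatenated request stream never revisits a module within T = 4 cycles.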
Choice of s and y
For the reordering proposed, we know how to obtain a conflict-free access to the families with x = s-N, ..., s and x = y-R, ..., y; making $y-R = s+1$, a single window of N+R+2 families is obtained. As discussed in Section 3, a convenient choice is $s = \lambda-t$; for this case, to achieve the largest possible single window, we choose $y = \lambda-t+1+\lambda-t = 2(\lambda-t)+1$.
Consequently, the conflict-free families correspond to $0 \le x \le 2(\lambda-t)+1$.
Compared to the mapping used at the beginning of Section 4, this provides $\lambda-t+1$ additional families.
For example, for L = 128, T = 8 and M = 64 we choose s = 4 and y = 9 and obtain conflict-free access for the families defined by x = 0, 1, ..., 9.
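The parameter choice above is simple arithmetic and can be written down directly. In this sketch the function name `choose_s_and_y` is hypothetical and `lam` stands for the exponent of the vector length (L equal to two to the power `lam`).

```python
def choose_s_and_y(lam, t):
    """Largest single window for m = 2t: s = lam - t, y = 2*(lam - t) + 1."""
    s = lam - t
    y = 2 * (lam - t) + 1
    families = list(range(0, y + 1))  # conflict-free families x = 0 .. y
    return s, y, families


if __name__ == "__main__":
    # L = 128 (lam = 7), T = 8 (t = 3), M = 64 (m = 6 = 2t).
    s, y, families = choose_s_and_y(lam=7, t=3)
    print(s, y, families)
```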
Evaluation
In this section we present a discussion of the effectiveness of the proposed scheme, in terms of its efficiency, the cost of implementation, and its completeness to handle any stride and any vector length. Moreover, we comment on the possibility of using chaining of LOAD/STORE and EXECUTE in important particular cases.
A) Fraction of strides that are conflict free.
We now determine the fraction f of conflict-free strides for the choices of s and y presented in Sections 3 and 4.
Since the fraction of strides belonging to family $\sigma \cdot 2^x$ is $1/2^{x+1}$, the fraction of conflict-free strides produced by a window from x = 0 to x = w is $f = \sum_{x=0}^{w} 2^{-(x+1)} = 1 - 2^{-(w+1)}$.
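The fraction of conflict-free strides for a window from x = 0 to x = w is the sum of the family fractions $1/2^{x+1}$, which equals $1 - 2^{-(w+1)}$. The sketch below evaluates it with exact rational arithmetic; the function name `conflict_free_fraction` is hypothetical.

```python
from fractions import Fraction


def conflict_free_fraction(w):
    """Sum of 1 / 2**(x+1) over the families x = 0 .. w."""
    return sum(Fraction(1, 2 ** (x + 1)) for x in range(w + 1))


if __name__ == "__main__":
    # Matched example: L = 128, t = 3, so w = 7 - 3 = 4.
    print(conflict_free_fraction(4))   # 31/32
    # Unmatched example with m = 2t: w = 2 * (7 - 3) + 1 = 9.
    print(conflict_free_fraction(9))   # 1023/1024
```

Each additional family halves the remaining fraction of conflicting strides, so the families added at the upper end of the window contribute progressively less.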
Consequently, for the matched memory system case ($w = \lambda-t$) we get $f = 1 - 2^{-(\lambda-t+1)}$.
In general, divide the vector into two parts, one of length equal to V above and the other of the rest. Access the first part using the scheme of Sections 3 and 4 and the second part in order. This separation can be done by the compiler.
ii) the length of the vector is a multiple of L. This occurs for the case of multiple-size registers. In such a case the same scheme described in Sections 3 and 4 is used for each portion of length L. The overall efficiency is the same as for vectors of length L.
D) Complexity of the hardware: address calculation, buffers, arbitration, and register addressing.
As shown in Figure 6, the address calculation for conflict-free access in out-of-order mode requires two address generators instead of one for the standard ordered scheme. Moreover, it requires a buffer of size 2T, a queue to store the temporal distribution of the first subsequence and an arbiter to issue the remaining subsequences in the same order. Finally, a controller is required for the address calculations in the required order. We estimate that the hardware cost of these components is a minor part of the cost of the memory subsystem. To achieve the required throughput, it might be necessary to pipeline the adder (this would also be needed in the standard ordered access, and the latency of the adder would also have to be added to the latency of the vector access, unless several vector accesses are chained together). Moreover, the usual techniques have to be used to eliminate the dependencies in the calculation of successive addresses.
In addition to the buffers mentioned before, no additional ones are needed. This is in contrast with other proposals that include a significant number of buffers to eliminate the effect of an unsuitable temporal distribution.
To support the out-of-order access, elements of the vector register have to be addressed out of order. Consequently, this register has to be of the random access type, whereas for ordered access and return a FIFO organization is adequate.
E) Efficiency of more memory modules.
As has been seen in Section 4, the addition of modules to make the memory unmatched increases the number of families of strides that are conflict free. However, this is obtained at a large expense, because to double the number of conflict-free families it is necessary to square the number of modules. This is aggravated by the fact that the added families contain fewer strides and that these strides are probably less frequently used.
Of course, the addition of memory modules can be justified by other reasons, such as simultaneous access to several vectors and non-vector access.
F) Possibility of chaining of LOAD and EXECUTE.
As mentioned in the introduction, the complicated timing produced when access in order is coupled with buffers makes it impractical to chain two instructions if one of the operands of the second is being obtained with a LOAD. In contrast, the scheme proposed produces one vector element each cycle in a deterministic order (for conflict-free strides). Consequently, it is possible to perform the chaining if the first instruction is executed using the same order of elements as the LOAD. Note that the sequence of addresses to the register elements is produced anyhow as part of the LOAD.
G) Maximum number of conflict-free families for the unmatched case.
The number of conflict-free families obtained in Section 4 is not the maximum achievable with out-of-order access. In fact, it is possible to have t-1 more families [15]. However, the structure of the subsequences for these t-1 additional families is different from that presented. Because of this, the inclusion of these families would complicate the hardware for address generation and access control.
H) Conflict-free families and vector length.
For unmatched memory, the access in order produces at most t+1 conflict-free families for any vector length (for m = 2t). In contrast, the scheme we propose produces only two conflict-free families for any vector length, but increases to $2(\lambda-t+1)$ the number of conflict-free families for vectors of length $L = 2^\lambda$.
Conclusions
In this paper we have considered the access of vectors of fixed length, equal to the length of a vector register. The access patterns correspond to constant strides and the vector can begin at any address. The basic idea we propose is an out-of-order access of the elements of the vector to achieve conflict-free access for all strides that produce T-matched vectors. We first consider the matched memory case, where M = T. In this case, we obtain a window of $\lambda-t+1$ families of strides that are conflict free, whereas previous schemes that perform the access in order result in a single conflict-free family (for vectors of any length). To achieve this, we divide the vector into subvectors which are accessed in a conflict-free manner. This by itself does not produce conflict-free access to the whole vector, although the added latency is low. To achieve the conflict-free access to the whole vector we propose that an additional set of T addresses be calculated and latched, so that the temporal distribution of all subsequences is the same. The analysis of the required hardware for address calculations shows that, with compiler support, the complexity is similar to that of the address generator for access in order.
We then present an extension of this scheme for the unmatched memory case. For a number of modules $M = T^2$, the size of the conflict-free window is doubled; this compares favorably to the t+1 conflict-free families obtained with ordered access. Note however, as we discuss in Section 5, that the resulting increase in efficiency from the matched to the unmatched case is obtained at the cost of squaring the number of memory modules.
We also discuss the access of vectors that are shorter than the size of the vector registers. In this case, depending on the stride family, we propose a combination of out-of-order and ordered access. If the length of the vector is known at compile time, the division of the vector into these two subvectors can be done by the compiler.
The ideas have been presented using an address mapping based on linear transformations. However, the same results can be achieved with interleaving or with skewing. For this, it is necessary to select in a suitable manner the bits that determine the module number in the interleaved case, and the number of rows to rotate for skewing. The difference between these schemes is the behavior for vectors of length smaller than L.
We plan to extend this work to the case in which several vectors are accessed simultaneously, either in a single processor with several memory ports or in a multiprocessor with vector processors. We also will explore further the use of these techniques for scalar processors with decoupled memory access and execute units.
Acknowledgments
This work has been supported by the Ministry of Education of Spain under contract TIC-299/89 and by the CEPBA (European Center for Parallelism of Barcelona).
