Index Terms: Conflict-free access, Decoupled access, Multi-module memories, Out-of-order access,
Storage schemes, Streams with constant strides, Vector processors.
Introduction
To have sufficient memory bandwidth, the memory of fast processors is organized as several modules that can be accessed simultaneously. To achieve a memory throughput of one access per processor cycle, the number of memory modules should be at least equal to the ratio between the memory cycle and the processor cycle (matched memory system). However, to obtain this throughput, the sequence of requests has to be such that there are no conflicts in the accesses. This is achieved, for example, with conventional memory interleaving and address streams with constant odd strides. However, conflicts occur with other strides and with other address patterns.
In this paper we are concerned with address streams with constant strides. These streams are typical of vector processors (and scalar processors accessing vectors), but appear also in scalar processing when the processor has decoupled memory access and execute. In both cases, to overlap memory access and execution, the processor is decomposed into two independent modules, as shown in Figure 1 . The memory-access module performs LOAD/STOREs of a stream to/from a register file or a local storage; this corresponds to a generalized block transfer with stride not necessarily equal to one. The execute unit obtains the operands from this local storage.
The reduction of latency in the access of constant-stride streams has pursued three different objectives, namely, to achieve conflict-free access for some strides, to achieve peak throughput (for very long streams) for strides that are not conflict free, and to achieve minimum average latency (maybe weighted by the frequency of occurrence of the stride). This has been achieved by the use of alternative addressing schemes, the use of buffers, and the increase in the number of memory modules (unmatched memory system).
In this work we concentrate on the first of these objectives for the largest possible number of strides. We achieve this by accessing the elements of the stream in an out-of-order manner and by constraining the length of the stream to a multiple of the period of the address mapping for that stride. Preliminary results on this work were presented in [15] and extended to obtain conflict-free access to all power-of-two strides in [14] .
As indicated, the scheme we propose limits the length of the streams to be a multiple of the corresponding address-mapping period. This is satisfactory for both vector processor and decoupled-access processors, because in both cases the access to large data structures is divided into shorter pieces (strip-mining). Moreover, we propose that the elements of the strip be requested out of order and that the whole strip be stored in the local memory before its use by the processor.
This prevents the chaining of LOAD/STOREs with other operations; however, this is reasonable because the complex timing of memory accesses makes this chaining difficult anyhow. Even though this decoupled operation would be the default, we discuss in Section 3 that the proposed out-of-order access can support chained access/execute in cases where ordered access would make this mode of operation impractical.
We now briefly review previous related work. The two main address transformation schemes proposed in the literature to achieve conflict-free access to streams with strides that produce conflicts with conventional interleaving are skewing and linear transformations. These schemes were initially proposed for array processors [1, 3, 9, 17] and later for vector processors [2, 5, 6, 16] , multiprocessors [10] , and VLIW processors [12] .
If the access is in order and the memory system is matched, these address transformations provide conflict-free access to one family of strides, where the family defined by x is the set of strides σ⋅2^x with σ odd [8]. Moreover, for the case in which different streams are accessed with strides of different families, dynamic schemes based on skewing [8] and on linear transformations [6] were proposed. In general, linear transformations have the advantage over skewing that the memory module number is usually simpler to compute.
A larger number of families of conflict-free strides can be achieved by increasing the number of memory modules (unmatched memory system). If M = 2^m is the number of memory modules and T = 2^t is the ratio between the memory cycle and the processor cycle, then at most (m-t+1) families are conflict free [6].
Although outside the scope of this paper, it is worthwhile to mention that techniques have also been proposed to improve efficiency for the cases in which conflict-free access is not achieved. For the skewing and linear schemes mentioned above, peak memory throughput can be obtained for x' < x (x being the conflict-free family) for long streams by the use of buffers [5]. Moreover, schemes based on linear transformations have been proposed to randomly distribute the modules corresponding to consecutive addresses, so that the various strides do not produce clustering in memory modules [12, 13, 16]. In [7] a proposal is made for an analytical model that can be used to make comparisons among these linear transformations. For both schemes, most of the evaluations performed consider long streams, so that the initial transient is not significant and the throughput is determined for the steady state. This throughput is evaluated as a function of several parameters, such as structure of the transformation, number of buffers, and number of memory modules. Although in [8, 12, 16] some measurements are given for short streams, the effect of length is not discussed nor is the transformation determined with a stream length in mind.
To introduce the out-of-order accessing we use a matched memory system and then extend the results to the unmatched case. As addressing scheme we use a linear transformation of the addresses, although the same results can be obtained with block-interleaving (using an internal field of the address as module number) or with skewing. We show that this out-of-order accessing results in a larger number of conflict-free strides than can be achieved with ordered access. The hardware required for address calculations is presented and the efficiency of the scheme is evaluated.
Model and Condition for Conflict-Free Access
We now describe the model of the memory subsystem, present some definitions and give the condition for conflict-free access. Figure 2 shows the general structure of the system which has been used by previous authors. The memory is composed of M = 2^m modules and the module latency is T = 2^t processor cycles. Each memory module has q input and q' output buffers. The processor requests one element per (processor) cycle unless it has to wait because the associated input memory buffer is full. The latency of the stream access is defined as the number of processor cycles from the time the processor sends the first address until the last element is received. We assume that the interconnection network is a single bus with a delay of one cycle. Therefore, the latency of a conflict-free access to a stream of length L is (T+L+1) cycles.
We consider the stream length L equal to 2^λ, with λ ≥ m. The first element of the stream has address A_1 and consecutive elements are separated by a constant value S (the stride) so that the i-th element has address A_1 + S⋅(i-1). Note that the stream can have any initial address.
Since the memory is organized in several modules, an address mapping is required which transforms the address A (one-dimensional) with binary representation a_{n-1}, a_{n-2}, ..., a_1, a_0 (in short also denoted a_{n-1..0}) into the two-dimensional space (module, displacement). Since conflicts depend only on the module number part, we only consider that component of the mapping. That is, the module number b with binary representation b_{m-1..0} is given by b = F(A), where F is the memory-module component of the address mapping.
We now define the spatial and temporal distributions of the elements of a stream, since these determine the latency of the access.
Definition:
The SPATIAL DISTRIBUTION of a stream in the multi-module memory is the M-tuple SD, where SD(i) is the number of stream elements in module i. Other authors use similar terms related to the spatial distribution, such as short-term and long-term equidistributed sequences [13], and to the temporal distribution, such as return numbers [11] and variability [7].
The spatial distribution is LATENCY-MATCHED (T-MATCHED) if SD(i)
Definition: A temporal distribution is CONFLICT FREE when every element can be accessed as soon as it is requested (the corresponding memory module is not busy with a previous request). This is equivalent to stating that a temporal distribution is conflict free if any subset of T consecutively requested elements are located in T different memory modules. This is the condition we will require.
Moreover, from the last definition it follows directly that a necessary condition for a conflictfree temporal distribution is that the stream be T-matched. Because of this, to determine the conditions for a conflict-free temporal distribution, we first determine conditions for a T-matched stream and then consider access orders so that any T consecutive accesses are to different memory modules.
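The conflict-free condition above (any T consecutively requested elements lie in T different modules) can be checked mechanically on a sequence of module numbers. The following is a minimal sketch; the function names `is_conflict_free` and `interleaved_modules` are hypothetical helpers introduced here, and conventional interleaving (module = address mod M) is used only as a familiar test case.

```python
def is_conflict_free(modules, T):
    """Return True if every run of T consecutive requests hits T distinct modules."""
    n = len(modules)
    # For sequences shorter than T, the single (short) window must still be duplicate-free.
    for start in range(max(1, n - T + 1)):
        window = modules[start:start + T]
        if len(set(window)) != len(window):
            return False
    return True


def interleaved_modules(a1, stride, length, M):
    """Module sequence under conventional interleaving (module = address mod M)."""
    return [(a1 + i * stride) % M for i in range(length)]


if __name__ == "__main__":
    M = T = 4  # matched memory: as many modules as cycles of module latency
    for stride in (1, 2, 3, 4):
        seq = interleaved_modules(0, stride, 16, M)
        print("stride", stride, "conflict free:", is_conflict_free(seq, T))
```

With M = T = 4 the odd strides pass the check and the even strides fail it, which agrees with the remark in the Introduction about conventional interleaving and odd strides.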
Since the spatial distribution is independent from the temporal distribution, to determine conditions for a T-matched stream we use the canonical temporal distribution, defined below, even though it might not be conflict free.
Definition:
The CANONICAL TEMPORAL DISTRIBUTION of a stream is the temporal distribution when the elements are requested in order.
For any address transformation scheme the canonical temporal distribution is periodic.
Consequently, we can define the canonical temporal distribution in one period and state Lemma 1, below.
The period P_x of an address mapping for a stream with stride σ⋅2^x is the period of its canonical temporal distribution.
We call CTP_x the canonical temporal distribution in one period for a stream with stride σ⋅2^x.
Definition: CTP_x is T-matched if the spatial distribution in the period is T-matched.
Lemma 1:
If CTP_x is T-matched and L = p⋅P_x with p > 0 then the stream is T-matched.
Proof: This is evident since if each period is T-matched and the stream length is a multiple of the period, the stream has to be T-matched. □
In summary, we consider streams with L = p⋅P x , determine the conditions for having a T-matched CTP x and then find a temporal distribution that has the property that any T consecutive requests go to different modules.
Conflict-Free Access for a Large Number of Families
Matched-Memory Case
We now discuss the matched-memory case (M=T) and generalize to the unmatched case in the next section.
For the matched-memory case we use the linear transformation
b_{t-1..0} = a_{t-1..0} ⊕ a_{s+t-1..s} (1)
For the rest of this section we assume this mapping. It has the property that when the elements of the stream are requested in order the access is conflict free for streams with strides of the family with x = s, with any length and with any initial address [6] . Figure 3 illustrates a portion of the mapping for m = t = 3 and s = 3.
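As an illustration, the sketch below assumes that the mapping takes the form b = a_{t-1..0} XOR a_{s+t-1..s}; this form is an assumption, chosen because it reproduces the module numbers quoted later in the text for the stride-12 example (address 16 in module 2, address 28 in module 7). The helper names are hypothetical. The check confirms the stated property that in-order access is conflict free for the family x = s with any initial address, while stride one is not conflict free in order when s = 3.

```python
def module_number(addr, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with address bits s+t-1..s."""
    mask = (1 << t) - 1
    return (addr & mask) ^ ((addr >> s) & mask)


def in_order_modules(a1, stride, length, s, t):
    """Module numbers visited when the stream is requested in order."""
    return [module_number(a1 + i * stride, s, t) for i in range(length)]


def is_conflict_free(modules, T):
    """True if every window of T consecutive requests hits T distinct modules."""
    for start in range(max(1, len(modules) - T + 1)):
        window = modules[start:start + T]
        if len(set(window)) != len(window):
            return False
    return True


if __name__ == "__main__":
    s, t = 3, 3          # m = t = 3 and s = 3, as in the Figure 3 example
    T = 1 << t
    for stride in (8, 24, 1):  # 8 and 24 belong to family x = s; 1 does not
        mods = in_order_modules(0, stride, 64, s, t)
        print("stride", stride, "in-order conflict free:", is_conflict_free(mods, T))
```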
As shown in [6], the period for a stride σ⋅2^x is P_x = 2^{s+t-x}.
We now show that the families with x ≤ s produce T-matched streams (if the length of the stream is a multiple of the period).
Lemma 2: Let x ≤ s and P_x be the corresponding period. Consider the grouping of these P_x elements into 2^{s-x} substreams consisting of 2^t elements each, so that the j-th substream (1 ≤ j ≤ 2^{s-x}) contains the elements (j + i⋅2^{s-x}) with 0 ≤ i ≤ 2^t-1 (Figure 4). For any of these substreams, all its elements are located in different memory modules.
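Proof: Let A_j, with binary representation a_{n-1..0}, be the address of the first element of the j-th substream. Consecutive elements of the substream are 2^{s-x} stream positions apart, so the i-th element of the substream has address A_j + i⋅σ⋅2^s, and bits s+t-1..s of this address are (a_{s+t-1..s} + i⋅σ) mod 2^t. Since σ is odd, the values (i⋅σ) mod 2^t for 0 ≤ i ≤ 2^t-1 are all different. As the bits a_{t-1..0} are independent of i, the 2^t elements are stored in different modules. □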
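The grouping used in Lemma 2 can be checked exhaustively for small parameters. The sketch below assumes the XOR mapping b = a_{t-1..0} XOR a_{s+t-1..s} and numbers stream elements from 0 (as the worked examples in the text do), so substream j holds elements j, j + 2^{s-x}, j + 2⋅2^{s-x}, and so on. All function names are hypothetical.

```python
def module_number(addr, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with address bits s+t-1..s."""
    mask = (1 << t) - 1
    return (addr & mask) ^ ((addr >> s) & mask)


def substream_elements(x, s, t):
    """Element indices (0-based, within one period) of each Lemma 2 substream."""
    n_sub = 1 << (s - x)                      # number of substreams per period
    return [[j + i * n_sub for i in range(1 << t)] for j in range(n_sub)]


def substreams_hit_distinct_modules(a1, sigma, x, s, t):
    """True if every substream of one period visits 2**t different modules."""
    stride = sigma << x                       # stride = sigma * 2**x, sigma odd
    for elems in substream_elements(x, s, t):
        mods = {module_number(a1 + e * stride, s, t) for e in elems}
        if len(mods) != 1 << t:
            return False
    return True


if __name__ == "__main__":
    s, t = 3, 3
    ok = all(
        substreams_hit_distinct_modules(a1, sigma, x, s, t)
        for x in range(s + 1)
        for sigma in (1, 3, 5, 7)
        for a1 in range(64)
    )
    print("Lemma 2 holds on all tested cases:", ok)
```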
Lemma 3:
The families of strides that produce T-matched CTP_x are those defined by x = 0, 1, ..., s.
Proof: If x ≤ s, because of Lemma 2 the elements (j + i⋅2^{s-x}) mod P_x for 0 ≤ i ≤ 2^t-1 are in different memory modules. Taking j = 1, 2, ..., 2^{s-x} we obtain 2^{s-x} substreams of 2^t elements mapped into different memory modules, so each module contains 2^{s-x} elements; therefore CTP_x is T-matched. On the other hand, if x > s the elements are mapped into just 2^{s+t-x} modules, so not all modules are visited (s+t-x < t = m). Therefore, CTP_x is not T-matched. □

{1} To simplify the notation we mix arithmetic operations, such as integer addition and multiplication, with boolean operators such as XOR. The context is used to determine whether the operand is a bit string or the corresponding integer.
Theorem 1:
The families of strides defined by s-N ≤ x ≤ s, where N = min(λ-t, s), produce T-matched streams of length L = 2^λ.
Proof: For a stream to be T-matched it is sufficient that CTP_x be T-matched and that L = p⋅P_x (Lemma 1). Because of Lemma 3, CTP_x is T-matched for all x ≤ s. If x ≥ s-N then x ≥ s-(λ-t) by the definition of N; this can be rewritten as λ ≥ s+t-x, or L = p⋅P_x for some p > 0. Therefore the stream is T-matched. □
Note that for x < s-N when λ-t < s it is possible for a stream with length L = 2 λ to be T-matched, but this depends on its initial address. Since we consider streams with any initial address, these cases are not of interest.
Corollary: For fixed λ and t, the value of s defines a window of families of strides that produce T-matched streams, as shown in Figure 5 .
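The window of Theorem 1 is simple to tabulate. The sketch below uses hypothetical helper names: `stride_family` returns the exponent x of a stride σ⋅2^x, and `t_matched_families` lists the families s-N, ..., s with N = min(λ-t, s).

```python
def stride_family(stride):
    """Exponent x such that stride = sigma * 2**x with sigma odd."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    # stride & -stride isolates the lowest set bit, i.e. 2**x.
    return (stride & -stride).bit_length() - 1


def t_matched_families(lam, t, s):
    """Families x with s-N <= x <= s, where N = min(lam - t, s) (Theorem 1)."""
    n = min(lam - t, s)
    return list(range(s - n, s + 1))


if __name__ == "__main__":
    # Streams of length 64 (lam = 6) with t = 3 and s = 3.
    window = t_matched_families(6, 3, 3)
    print("window:", window)
    for stride in (7, 12, 16):
        x = stride_family(stride)
        print("stride", stride, "family", x, "in window:", x in window)
```

For λ = 6, t = 3 and s = 3 this gives the families 0 to 3; shortening the stream to λ = 5 shrinks the window to 1 to 3, which shows how the window slides with s and narrows with the stream length.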
Up to now we have shown the conditions for a stream to be T-matched. However, the access in order can lead to a high latency because of an unsuitable canonical temporal distribution. For instance, the mapping described in Figure 3 results in T-matched streams of length 64 for 0 ≤ x ≤ 3; however, for example, for a stream of stride 7 (x=0, σ=7), beginning at address 63 the period is P x = 64 and the CTP x is 6, 0, 6, 4, 6, 0, 6, 4, 5, 5, 1, 1, 5, 5, 1, 1, ..., 7, 7, 7, 7, 7, 7, 7, 7
The access is not conflict free. In fact, only the family with x = 3 produces a conflict-free canonical temporal distribution.
For the balanced streams that do not produce conflict-free access it is possible to incorporate buffers so that after an initial latency, one data element is obtained per processor cycle. 
Reordering
We now show how to reorder the access of the stream elements so as to achieve a conflict-free temporal distribution.
Theorem 2:
A stream of length L = p⋅P_x and x ≤ s can be divided into p⋅2^{s-x} substreams of the form described by Lemma 2, each of which is conflict free.
Proof: Since L = p⋅P_x, the stream can be divided into p periods of length P_x. In each of these periods we use the 2^{s-x} substreams defined in Lemma 2. Since the elements of each substream are mapped into different modules, the temporal distribution of each substream is conflict free. □
As a consequence of Theorem 2, to achieve conflict-free access the stream is divided into substreams and these substreams are accessed in sequence. However, this assures conflict-free access for each substream accessed separately, but not for the whole stream. For instance, for a stream of stride 12 (x=2, σ=3), beginning at address 16 and length 64, the mapping described in Figure 3 produces a canonical temporal distribution of period P_x = 16 that is repeated for each of the four periods of the stream. Each period is divided into two substreams and therefore the whole stream into eight substreams. The two substreams of the first period contain elements (0, 2, 4, 6, 8, 10, 12, 14) and (1, 3, 5, 7, 9, 11, 13, 15), respectively, which are located in memory modules (2, 5, 0, 3, 6, 1, 4, 7) and (7, 2, 5, 0, 3, 6, 1, 4).
Since there are conflicts among the two substreams, we propose in the next section an additional reordering to provide conflict-free access.
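The stride-12 example can be reproduced numerically. The sketch below assumes the XOR mapping b = a_{t-1..0} XOR a_{s+t-1..s} with m = t = 3 and s = 3 (an assumption that matches the module numbers quoted in the text); helper names are hypothetical. It shows that each substream alone passes the sliding-window test, while the two substreams accessed back to back do not, because module 7 is requested in two consecutive cycles at the boundary.

```python
def module_number(addr, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with address bits s+t-1..s."""
    mask = (1 << t) - 1
    return (addr & mask) ^ ((addr >> s) & mask)


def substream_modules(a1, stride, elems, s, t):
    """Module numbers of the given stream elements (0-based element indices)."""
    return [module_number(a1 + e * stride, s, t) for e in elems]


def is_conflict_free(modules, T):
    """True if every window of T consecutive requests hits T distinct modules."""
    for start in range(max(1, len(modules) - T + 1)):
        window = modules[start:start + T]
        if len(set(window)) != len(window):
            return False
    return True


A1, STRIDE, S, T_BITS = 16, 12, 3, 3
FIRST = substream_modules(A1, STRIDE, range(0, 16, 2), S, T_BITS)
SECOND = substream_modules(A1, STRIDE, range(1, 16, 2), S, T_BITS)

if __name__ == "__main__":
    T = 1 << T_BITS
    print("first substream :", FIRST, is_conflict_free(FIRST, T))
    print("second substream:", SECOND, is_conflict_free(SECOND, T))
    print("back to back    :", is_conflict_free(FIRST + SECOND, T))
```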
However, it is worth noticing that with one buffer at the input of each module and one buffer at the output, the above-mentioned ordering produces a latency which is not greater than 2⋅T+L cycles; that is, the increase in latency due to the non-conflict-free access is at most T-1 cycles.
Note that the additional latency produced and the number of buffers required are significantly smaller than for in-order access with buffers. This is shown in the following Theorem.
Theorem 3:
The additional latency of a stream access with the reordering proposed above is of at most T-1 cycles.
Proof: Let m_1 ... m_M, in this order, be the module numbers accessed by the first substream. Then we can make the following observations:
- Module m_M begins operation in cycle M and operates continuously thereafter. This is so because when it finishes with one access the addresses of another complete substream have been calculated, and one of the addresses has to correspond to this module. The last element of this module is available to be sent to the processor in cycle M+(L/M)⋅T, which is equal to T+L since M=T.
- Module m_{M-1} begins operation in cycle M-1 and is idle at most one cycle thereafter. This occurs when this module appears as the last module of some substream. Consequently, the last element of this module is available to be sent to the processor in cycle T-1+L+1 = T+L.
- Generalizing the process described above, module m_i begins operation in cycle i, is idle at most T-i cycles thereafter, and has the last element available at cycle i+L+T-i = L+T.
Consequently, in the worst case, in cycle L+T there are M elements ready to be sent to the processor. However, since only one element can be sent per cycle, the total number of cycles is L+T+T = L+2⋅T. Since a conflict-free access requires T+L+1 cycles, the additional latency is T-1 cycles. The waiting requests are stored in buffers associated with each memory module. Since each substream has only one request per module, only one buffer per module is required. □
We now consider the calculation of the addresses. This is divided into two parts. The hardware required is shown in Figure 7, which also includes the calculation of the local store address. To simplify the implementation it is convenient that the compiler issues instructions to load the values σ⋅2^x, σ⋅2^s and 2^{s-x}. If this is done, the complexity is practically the same as that for the case in which requests are performed in order.
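One way to picture the address calculation is as a pair of nested loops that use only additions of the three preloaded constants σ⋅2^x, σ⋅2^s and 2^{s-x}. The sketch below is an assumed software rendering of that idea, not a description of the hardware in Figure 7: the outer loop steps the first address of each substream by σ⋅2^x, and the inner loop steps by σ⋅2^s within a substream. The function and parameter names are hypothetical, and the element index is used as a stand-in for the local store address.

```python
def generate_addresses(a1, stride_x, stride_s, n_sub, T, length):
    """Return (element_index, memory_address) pairs in substream order.

    stride_x = sigma * 2**x   (the stream stride)
    stride_s = sigma * 2**s   (address step inside a substream)
    n_sub    = 2**(s - x)     (substreams per period)
    """
    period = n_sub * T
    if length % period != 0:
        raise ValueError("stream length must be a multiple of the period")
    out = []
    period_addr, period_elem = a1, 0
    for _ in range(length // period):
        sub_addr = period_addr
        for j in range(n_sub):
            addr, elem = sub_addr, period_elem + j
            for _ in range(T):
                out.append((elem, addr))
                addr += stride_s        # inner loop: next element of the substream
                elem += n_sub
            sub_addr += stride_x        # outer loop: first element of next substream
        period_addr += T * stride_s     # one period spans T * sigma * 2**s addresses
        period_elem += period
    return out


if __name__ == "__main__":
    # Stride 12 = 3 * 2**2, s = 3, t = 3, stream of 64 elements starting at address 16.
    pairs = generate_addresses(16, 12, 24, 2, 8, 64)
    print(pairs[:10])
```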
Conflict-free Ordering
Even though the additional latency associated with the ordering described in the previous section is low in practice (since L>>T), we now propose a scheme that eliminates this additional latency and permits the access of the whole stream in a conflict-free manner.
To
necessary to do this in the order described in the previous section. The solution is to decouple the calculation of the addresses from the actual requests. This is achieved by calculating the addresses of substream j+1 while accessing substream j. For this, during the first 2^t cycles, it is necessary to calculate the addresses of the first substream (which are used immediately for memory access) and of the second substream (which are stored in a set of latches for access as the next substream). After that, for each substream, the addresses for access are obtained from the latches and a new address is calculated to store. Consequently, as shown in Figure 8, two address generators are needed, although one of them is only used in the first 2^t cycles. Moreover, it is necessary to store the temporal distribution of the first substream, which is used to control the order of the requests of the following substreams. Apart from the latches in the processor, no buffers are needed in the memory modules and no arbiter is needed on the return bus to the processor.
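The effect of this reordering can be sketched in software: the module order of the first substream is recorded, and the elements of every later substream are issued in that same module order. The code below is an illustrative model only (it ignores the latches and the timing of the two address generators) and assumes the XOR mapping b = a_{t-1..0} XOR a_{s+t-1..s}; all names are hypothetical.

```python
def module_number(addr, s, t):
    """Assumed mapping: XOR of address bits t-1..0 with address bits s+t-1..s."""
    mask = (1 << t) - 1
    return (addr & mask) ^ ((addr >> s) & mask)


def conflict_free_order(a1, stride, length, s, t):
    """Element indices in an order whose module sequence repeats every 2**t requests."""
    x = (stride & -stride).bit_length() - 1      # stride = sigma * 2**x
    T = 1 << t
    n_sub = 1 << (s - x)
    period = n_sub * T
    if x > s or length % period != 0:
        raise ValueError("need x <= s and length a multiple of the period")
    order, first_modules = [], None
    for p in range(length // period):
        for j in range(n_sub):
            elems = [p * period + j + i * n_sub for i in range(T)]
            by_module = {module_number(a1 + e * stride, s, t): e for e in elems}
            if first_modules is None:
                # Temporal distribution of the first substream, kept for later use.
                first_modules = [module_number(a1 + e * stride, s, t) for e in elems]
            order.extend(by_module[m] for m in first_modules)
    return order


if __name__ == "__main__":
    order = conflict_free_order(16, 12, 64, 3, 3)
    print(order[:16])
    print([module_number(16 + 12 * e, 3, 3) for e in order[:16]])
```

For the stride-12 stream starting at address 16, the second substream is issued as elements 3, 5, 7, 9, 11, 13, 15, 1, so that both substreams visit the modules in the order 2, 5, 0, 3, 6, 1, 4, 7.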
Choice of s
As shown previously, the proposed scheme achieves conflict-free access to the families of strides σ⋅2 x such that s-N ≤ x ≤ s, and the choice of s determines the window of conflict-free strides. Since the family for x = 0 includes all the odd strides (and in particular stride one), it is certainly convenient to include this family by making s ≤ λ-t; the largest window occurs when s = λ-t. In such a case, the conflict-free strides belong to the families with 0 ≤ x ≤ λ-t.
For example, for L = 128 and m = t = 3 we choose s = 4 and obtain conflict-free access for the families defined by x = 0, 1, 2, 3, 4.
Unmatched Memory
One way to increase the number of families of strides that produce conflict-free access, is to increase the number of memory modules (m > t).
A possible mapping for the unmatched memory case is to use the same one as defined in (1), replacing t by m. In such a case, the period P_x is 2^{s+m-x}, and for ordered access, any stream length and any initial address, conflict-free access is obtained for the families defined by x = s, ..., s+m-t.
This can be combined with the technique presented in Section 3.1 for streams of length L, so that conflict-free access is obtained for the families defined by x = s-N, s-N+1, ..., s, ..., s+m-t.
A Better Mapping
We now consider a way of increasing the number of conflict-free strides for streams of length L and out-of-order access. To simplify our discussion we consider the special case where m=2⋅t. For this case we use the following mapping: (2) For the rest of this section we assume this mapping. An example for t = 2, m = 4, s = 3 and y = 7 is given in Figure 9 .
This address mapping corresponds to a division of the modules into T sections of T modules and of the address space into blocks of 2^y locations; each block is mapped into one section, using the mapping defined by the lower t bits of b. The period for a stride of the family σ⋅2^x is P_x = 2^{y+t-x}.
We now determine the set of families of strides that produce T-matched streams and an out-of-order temporal distribution to achieve conflict-free access. The families defined by x = s-N, ..., s and x = y-R, ..., y, where N = min(λ-t, s) and R = min(λ-t, y), produce T-matched streams of length L = 2^λ. This defines two windows of conflict-free families, as shown in Figure 10.
ii) for y-R ≤ x ≤ y, we define substreams as stated in Lemma 4. In this case, the elements of each substream are mapped into different sections, so its temporal distribution is conflict free.
As an example, consider the mapping shown in Figure 9 and let x=4, σ=1, A_1=6 and L=32. The addresses of the elements in one period P_x=32 are shown in bold in Figure 9. This period is divided into eight substreams, corresponding to the stream elements as follows: (0, 8, 16, 24), (1, 9, 17, 25), (2, 10, 18, 26), ..., (7, 15, 23, 31). These elements are stored in the memory modules
For the calculation of the addresses we use the same algorithm as in Section 3.1.1, where the increment of address in the inner loop is either σ⋅2^s or σ⋅2^y.
As in Section 3.1, we have obtained substreams whose temporal distribution is conflict free; however this might not be the case for the whole stream. For example, consider x=6, σ=3 and A 1 =0.
In this case, P x =8, so there are two substreams in one CTP x , corresponding to the stream elements give an additional reordering that provides conflict-free access to the whole stream.
Conflict-free reordering
To achieve conflict-free access to the whole stream we follow a similar strategy as that presented in Section 3.1.2 for matched memory. However, in that case all substreams visit all memory modules so it is sufficient to remember the order in which the first substream is accessed and use it for the remaining substreams. This is not the case for the unmatched memory case, since different substreams may visit different modules. Because of this we modify slightly the strategy, as explained below. The following definition is useful:
Definition: A section is composed of 2^t modules labeled 0 to 2^t-1. The SUPERMODULE i consists of the i-th module of each section. That is, the supermodule number of an address is determined by the bits a_{s+t-1..s}.
Two cases have to be considered, as follows:
i) s-N ≤ x ≤ s: for this case, as stated before, the substreams used are those defined in Lemma
ii) for x > s, the section order of the first substream is stored (bits b_{2t-1..t}) and the latches are labeled by section.
Choice of s and y
For the proposed reordering, we obtain conflict-free access for the families with x = s-N, ..., s and x = y-R, ..., y. To obtain the largest single window which contains x=0, we make s=λ-t and y = s+1+λ-t= 2 ⋅(λ-t)+1. This results in conflict-free access for 0 ≤ x ≤ 2 ⋅(λ-t)+1. Compared to the mapping used at the beginning of Section 3.2, this provides λ-t+1 additional families.
For example, for L = 128, T = 8 and M = 64 we choose s = 4 and y = 9 and obtain conflict-free access for the families defined by x = 0, 1, ..., 9, as shown in Figure 11 . Moreover, the conflict-free access for families x=4 and x=9 is obtained with access in order without restrictions in the length of the stream. Other families require out-of-order access and stream lengths multiple of their periods (2 6-x if x<6 and 2 9-x if x>6). Note that with this positioning of the y field, the elements that compose the substreams are the same for an x of window 1 and the corresponding x of window 2.
Additional conflict-free families
In this section we show how to obtain additional conflict-free families of strides. In the previous section we obtained two windows of conflict-free families, as follows: We now consider the case in which the substream uses some bits of field a s+t-1...s and some bits of field a y+t-1...y , to produce the required 2 t different combinations. This is useful for families with s+1 ≤ x ≤ s+t-1, which produce T-matched streams as stated in Lemma 5. Note that these families could also be made conflict free using the window rooted in y, but this requires that the corresponding window covers the field a s+t-1...s . By following the approach discussed here, it is possible to move the y-rooted window and increase the number of conflict-free families.
We now find substreams whose elements are located in different memory modules and are therefore conflict free. Consider the grouping of the stream elements into substreams so that the addresses of the 2 t elements of a substream have different values in bits a v+(x-s)-1..v (x-s bits) and a s+t-1..x (t-(x-s) bits) with y ≤ v ≤ y+t-(x-s) and y ≥ s+t as shown in Figure 12 .a. Because of mapping Figure   12 .b. In such a case, the initial element of the j th substream (0 ≤ j ≤ 2 λ-t -1) is
Note that now the separation between addresses of consecutive substream elements is not a constant: in the first part of the substream the elements are separated by σ⋅2^x whereas in the second part they are separated by σ⋅2^{λ-(x-s)}.
As before, we now place the three windows so as to obtain the largest possible single window which begins at x=0. This results in s=λ-t and y=s+λ. The total window size is then 2⋅λ-t.
For example, for L=128, T=8 and M=64, we choose s=4 and y=11 (instead of y=9 as done in Section 3.2.3) as shown in Figure 13 .
The three windows of conflict-free families are [0, ..., 4], [7, ..., 11] and [5, 6]. In the same figure we also show the stream elements that make up the substreams for the new conflict-free families.
For instance, a stream of the family x=5 has 16 substreams of 8 elements each as defined above.
The existence of different compositions for the substreams complicates the hardware for address generation and access control. This is the reason why these additional families have not been considered in the development of the previous section.
Generalization to any degree of unmatchness
In the previous section we have used a memory system where the number of memory modules M = 2^m is equal to T^2. The method can be easily generalized to systems with any degree of unmatchness U = M/T (including the matched case as a particular case).
Consider that m = k⋅t + k' (i.e., U = 2^u = 2^{(k-1)⋅t+k'}). In this case, the proposed address mapping has k bit-fields: the first k-1 have t bits and the last has t+k' bits.
If the method described in Section 3.2.1 is used, the fields are positioned starting at bits (λ-t)+(λ-t+1)⋅i with i = 0, 1, ..., k-1. In this case, k⋅(λ-t+1)+k' conflict-free families are obtained. On the other hand, with the method described in Section 3.2.4 the fields are positioned with the low bits in (λ-t)+i⋅λ with i = 0, 1, ..., k-1. In this case, k⋅λ+k'-t+1 conflict-free families are obtained.
For example, Figure 14.a shows the address mapping that is obtained for t=2 and m=7 (k=3 and k'=1, and therefore a degree of unmatchness U=32) and λ=5. (Section 3.2.4) ).
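As a worked check of the two family counts for this example, the arithmetic can be written out as follows. The numbers are obtained only by substituting t = 2, m = 7 and λ = 5 into the expressions given above; they are not taken from Figure 14.

```latex
\begin{align*}
m &= k\,t + k' = 3\cdot 2 + 1 = 7, \qquad U = M/T = 2^{m-t} = 2^{5} = 32,\\
k(\lambda - t + 1) + k' &= 3\,(5 - 2 + 1) + 1 = 13 \quad \text{(field placement of Section 3.2.1)},\\
k\lambda + k' - t + 1 &= 3\cdot 5 + 1 - 2 + 1 = 15 \quad \text{(field placement of Section 3.2.4)}.
\end{align*}
```

The same expressions with k = 2, k' = 0, t = 3 and λ = 7 give 10 and 12 families, which agrees with the windows 0 to 9 and 0 to 11 of the earlier examples with L = 128, T = 8 and M = 64.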
Evaluation
In this section we present a discussion of the effectiveness of the proposed scheme, in terms of its efficiency, and the cost of implementation. Moreover, we comment on the possibility of using chaining of LOAD/STORE and EXECUTE in important particular cases.
A) Fraction of strides that are conflict free.
We now determine the fraction f of conflict-free strides for the choices of s and y presented in Sections 3.1.3 and 3.2.3. Since the fraction of strides belonging to family σ⋅2^x is 1/2^{x+1}, the fraction of conflict-free strides produced by a window from x=0 to x=w is f = 1 - 1/2^{w+1}.
For the matched example given in Section 3.1 the efficiency is η=0.914, whereas for the unmatched memory in Section 3.2 it is η=0.997. In comparison, for in-order access the highest efficiency is obtained for s=0 (to have conflict-free access for odd strides). This results in η=0.4 for the matched memory case and η=0.84 for the unmatched one.
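The fraction f of conflict-free strides for a window from x = 0 to x = w follows from summing the family fractions 1/2^{x+1} as a geometric series. The evaluation below uses w = 4 and w = 9, the windows of the matched and unmatched examples with L = 128 and T = 8; note that f counts strides only and is a different quantity from the efficiency η.

```latex
\begin{align*}
f &= \sum_{x=0}^{w} \frac{1}{2^{x+1}} = 1 - \frac{1}{2^{w+1}},\\
w = 4:&\quad f = 1 - \tfrac{1}{32} = 0.96875,\\
w = 9:&\quad f = 1 - \tfrac{1}{1024} \approx 0.99902.
\end{align*}
```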
If for a given application, the actual frequency of different strides is known, the efficiency can be computed by weighting the corresponding number of accesses per cycle. Some measurements of frequency are reported in [4] .
Complexity of the hardware: address calculation, buffers, arbitration, and register addressing.
As shown in Figure 8 , the address calculation for conflict-free access in out-of-order mode requires two address generators instead of one for the standard ordered scheme. Moreover, it requires a buffer of size 2⋅T, a queue to store the temporal distribution of the first substream and an arbiter to issue the remaining substreams in the same order. Finally, a controller is required to perform the address calculations in the required order. We estimate that the hardware cost of these components is a minor part of the cost of the whole processor.
To achieve the required throughput, it might be necessary to pipeline the adder (this would also be needed in the standard ordered access and the latency of the adder would also have to be added to the latency of the stream access, unless several stream accesses are chained together). Moreover, the usual techniques have to be used to eliminate the dependencies in the calculation of successive addresses.
In addition to the buffers mentioned before, no additional ones are needed. This is in contrast with other proposals that include a significant number of buffers to eliminate the effect of an unsuitable temporal distribution.
To support the out-of-order access, elements of the local storage have to be addressed out of order. Consequently, this local storage has to be of the random-access type, whereas for ordered access a FIFO organization is adequate.
D) Efficiency of more memory modules.
As has been seen in Section 3.2, the addition of memory modules to make the memory unmatched increases the number of families of strides that are conflict free. However, this is obtained at a large expense because to double the number of conflict-free families it is necessary to square the number of memory modules. This is aggravated by the fact that the added families contain fewer strides and that these strides are probably less frequently used.
Of course, the addition of memory modules can be justified by other reasons, such as simultaneous access to several streams and non-regular data access.
E) Possibility of chaining of LOAD and EXECUTE.
As mentioned in the introduction, the complicated timing produced when in-order access is coupled with buffers makes it impractical to chain two instructions if one of the operands of the second is being obtained with a LOAD. In contrast, the proposed scheme produces one stream element each cycle in a deterministic order (for conflict-free strides). Consequently, chaining is possible if the first instruction is executed using the same order of elements as the LOAD. Note that the sequence of local store addresses is produced anyhow as part of the LOAD.
F) Conflict-free families and stream length.
For unmatched memory, the access in order produces at most t+1 conflict-free families for any stream length (for m = 2⋅t). In contrast, the scheme we propose produces only two conflict-free families for any stream length, but increases to 2⋅(λ-t+1) the number of conflict-free families for streams of length L = 2^λ.
Conclusions
In this paper we have considered the access of streams of fixed length, equal to the length of a vector register in the case of a vector processor, or of a fixed portion of the local memory (for block transfers). The access patterns correspond to constant strides and the stream can begin in any address. The basic idea we propose is an out-of-order access of the elements of the stream to achieve conflict-free access for all strides that produce T-matched streams.
We present an address mapping and an out-of-order access method to achieve conflict-free access for a large number of families. We first consider the matched memory case, where M=T. In this case, we obtain a window of λ-t+1 families of strides that are conflict free, whereas previous schemes that perform the access in order result in a single conflict-free family (for streams of any length). To achieve this, we divide the stream in substreams which are accessed in a conflict-free manner. This by itself does not produce conflict-free access to the whole stream, although the added latency is low. To achieve the conflict-free access to the whole stream we propose that an additional set of T addresses be precalculated and latched, so that the temporal distribution of all substreams is the same.
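The division into substreams can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper's mapping: the module number is read from address bits s to s+t-1 (a block-interleaved stand-in for the linear transformation), a stride S = σ⋅2^x with σ odd belongs to family x, and L is a power of two. The names `family`, `module`, `substreams`, `substreams_conflict_free`, and `in_order_conflict_free` are hypothetical. The check covers only the property that each substream of T elements touches T distinct modules; it does not model the temporal distribution across substreams, which is what the precalculated set of T addresses is meant to equalize.

```python
def family(stride):
    """Return x such that stride = sigma * 2**x with sigma odd (assumed definition)."""
    assert stride > 0
    x = 0
    while stride % 2 == 0:
        stride //= 2
        x += 1
    return x


def module(addr, s, t):
    """Assumed mapping: module number = address bits s .. s+t-1."""
    return (addr >> s) & ((1 << t) - 1)


def substreams(L, T, step):
    """Split element indices 0..L-1 into substreams of T elements spaced by `step`."""
    block = step * T
    assert L % block == 0
    return [[base + off + k * step for k in range(T)]
            for base in range(0, L, block)
            for off in range(step)]


def substreams_conflict_free(A, S, L, t, s):
    """True if every substream of the reordered access hits T distinct modules."""
    T = 1 << t
    x = family(S)
    if x > s or (T << (s - x)) > L:
        return False  # stride family lies outside the window for this L
    step = 1 << (s - x)  # index spacing that turns the address step into sigma * 2**s
    return all(len({module(A + i * S, s, t) for i in sub}) == T
               for sub in substreams(L, T, step))


def in_order_conflict_free(A, S, L, t, s):
    """True if every T consecutive in-order requests hit T distinct modules."""
    T = 1 << t
    mods = [module(A + i * S, s, t) for i in range(L)]
    return all(len(set(mods[i:i + T])) == T for i in range(L - T + 1))


if __name__ == "__main__":
    L, t, s = 128, 3, 4  # lambda = 7, T = 8, assumed bit position s = 4
    for x in range(8):
        S = 1 << x
        print(x, in_order_conflict_free(0, S, L, t, s),
              substreams_conflict_free(0, S, L, t, s))
```

Under these assumptions, with L = 128 and T = 8, the in-order check passes only for family x = s, whereas the substream check passes for the λ-t+1 = 5 families x = s-4, ..., s.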
We then present an extension of this scheme for the unmatched memory system. The case for M=T^2 (m=2⋅t) has been considered, resulting in 2⋅(λ-t)+1 conflict-free families; this compares favorably to the t+1 conflict-free families obtained using access in order. The method can be applied to memory systems with any degree of unmatchness (m=k⋅t+k'). As stated in Section 3.2.4, more conflict-free families can be obtained at a cost of complicating somewhat the substream generation for the additional families.
The ideas have been presented using an address mapping based on linear transformations. However, the same results can be achieved with block interleaving or with skewing. In these cases, we have to select, in a suitable manner, the bits that determine the module number in the interleaved case, and the number of rows to rotate for skewing. The difference between these schemes is the behavior for streams of length smaller than L and for streams that cannot be accessed in a conflict-free way.
We propose the use of two address generators. This allows us to achieve the minimum latency of T+L+1 cycles for a stream read of length L. In general, the hardware required for address calculation and request control is somewhat more complex than that required for in-order access; however, this added complexity should be negligible compared to that of the whole processor.
We plan to extend this work to the case in which several streams are accessed simultaneously, either in a single processor with several memory ports or in a SIMD vector multiprocessor. We also will explore further the use of these techniques for scalar processors with decoupled memory access and execute units.
