Memory interleaving considerably increases memory bandwidth in vector processor systems. The concurrent operation of the processors can produce memory bank conflicts and hence alter the memory bandwidth. Total or steady state performance for vector operations in a memory system is studied. Many methods of resolving memory bank conflicts are proposed and compared. Analytical results on the resulting effective bandwidth are presented for one of them and the others are described by exhaustive simulations. Some nonintuitive results are obtained on how conflicts depend on the size of the architecture, the number, the stride and the length of the vectors, and the register length assigned by each processor to vector components.
Introduction
As the speed of CPUs increases, memory latency becomes the significant bottleneck of computer performance. One class of multiprocessors, the tightly-coupled multiprocessors, are designed with a global shared memory in order to solve data transfer problems. Designers must find for them the right relationship between the memory bandwidth (number of memory accesses per cycle) and the processor bandwidth (number of requests per cycle) in order to optimize the use of memory cycles, so that a maximal memory bandwidth is balanced with a minimal latency for requests. The main factor which decreases the memory bandwidth is memory bank contention. Thus, when designing tightly coupled MIMD computers, efficient hardware management of memory bank contention is a crucial issue. However, the problem is not easy because the memory contention depends on the rules of arbitration of conflicts and on a large number of architectural parameters: number of processors, number of banks, register length, latency between the loading of the registers, etc. The influence of these parameters is not well understood and is nonintuitive. For example, it is commonly thought that increasing the size of the architecture (the number of processors and memory banks) increases the memory contention, but the specific influence of the number of processors and the number of banks is not clear.
One difficulty encountered in the analysis of memory contention is how to choose a representation of the sequences of addresses generated by the programs. Most of the time, the accesses to the memory banks are assumed to be random, mainly because of the mathematical tractability of this model. In this case, exact or approximated models have been analyzed in [2], [1] (see also [3], [6], [9], [10], [11]). However, in the context of scientific computation, the hypothesis of independence of successive addresses is very unlikely (except perhaps in sparse computation). In this case, the programs are frequently based on the execution of loops, generating accesses to arrays referenced by a linear index. These loops generate a sequence of references to the successive components of vectors stored in contiguous memory banks (vectors of stride one) or in equally spaced memory banks (vectors of stride greater than one). Although this is a restricted model, these regular accesses are basic and represent a wide class of loops in programs. However, there has been little research on these regular memory access patterns. The rare studies on such vector operations have been done for the CRAY X-MP memory system by Cheung and Smith ([5]) and Oed and Lange ([7], [8]), but only in the case where the vector lengths are less than the register length. They show that the contention depends strongly on the vector starting addresses. Hence they consider exhaustively all the possible vector patterns. The complexity of this combinatorial problem explodes with the number of vectors, so that their study is limited to two vectors for the analytical results and three vectors for the simulation results.
In this work we analyze regular reference streams generated by operations on vectors (of the same length with different starting addresses) where the vector lengths can be greater than the register length. The target architectures are MIMD tightly coupled processors with one port per processor. The size of the architecture in this study is sufficiently small (less than 16 processors and 16 memory banks), so that a crossbar network can be used as the interconnection network, resulting in no conflicts except the memory bank conflicts. An example of such an architecture is the ALLIANT FX/8, developed by the University of Illinois (Urbana-Champaign) with eight processors and a shared cache divided into four banks. With the reference streams described above, different types of conflict management are considered. The static priority protocol, for which processor $i$ has priority over processor $i'$ if $i < i'$ in the case of a conflict between $i$ and $i'$, was implemented in a version of the ALLIANT FX/8 to dynamically resolve conflicts. This conflict resolution scheme always penalizes the same processor. In this paper, we propose other schemes to reduce this problem and hence to improve memory performance. These include priority policies and a queueing policy. The priority policies are variants of the static priority, with the priority depending on time and even on the memory address of the request: cyclic priority, as between the two processors of the CRAY X-MP, rotation and conflict priorities.
The purpose of this paper is to evaluate and to compare the performance of the different conflict resolution schemes and to study its evolution when architectural parameters are modified: the size of the architecture and the length of the registers assigned by each processor to components of each vector. The conflict rate is the proportion of the total execution time wasted in latency. This is the performance measure describing the impact of the conflicts.
The first way to have an idea of the performance is to perform simulations: we use an exhaustive technique, where all the vector patterns are simulated, as in [5]. The simulation results, obtained by a simulator called MEVAMP, are presented in graphical form. They cover the cases of practical interest and allow us to compare the different policies and to study the influence of the parameters of the architecture.
One of the policies, the rotation priority, is analyzed analytically. It yields results for a larger range of parameters. Moreover, according to the simulation results mentioned above, this policy is quite efficient compared to the other policies. Its analytical study is therefore of special interest. We obtain the analytical expressions for the delay due to conflicts and the mean delay when the starting addresses and possible durations to load registers that create conflicts are randomly chosen. Upper and lower bounds are derived for the mean conflict rate, which give a good idea of the influence of the size of the architecture. This gives us some limits on the scalability of the architecture. For example, we prove that, under a certain condition on the bandwidths, it is better to have $2n$ processors and $n$ banks than $n$ processors and $2n$ banks.
In Section 2, the model of the architecture is described. Sections 3 and 4 deal with the analytical results. In Section 5, we discuss how the simulations were performed and present the results. Section 6 contains some extensions of the simulation results, including non unit strides. The technical proofs are given in the appendix.
The model analyzed in this paper is an architecture of $n_p$ processors indexed by $i$ ($0 \le i \le n_p - 1$) which share a memory divided into $n_b$ directly accessible banks indexed by $j$ ($0 \le j \le n_b - 1$) (see Fig. 1). The interconnection network is a crossbar network, so no conflicts between requests are introduced by the network. Thus all the conflicts are bank conflicts.
The memory system.
The memory banks are directly accessible by the processors. A successful request to a bank occupies that bank during a fixed number of cycles, denoted $t_b$, called the memory occupation time. Two or more processors may request the same bank simultaneously, resulting in conflicts on the bank. When such a conflict occurs, the time required to receive the requested data will be the associated delay, denoted $t_w$, plus the memory occupation time. The delay $t_w$ depends on the conflicts with other possible requests for data stored in the same bank. A processor has the capability of delaying a request if it cannot access a bank because of some conflict. Conflicts are resolved dynamically, i.e. all the requests that cannot be satisfied will be delayed for one cycle by their processors, as will all the subsequent requests of that processor.
There are two types of conflicts. A bank busy conflict occurs when a bank is still busy when a processor requests it, causing the request to fail. A simultaneous bank conflict occurs when two or more processors request the same idle bank at the same time. Arbitration is needed in this case. Except for the queue discipline described later, the principle of arbitration of simultaneous bank conflicts is to assign a number, called the priority number and denoted by $P(i,t)$, between $0$ and $n_p - 1$ to each processor $i$ by a one-to-one mapping $P$, possibly depending on time $t$. When a simultaneous bank conflict occurs, the processor with the lowest number has access to the bank while the other processors resubmit their requests at the next cycle. At time 0, $P$ is arbitrarily defined by $P(i,0) = i$ ($0 \le i \le n_p - 1$). A variety of different priority assignments are proposed below.
1) The static priority
The first one, the static priority, does not depend on time. Processor $i$ has priority over processor $i'$ if $i < i'$. Note that $P$ is defined by $P(i,t) = i$ ($0 \le i \le n_p - 1$, $t \ge 0$).
This implies that the priority numbers of all the other processors can be defined from the priority number of a single one by the relation
$$P(i,t) = (P(i-1,t) + 1) \bmod n_p \qquad (t \ge 0,\ 0 \le i \le n_p - 1). \qquad (1)$$
This relation will be assumed to be true for the cyclic, rotation and conflict priorities defined later. It is sufficient to know the processor having the highest priority at any given time in order to know the priority of all the processors.
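As an illustration of this remark, the following minimal sketch derives every priority number from the index of the processor currently holding priority 0. The names `priority_numbers`, `arbitrate` and `leader` are hypothetical and are not taken from the paper.

```python
def priority_numbers(leader: int, n_p: int) -> list[int]:
    """Priority number of each processor when `leader` holds priority 0.

    Consecutive processors get consecutive numbers modulo n_p, which is
    the cyclic relation assumed for all the priority policies.
    """
    return [(i - leader) % n_p for i in range(n_p)]


def arbitrate(requesters: list[int], leader: int, n_p: int) -> int:
    """Return the requester with the lowest priority number."""
    prio = priority_numbers(leader, n_p)
    return min(requesters, key=lambda i: prio[i])


# Static priority corresponds to leader = 0 at all times.
static = priority_numbers(0, 4)
```

With `leader = 0` the function returns the identity assignment, i.e. the static priority; the other policies differ only in how the leader changes over time.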
Priorities of this type were used in an earlier version of the ALLIANT FX/8 and manage memory contention between the ports of a CPU in the CRAY X-MP.
Note that with the static priority assignment, processor 0 always ends first.
2) The cyclic priority
To promote fairness between processors, another policy is to change the priority numbers cyclically. Every $c$ cycles, if processor $i$ had the highest priority, processor $i+1$ takes it. For the cyclic priority of period $c$, $P$ is defined by $P(i,t) = i - [t/c]$ ($t \ge 0$, $0 \le i \le n_p - 1$), where $[t/c]$ denotes the integer part of $t/c$ and priority numbers are taken modulo $n_p$. As before, this implies the relation (1).
Alternating priority of this type is used in the CRAY X-MP ([5]) between the two CPUs. With this priority assignment, $P(i,t)$ does not depend on conflicts before $t$.
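A minimal sketch of the cyclic assignment follows. The reduction modulo $n_p$ is an assumption made here so that priority numbers stay between $0$ and $n_p - 1$; the function name is hypothetical.

```python
def cyclic_priority(i: int, t: int, c: int, n_p: int) -> int:
    """Priority number of processor i at cycle t for a cyclic priority of period c.

    The highest priority (number 0) moves from processor k to processor
    k + 1 every c cycles. Reduction modulo n_p is an assumption of this sketch.
    """
    return (i - t // c) % n_p


# Priority numbers of 4 processors at t = 0 and after one period of c = 2 cycles.
at_start = [cyclic_priority(i, 0, 2, 4) for i in range(4)]
after_one_period = [cyclic_priority(i, 2, 2, 4) for i in range(4)]
```

After one period, processor 1 holds priority 0, which matches the description that processor $i+1$ takes over from processor $i$.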
3) The rotation priority
In order to avoid giving priority to a processor which has just completed an access, another type of arbitration is interesting, in which the map $P$ depends on the banks. This is also a priority assignment respecting the relation (1). Moreover, at a given bank, after an access corresponding to a request of processor $i$ has completed, the priority moves and processor $i+1$ becomes the processor of the highest priority. Hence, for this scheme, called the rotation priority, the priority number of processor $i$ at bank $j$ at time $t$ is given by
$$P(i,j,t) = P(i,j,T_{n_j}^-) - 1 \qquad (T_{n_j} \le t < T_{n_j+1},\ 0 \le i \le n_p - 1,\ 0 \le j \le n_b - 1),$$
where $T_{n_j}$ is the end of the $n_j$-th access at bank $j$ and $P(i,j,T_{n_j}^-)$ denotes the value of $P(i,j,t)$ just before $T_{n_j}$.
4) The conflict priority
The conflict priority is similar to the rotation priority except that the priority changes at the end of an access only if there was a simultaneous bank conflict for the access ($t_b$ cycles earlier). Thus, $P$ is defined by
$$P(i,j,t) = P(i,j,T_{n_j}'^-) - 1 \qquad (T'_{n_j} \le t < T'_{n_j+1},\ 0 \le i \le n_p - 1,\ 0 \le j \le n_b - 1),$$
where $T'_{n_j}$ is the end of the $n_j$-th access at bank $j$ after a simultaneous conflict and $P(i,j,T_{n_j}'^-)$ denotes the value of $P(i,j,t)$ just before $T'_{n_j}$.
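The two bank-dependent policies can be sketched with one small per-bank state. This is only an illustration under stated assumptions: the class `BankPriority` and its methods are hypothetical, priority numbers are reduced modulo $n_p$, and decreasing every priority number by one at a bank is implemented as advancing that bank's highest-priority processor by one.

```python
class BankPriority:
    """Per-bank priority state for the rotation and conflict priorities."""

    def __init__(self, n_p: int, n_b: int):
        self.n_p = n_p
        # Processor holding priority 0 at each bank; P(i, j, 0) = i initially.
        self.leader = [0] * n_b

    def priority(self, i: int, j: int) -> int:
        """Priority number of processor i at bank j."""
        return (i - self.leader[j]) % self.n_p

    def access_completed(self, j: int, had_simultaneous_conflict: bool,
                         policy: str = "rotation") -> None:
        """Update bank j at the end of an access.

        rotation: always shift the priority numbers by one.
        conflict: shift only if the access followed a simultaneous conflict.
        """
        if policy == "rotation" or had_simultaneous_conflict:
            self.leader[j] = (self.leader[j] + 1) % self.n_p


bank_state = BankPriority(n_p=4, n_b=2)
bank_state.access_completed(0, had_simultaneous_conflict=False)
```

Only bank 0 is affected by the update above; bank 1 keeps its initial assignment, which is the sense in which the map $P$ depends on the banks.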
5) The queue discipline
The last policy we consider, called queue discipline, is quite different in the sense that the priority of the request depends on its request time. Each bank has a queue and if a request is not accepted at once, the processor does not delay it until the next cycle but the request waits in the queue of the bank. The queueing discipline at each queue is assumed to be first-in-first-out, but requests which arrive simultaneously at the same queue are served according to the static priority defined above.
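A minimal sketch of one bank queue under this discipline is given below. Ordering requests by the pair (arrival cycle, processor index) gives first-in-first-out service with ties broken by the static priority. The class and method names are hypothetical.

```python
import heapq


class BankQueue:
    """FIFO queue of one bank, ties broken by static priority."""

    def __init__(self):
        self._heap = []

    def submit(self, arrival_cycle: int, processor: int) -> None:
        # Earlier arrival first; on equal arrival, lower processor index first.
        heapq.heappush(self._heap, (arrival_cycle, processor))

    def next_request(self):
        """Return the processor to serve next, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]


queue = BankQueue()
queue.submit(5, 2)
queue.submit(5, 0)
queue.submit(3, 3)
service_order = [queue.next_request() for _ in range(3)]
```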
The processors
Each processor sends its requests to the banks at the following rate: when a request is about to be served, the processor which sent it waits $t_p$ cycles before sending the next request. This delay is called the inter-request time. The time required for requests to cross the network is included in $t_p$. Let the processor bandwidth be the number of requests sent by the multiprocessor per cycle; if there is no conflict, it is $n_p/t_p$. Let the memory bandwidth be the number of memory accesses per cycle; if there are no conflicts, it is $n_b/t_b$. We assume here the following condition on the bandwidths:
$$n_p/t_p = n_b/t_b. \qquad (2)$$
It means that the bandwidths are perfectly balanced: without conflicts, the requests for the memory do not wait and the rate of occupation of the memory is one. Each processor has vector registers of finite length $r$ to store the vector elements during an operation.
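The balance condition on the bandwidths can be checked with exact rational arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the numerical values are an example with eight processors and four banks rather than measured parameters of a real machine.

```python
from fractions import Fraction


def processor_bandwidth(n_p: int, t_p: int) -> Fraction:
    """Requests per cycle issued by n_p processors when there is no conflict."""
    return Fraction(n_p, t_p)


def memory_bandwidth(n_b: int, t_b: int) -> Fraction:
    """Accesses per cycle served by n_b banks when there is no conflict."""
    return Fraction(n_b, t_b)


def is_balanced(n_p: int, t_p: int, n_b: int, t_b: int) -> bool:
    return processor_bandwidth(n_p, t_p) == memory_bandwidth(n_b, t_b)


def balanced_inter_request_time(n_p: int, n_b: int, t_b: int) -> Fraction:
    """Inter-request time t_p that balances the two bandwidths."""
    return Fraction(n_p * t_b, n_b)


# Eight processors and four banks: the processors must issue at half the bank rate.
t_p_example = balanced_inter_request_time(n_p=8, n_b=4, t_b=1)
```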
The load
We will consider a vector multiprocessor executing a vector operation. The load is shared between the processors. It can be assumed without loss of generality that the vector length $L$ (the number of elements in the vector) is a multiple of $n_p r$. If this were not the case, the operations on the remaining elements would be executed at the end with an execution time small enough compared to the total execution time to be negligible. Each vector argument of the operation being performed is divided into $n_p$ subvectors having the same number of consecutive elements. The $i$-th processor executes the operation on the $i$-th subvector of each of the arguments. The vector elements are loaded from the memory into the vector registers of the processor. The register length $r$ is limited so that the access to a subvector is made by slices, also called blocks, of $r$ consecutive elements. The $i$-th processor requests the $n$-th slice of its subvector of each of the arguments before moving on to request the $(n+1)$-th slices. Unless stated otherwise, the stride of the vectors is assumed to be one. Between the requests of two blocks, the processor waits a fixed number of cycles, called the inter-block time. It will be greater than the inter-request time $t_p$, more exactly a multiple of $t_p$, denoted $\lambda t_p$ where $\lambda$ is an integer.
The processors are assumed to be synchronized at the beginning of the execution.
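The order in which a processor visits the banks under this load model can be sketched as follows. The sketch assumes stride one and that the element at address $a$ lies in bank $a \bmod n_b$, with `starts` holding the starting address of each vector; the function name and its parameters are hypothetical.

```python
from typing import Iterator


def bank_sequence(i: int, starts: list[int], L: int,
                  n_p: int, n_b: int, r: int) -> Iterator[int]:
    """Banks requested by processor i, in request order (stride one).

    Each vector is split into n_p subvectors of L // n_p elements, and each
    subvector is read by slices of r elements. The n-th slice of every
    vector is requested before the (n + 1)-th slices.
    """
    assert L % (n_p * r) == 0, "L is assumed to be a multiple of n_p * r"
    sub = L // n_p                      # elements per subvector
    for n in range(sub // r):           # slice index
        for a in starts:                # one slice of each vector argument
            base = a + i * sub + n * r
            for m in range(r):
                yield (base + m) % n_b


# Two vectors of length 8 starting at addresses 0 and 1, seen by processor 1.
example = list(bank_sequence(i=1, starts=[0, 1], L=8, n_p=2, n_b=4, r=2))
```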
3 General analytical results
The following results are independent of the chosen policy. They explain the choice of the range of two random parameters: the vector patterns and the inter-block time. Before them, let us introduce the main time metrics which will be used later.
The multiprocessor system executes vector operations as described in Section 2. Let $T_0$ be the total execution time for a vector operation without conflicts. This is the minimum value that the total execution time can attain. Given $T_0$, the total execution time $T$ is characterized by the total delay $R$ which must be added to $T_0$ to obtain $T$.
Definition Let the conflict rate, denoted $t_c$, be $t_c = R/T$, where $T$ denotes the total execution time, $R = T - T_0$ the total delay due to conflicts, and $T_0$ the total execution time without conflict.
It is easy to obtain an analytical expression for $T_0$.
Proposition 1 The total execution time without conflict can be expressed as
$$T_0 = t_b + (r-1)t_p + \left(\frac{vL}{n_p r} - 1\right)(\lambda + r - 1)\,t_p, \qquad (3)$$
where $v$ is the number of vectors of length $L$, $r$ is the register length, $\lambda t_p$ is the inter-block time, $n_p$ is the number of processors, $t_p$ is the inter-request time and $t_b$ is the memory occupation time.
Proof
To execute the vector operation on $v$ vectors of length $L$, each processor has to request $vL/n_p$ vector components, i.e. to perform $vL/(n_p r)$ vector register stores. The time required for one such store, without conflict, is $(\lambda + r - 1)t_p$, where $\lambda t_p$ is the inter-block time, except for the first one which is $t_b + (r-1)t_p$. This gives the result.
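The counting argument of the proof translates directly into a small computation. The sketch below adds up one store of $t_b + (r-1)t_p$ cycles and the remaining stores of $(\lambda + r - 1)t_p$ cycles each, and derives the conflict rate from a measured total time. The function names are hypothetical, and `lam` stands for the integer multiplier of $t_p$ in the inter-block time.

```python
def conflict_free_time(v: int, L: int, n_p: int, r: int,
                       lam: int, t_p: int, t_b: int) -> int:
    """Total execution time without conflict, following the proof's counting."""
    assert L % (n_p * r) == 0, "L is assumed to be a multiple of n_p * r"
    stores = v * L // (n_p * r)          # register stores per processor
    return t_b + (r - 1) * t_p + (stores - 1) * (lam + r - 1) * t_p


def conflict_rate(T: int, T0: int) -> float:
    """Proportion of the total execution time T lost to conflicts."""
    return (T - T0) / T


# Two vectors of length 48 on 4 processors with registers of length 4.
T0 = conflict_free_time(v=2, L=48, n_p=4, r=4, lam=2, t_p=1, t_b=1)
```

A total delay of $R$ cycles then gives a conflict rate of `conflict_rate(T0 + R, T0)`.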
In the analysis, the time evolution of the system can be described by the following time process: $(W(t) = (W_i(t),\ 0 \le i \le n_p - 1),\ t \in \mathbb{N})$, where $W_i(t) = (w_j,\ 0 \le j \le n_b - 1)$ and $w_j$ is the nonnegative memory occupation time which remains at time $t$ for processor $i$ at bank $j$. This is called the residual occupation time process. The process $W(t)$ is periodic from a certain instant to the completion of the first processor because $(W(t),\ t \in \mathbb{N})$ is a Markov chain on a finite set with deterministic transitions. Thus, the evolution of the execution is divided into three distinct phases: a startup phase, a periodic phase called stationary and a completion phase. In this context, a desirable solution would be a conflict resolution scheme with a small startup phase and a stationary phase without conflicts. We will prove in Section 4 that this is not the case using the rotation priority, and counterexamples can easily be found using the static priority for some possible starting addresses and inter-block times.
The starting memory addresses.
However, for a special vector pattern, the total delay is minimum for all conflict resolution schemes.
Remark 1 When the first component of each vector is stored in the same bank, for all conflict resolution schemes and for any register length which is a multiple of $n_b$, the total delay is $R = (n_p - 1)t_b$, independent of the inter-block time. Note that $(n_p - 1)t_b$ is the time necessary for processor $n_p - 1$ to have access to the first bank, because all the processors request the same bank at time 0, due to the fact that $n_p r$ divides $L$ and $n_b$ divides $r$.
Proof.
The key fact is that, if the vectors have the same starting addresses, for example 0, each processor has the same sequence of addresses to access: $0, 1, \ldots, (r-1) \bmod n_b, \ldots, 0, 1, \ldots, (r-1) \bmod n_b, r \bmod n_b, \ldots$ If $n_b$ divides $r$ then $(r-1) \bmod n_b$ is $n_b - 1$ and each processor has access to the banks cyclically. After an initial delay of $i\,t_b$ for processor $i$ ($0 \le i < n_p$) to access bank 0, the processors request data in each bank cyclically without any conflict. This is true for any value of the inter-block time and for any conflict resolution scheme.
In the previous case of vector pattern, the stationary conflict rate is zero. Let us present cases where the conflict rate is high even for relatively small architectures. We take this example because it is commonly thought that the degradation due to conflicts increases with the size of the architecture. We consider a 4-processor, 4-bank computer using two different conflict resolution schemes. For a register length of 4 and an inter-block time of $2t_p$, the conflict rate using the static priority can reach 33% for some vector starting addresses (see Figure 2). The rotation priority is also bad for small values of register lengths and inter-block times: when the register length is 4 and the inter-block time is $t_p$, the conflict rate reaches 33% for some vector starting addresses (see Figure 2). This shows that the performance of the memory is highly dependent on the starting memory addresses of the vectors and that it is necessary to consider exhaustively all the vector patterns in a study of the conflicts.
The inter-block time.
The latency between the loads of two blocks is random. The influence of its value on conflicts must be taken into account. First, remark the following.
Remark 2 For all conflict resolution schemes, if the inter-block time $\lambda t_p$ satisfies $\lambda \ge n_b$, then for every $r$ which is a multiple of $n_b$ and all possible vector starting addresses, the total delay is $R = (n_p - 1)t_b$.
Proof.
As $L$ is evenly divisible by $n_p r$ and $r$ is evenly divisible by $n_b$, each processor requests the same bank at time 0. As mentioned above in Remark 1, for every policy, the delay experienced by processor $n_p - 1$ when it first accesses this bank is $(n_p - 1)t_b$. However, the processors experience no delays for the remaining addresses of the first block. By the time that processor 0 is ready to make its first request in the second block, all other processors will have finished their first block of requests. To see this, remember that processor 0 makes its first request in the second block at time $(r-1)t_p + \lambda t_p$, whereas all processors finish their first block requests at time $n_p t_b + (r-1)t_p$. Since $\lambda \ge n_b$ and $n_b t_p = n_p t_b$, we have $\lambda t_p \ge n_p t_b$. Hence, at the time of the first request of the second block, no other processors are occupying the memory banks, so no waiting occurs. Thereafter, processors access the banks successively.
Remark 2 proves that a solution to avoid conflicts would be to force the inter-block time to be at least $n_b t_p$. However, this inter-block time could be larger than the delay due to conflicts, and the solution would not be attractive. We will see in Section 4 that, for the rotation priority, the mean delay due to conflicts, taking an inter-block time value at random between $t_p$ and $(n_b - 1)t_p$, is, for large vector lengths, $v s t_p (n_b + 1)/6$ with $s = L/(n_p r)$, while the delay due to an inter-block time $n_b t_p$ is $v s t_p n_b$, which is nearly six times larger. For this reason, we will study the memory performance when the inter-block time value is between $t_p$ and $(n_b - 1)t_p$, which are all the values creating conflicts.
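The factor "nearly six" can be made explicit by dividing the two delays quoted above. The numerical values are simple evaluations for the architecture sizes considered in this study.

```latex
\[
\frac{v\,s\,t_p\,n_b}{v\,s\,t_p\,(n_b+1)/6} \;=\; \frac{6\,n_b}{n_b+1},
\qquad
\frac{6\,n_b}{n_b+1} =
\begin{cases}
4.8 & n_b = 4,\\[2pt]
16/3 \approx 5.33 & n_b = 8,\\[2pt]
96/17 \approx 5.65 & n_b = 16.
\end{cases}
\]
```

The ratio increases with $n_b$ and tends to 6 as the number of banks grows.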
Analysis of the rotation priority
An analysis of the rotation priority is presented in this section. The aim is to calculate the total delay due to conflicts as a function of the characteristics of both the architecture and the reference streams and to use it to study the influence of the numerous parameters.
It is easy to see (cf. the appendix) that the startup phase has $(n_p - 1)t_b$ cycles and that the stationary phase has at least $s - 1 = L/(n_p r) - 1$ full periods of $v(r - 1 + \lambda)t_p + R_{stat}$ cycles, where $\lambda t_p$ is the inter-block time and $R_{stat}$ is the delay due to conflicts for a period of the periodic phase. Note that a period is the time necessary for each processor to store all the vector registers once. The completion phase has $(n_p - 1)t_b$ cycles. The total delay due to conflicts $R(\lambda, p)$ is then given as a sum in the following proposition. The first term is the delay during the startup phase, the following terms are the delays during a single period of the stationary phase, and the last term is the delay during the completion phase. The proof is given in the Appendix. In fact, we have given this delay as a function of the two random parameters: the banks of the first component of the vectors and the inter-block time. On one hand, the addresses of the first component of the vectors satisfy very simple assumptions: they are independent and uniformly distributed across the banks. On the other hand, the latency $\lambda t_p$ between the loading of the registers is not a constant in the system. In order to study conflicts, we assume it is distributed randomly over all possible values which cause conflicts. Remark 2 in Section 3 proved that conflicts occur only when $\lambda$ is between $1$ and $n_b - 1$.
The aim of the two following results is to give statistics on the total delay: average, best and worst cases. 
Proof.
The first assertion, which is contained in Remark 2, follows immediately from Proposition 2.
As $a_0, a_1, \ldots, a_v$ are independent and uniformly distributed random variables on the set of the banks, it is easy to see that the relative addresses [...]
Proof. The minimum follows readily from the expression for $R(\lambda, p)$.
For the maximum of $R(\lambda, p)$, we can prove the result using an argument of convexity.
Note that the expression of $R(\lambda, 0, \ldots, 0)$ is a consequence of Remark 1. If $n_b$ does not divide $r$, this is no longer true.
This analysis is now used to study the influence of the different architectural parameters, in order to guide the design of vector machines. We will consider the register length and the size of the architecture.
Influence of the register length
First we discuss the influence of the register length, which appears to be a key parameter. We present a monotonicity result, which suggests that large register lengths are preferable.
Let us write, from formula (3), the mean value of $T_0$ when the inter-block multiplier $\lambda$ is randomly chosen in $\{1, \ldots, n_b - 1\}$, and rewrite formula (5).
We have just noted that the mean conflict rate $E t_c$ can be expressed as a summation $\sum_{\lambda, p} t_c(\lambda, p)$, which is a complicated expression. To derive further results on the influence of the parameters, it is desirable to have more explicit formulas for $E t_c$, even if they are asymptotic, for example for large $L$. The next proposition gives such results in the case where $v = 2$.
The precision of these bounds is given by $M - m$. See the proof in the Appendix. The main result of the analysis is that the conflict rate increases linearly with the number of memory banks and decreases with the register length as $1/r$, in the limit as $L$ approaches $+\infty$. The bounds are useful to estimate the accuracy of the first term of the expansion.
Remark.
The quantity $(M - m)/2$ is, to first order, a decreasing function of $r$, independent of $n_b$, and is less than 1% when $r \ge 33$. Hence these asymptotic functions are interesting if $r$ is sufficiently large, because $(M - m)/2$ is then not too big (see Fig. 3).
[Fig. 3: caption refers to relations (10) and (11).]
The threshold register length, i.e. the smallest $r$ such that $E t_c$ is less than a given value, is interesting from a practical point of view.
Influence of the size of the architecture
Using Proposition 6, let us examine the influence of the architecture size. $L$ is assumed to be large so that $ET$ and $E t_c$ are reduced to their asymptotic forms given by relations (6) and (7). The register length is fixed. Hence $E t_c$ depends only on the number of banks and is a linear increasing function. $ET$ is inversely proportional to the bandwidth $n_p/t_p$ and is a function of the number of banks of the form $n_b$ plus an additive term. These results should be compared to those of Bailey [1], which give a performance decreasing as the square of the number of memory banks. Therefore, with an interconnection network of the same size $2n^2$, a $2n$-processor, $n$-bank computer has better performance than an $n$-processor, $2n$-bank computer with the same memory bandwidth.
The other priorities
The other conflict resolution schemes appear to be more difficult to study analytically than the rotation priority, because the delay is no longer the same for each processor. Hence, simulations are used to study exhaustively all vector patterns. The simulator, which is called MEVAMP and written in C, is based on an explicit description of the behavior of the system as a function of time. It determines the total execution time and the conflict rate at the end for every set of vector starting addresses. The principle is to construct a time dependent process describing the state of the system big enough to be a Markov chain: the state of the process at cycle $t$ depends only on the state of the process at cycle $t - 1$. The process used here is $W(t)$, which was introduced in Section 3. Moreover this evolution process has the following attractive property: given the vector starting addresses, the evolution process is deterministic, that is, the transitions are deterministic. Hence it is easy to simulate it exactly.
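To make the idea of a deterministic cycle-by-cycle simulation concrete, here is a minimal sketch for the static priority only. It is not MEVAMP: the function name, its parameters and its state variables are hypothetical, and it assumes stride one and a register length that is a multiple of the number of banks, so that every slice of a given vector starts in the same bank for every processor. `lam` is the integer multiplier of $t_p$ in the inter-block time and `s` the number of slices per subvector.

```python
def simulate_static(n_p, n_b, t_b, t_p, r, lam, starts, s):
    """Total execution time under static priority (toy model, not MEVAMP).

    starts[k] is the bank holding the first component of vector k.
    Assumes n_b divides r, so every slice of vector k starts in starts[k].
    """
    assert r % n_b == 0
    # Bank requested at each step; the same sequence for every processor.
    banks = [(a + m) % n_b for _ in range(s) for a in starts for m in range(r)]
    total = len(banks)
    done = [0] * n_p      # requests already granted, per processor
    ready = [0] * n_p     # first cycle at which the next request may be sent
    free_at = [0] * n_b   # first cycle at which each bank is idle again
    finish = 0
    t = 0
    while min(done) < total:
        for i in range(n_p):              # lower index = higher priority
            if done[i] == total or ready[i] > t:
                continue
            j = banks[done[i]]
            if free_at[j] > t:
                continue                  # conflict: retry at the next cycle
            free_at[j] = t + t_b          # the bank is occupied for t_b cycles
            done[i] += 1
            end_of_block = done[i] % r == 0
            ready[i] = t + (lam * t_p if end_of_block else t_p)
            finish = max(finish, t + t_b)
        t += 1
    return finish


# All vectors start in bank 0 on a 4-processor, 4-bank machine.
T = simulate_static(n_p=4, n_b=4, t_b=1, t_p=1, r=4, lam=2, starts=[0, 0], s=3)
```

In this example the conflict-free time obtained by the counting argument of Section 3 is 29 cycles, so the simulated delay is $T - 29 = 3 = (n_p - 1)t_b$ cycles, in agreement with Remark 1. An exhaustive study would repeat the run for every tuple of starting banks and every inter-block multiplier.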
In this section, we present the results obtained by the simulator for all conflict resolution schemes, except the rotation priority where the analytical expressions are used. These results are presented in graphical form.
In order to obtain useful simulation results, it is crucial to control the range of the different parameters. We present them below.
Framework of the simulations.
The parameters of the vectors are their number, their stride, their length and their starting addresses. The number of vectors is fixed at two, in order to minimize the combinatorial complexity of the system. The stride is one. The vector length is fixed at 8192, which seems large enough to be significant and will be discussed later. All possible starting addresses are explored.
The parameters of an architecture are the number of processors, the number of banks, and the memory occupation time, which determines the inter-request time (by relation (2)).
The simulations give the exact determination of the total execution time for these architectures for every set of vector starting addresses and every inter-block time involving conflicts. We derive the mean conflict rate, denoted $E t_c$ (also called the conflict rate when this is not ambiguous), when the starting addresses and the inter-block time are randomly chosen, and study it as a function of the register length for a given architecture and a given conflict resolution scheme. The results are presented in Figure 5.
Conclusion of the analysis.
For a given architecture, the conflict resolution schemes are compared. For a given conflict resolution scheme, the influence of the architectural parameters is described. The following insights can be derived from a study of Figure 5.
Influence of the conflict resolution scheme.
In comparing the strategies, note that the static priority is not the best: the conflict rate of the static priority is twice the conflict rate of the rotation priority (except for small register lengths, namely 4, 8 and even 16, where the rotation priority has a high conflict rate). For the version of the ALLIANT FX/8 with vector registers of length 32 and a static priority for solving memory conflicts, the conflict rate for accessing two vectors of stride one and length 8192 is 0.035. For the same architecture, the other conflict resolution schemes exhibit smaller conflict rates. The conflict policy gives the best results, as shown in Figure 4.

policy          static    cyclic 1    queue     conflict    rotation
conflict rate   0.0354    0.0262      0.0326    0.0215      0.0260

Fig. 4. Influence of the conflict resolution scheme for an 8-processor, 4-bank computer with register length 32 such as the ALLIANT FX/8.
Secondly, except for the static priority, for large register lengths, the strategies have similar conflict rates. This leads us to examine more closely the influence of the parameters.
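The conflict rates reported in Fig. 4 can be ranked and expressed relative to the static priority with a few lines of code. The numbers are those of Fig. 4; the variable and function names are hypothetical.

```python
# Conflict rates from Fig. 4 (8 processors, 4 banks, register length 32).
rates = {
    "static": 0.0354,
    "cyclic 1": 0.0262,
    "queue": 0.0326,
    "conflict": 0.0215,
    "rotation": 0.0260,
}


def relative_to_static(table: dict) -> dict:
    """Conflict rate of each policy divided by that of the static priority."""
    return {policy: rate / table["static"] for policy, rate in table.items()}


ranking = sorted(rates, key=rates.get)   # best (lowest rate) first
best = ranking[0]
ratios = relative_to_static(rates)
```

For this architecture and register length, every alternative policy has a ratio below one, the conflict priority having the smallest.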
Influence of the register length.
For all policies, the conflict rate is rapidly decreasing as the register length increases for $r$ greater than some value. The reason is that, independent of the conflict resolution scheme, the number of conflicts decreases as the register length grows. For example, the analysis of the rotation priority shows that $ER - (n_p - 1)t_b$ is proportional to the number of conflicts $2s - 1$ (when $n_b$ divides $r$). For the ALLIANT FX/8, when the register length is multiplied by 2 (respectively 4), the conflict rate for accessing two vectors of stride one decreases from 0.035 to 0.034 (respectively 0.020) (see static priority in Figure 5). When three vectors of stride one are accessed, the conflict rate drops from 0.075 to 0.044 (respectively 0.024) when the register lengths are doubled (respectively quadrupled) (see static priority in Figure 7). A second observation concerns the conflict rate for small values of the register length. For small architectures (4-processor, 4-bank and 8-processor, 4-bank), the conflict rate decreases as $r$ increases. However, for larger architectures, there is a maximum on the curve representing the conflict rate as a function of the register length. The bigger the architecture, the larger the peak. Furthermore, the position of the peak moves to the right as the size of the architecture increases. This is difficult to explain. Thirdly, note that the queue discipline is good for any register length: the conflict rate is low for large register lengths and this policy has one of the lowest maxima of the conflict rate, especially for the biggest architectures: it is the best for 16-processor, 8-bank, 8-processor, 16-bank and 16-processor, 16-bank computers. It is an attractive solution for all the register lengths.
Influence of the size of the architecture.
Concerning the influence of the architecture, a behavior similar to that of the rotation priority is also observed for the other strategies: the conflict rate depends especially on the number of banks. The cyclic priority and the queue discipline are good examples (see Figure 5). The architectures are generally such that a $2n$-processor, $n$-bank computer performs better than an $n$-processor, $2n$-bank computer, which itself performs better than a $2n$-processor, $2n$-bank computer. Note the low conflict rate, given large register lengths, for the $2n$-processor, $n$-bank computers. The ALLIANT FX/8 has this type of architecture.
Extension of the results
In this section, some extensions of the simulations with MEVAMP are presented. While the simulations presented in the last section indicated the influence of the register length, the strategies and the size of the architecture on performance, these additional simulation results allow us to study the influence of some other parameters: the vector length, the number of vectors and the vector stride.
The stationary phase
We check whether the conflict rate presented for the vector length 8192 in Section 5 is a good approximation of the stationary conflict rate. It depends on the length of the startup phase, which is calculated by simulations, as is the stationary conflict rate, using the framework presented in Section 5. The values of the length of the startup phase are not presented here, but they show that, except for the rotation priority, for large values of the register length, a vector length of 8192 is not sufficient to characterize the stationary phase or even to attain it. Stationary conflict rates for accessing two vectors of stride one for a variety of policies and architectures are presented in Figure 6. They are generally lower than the conflict rates for a vector length of 8192 (compare Fig. 6 to Fig. 5). Moreover, the stationary conflict rate $\lim_{L \to \infty} t_c(\ ; p)$ can be derived and, from exact expressions for its mean, it is possible to derive bounds and asymptotics for two vectors, as seen in Proposition 7.
The case of three vectors
To execute a vector operation, at least three vectors must be loaded or stored in the memory. It is therefore interesting to study the influence of the number of vectors accessed by the processors on the performance. Unfortunately, the exhaustive simulations for more than three vectors become quite tedious, because the number of possible combinations of the starting addresses grows as $(n_b)^v$. A more limited set of parameters was simulated for three vectors (see Fig. 7).
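The growth in the number of starting-address combinations can be illustrated with a short sketch. It assumes that each of the v vectors may start in any of the n_b banks independently, so that there are n_b^v start patterns. The function name and the parameter values below are hypothetical and are not taken from MEVAMP.

```python
from itertools import product


def start_bank_patterns(n_banks, n_vectors):
    """Enumerate every assignment of a starting bank to each vector.

    Assumption: each vector may start in any bank independently, so there
    are n_banks ** n_vectors patterns. This is an illustrative count only,
    not the enumeration actually used by MEVAMP.
    """
    return product(range(n_banks), repeat=n_vectors)


if __name__ == "__main__":
    # Hypothetical sizes chosen to match the range of bank counts discussed.
    for n_banks in (4, 8, 16):
        for n_vectors in (2, 3, 4):
            count = sum(1 for _ in start_bank_patterns(n_banks, n_vectors))
            print(f"n_b={n_banks:2d}  v={n_vectors}  patterns={count}")
```

Each pattern would require its own simulation run for every policy, architecture and register length, which suggests why the exhaustive approach quickly becomes impractical as v grows.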
For the rotation priority, the mean delay due to one period of the stationary phase is proportional to the number of vectors (see Proposition 7). This is important because, for this policy, the stationary phase is quite representative of the total execution.
For the other policies, for large register lengths, the conflict rate for accessing three vectors is nearly the same as, or even smaller than, the conflict rate for accessing two vectors. However, for small register lengths, the conflict rate is higher for three vectors than for two. Moreover, the peak phenomenon, which was particularly noticeable on large architectures when two vectors were accessed, tends to disappear. The conflict rate becomes a non-increasing function of the register length for most of the architectures studied, except for the largest ones. This shows that large register lengths yield better performance. All of the policies considered, except the static one, exhibit better performance than the rotation priority when the register lengths are large.
The case of vectors of non-unit strides
In order to study the influence of the stride of the vectors, simulations were performed to compute the conflict rate for two vectors of length 8192 and varying strides for every policy and every architecture as a function of the register length. Experiments were performed for the case where the first vector has stride one and the second has stride two, and for the case where both vectors have stride two.
The simulations show that the mean conflict rate is higher than for vectors of stride one: for the architectures studied, it can reach up to 0.50, which we consider to be high.
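One plausible contributor to this higher conflict rate is that a vector of stride two touches fewer banks than a vector of stride one. The sketch below is a minimal illustration, assuming low-order interleaving (address a is stored in bank a mod n_b) and a power-of-two number of banks; the function name is hypothetical and the mapping is an assumption, not a restatement of the memory model used in the simulations.

```python
from math import gcd


def banks_visited(n_banks, stride, start=0):
    """Return the distinct banks touched by a strided vector access.

    Assumption: low-order interleaving, i.e. address a lives in bank
    a % n_banks. The bank sequence repeats after
    n_banks // gcd(n_banks, stride) accesses.
    """
    period = n_banks // gcd(n_banks, stride)
    return sorted({(start + k * stride) % n_banks for k in range(period)})


if __name__ == "__main__":
    # Hypothetical 8-bank memory, comparing the two strides studied.
    for stride in (1, 2):
        banks = banks_visited(8, stride)
        print(f"stride={stride}: {len(banks)} of 8 banks -> {banks}")
```

Under these assumptions a stride-two stream on 8 banks cycles through only 4 of them, so the same number of requests is concentrated on half of the memory.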
Notice that in this case the conflict rate is an increasing function of the register length. Moreover, the influence of this parameter is quite strong. For example, for the ALLIANT FX/8, the mean conflict rate for accessing two vectors of stride two is 0.15 for a register length of 4, 0.24 for a register length of 32, and 0.43 for a register length of 128.
The influence of the architecture is different for small and large register lengths. For small register lengths, the conflict rate is two or three times higher for the biggest architectures studied than for smaller architectures. For large register lengths, the conflict rate increases by 10 to 20 per cent as the size of the architecture increases within the range of sizes studied.
7 Conclusion
The purpose of this paper was the analysis of the regular reference streams that occur in the context of basic vector operations on shared memory multiprocessors with finite vector registers. This allowed us to estimate the degradation of memory bandwidth due to memory conflicts. The effects of these conflicts on memory performance were studied for a variety of conflict resolution schemes, both those that are well known and some new ones presented here.
Analytical results are presented for one conflict resolution scheme, the rotation strategy. For the other conflict resolution schemes, exhaustive simulations were performed for all vector patterns when two vectors are accessed. These results show the influence of a variety of parameters on the conflict rate, our primary performance measure. The two main parameters studied were the choice of conflict resolution scheme and the register length. A good choice of strategy can reduce the conflict rate by a factor of two. In particular, the static strategy exhibits poor performance compared to the other strategies.
Furthermore, we observed that when the register length is large, the conflict rate is considerably reduced for vectors of stride one, which represent the main reference streams. The rotation strategy, analytically studied, exhibits good performance for a large range of register lengths. In this case, the conflict rate is inversely proportional to the register length. As shown by analytical expansions, the conflict rate grows linearly with the number of banks, the only architectural parameter upon which it depends. Our simulations also show that the stationary conflict rate, as the vector length tends to $+\infty$, is less than the conflict rate for a fixed vector length. Owing to the explosion in the number of vector patterns, the entire range of architectural parameters was not simulated for the case where three vectors are accessed simultaneously. However, limited experimentation suggests that the system performs better when three vectors are accessed than when two vectors are accessed simultaneously.
