
Introduction
Typically, as the number of processors increases in a shared memory multiprocessor, memory access times also increase due to contention on a common communication path or at the shared memory. This increased access time limits the additional processing (either the size of the problem or the speed of execution) which can occur as additional processors are allocated to a program. Kendall Square Research introduced the KSR1 system in 1991 [Ken91] with an architecture and distributed memory scheme that makes it possible for average memory access times to remain fairly constant as the number of processors grows. The architecture is based on a ring of rings of processing cells. Each processing cell has a 64-bit microprocessor and its own local memory that is managed as a cache. Up to thirty-two processing cells are connected on a level 0 ring. Up to thirty-four rings may be connected in a level 1 ring, so that the KSR1 may contain as many as 32*34, or 1088, processing cells.
The KSR1 has a shared memory programming environment, but all memory is contained in the caches of the processors. A valid copy of a data item must exist in the local cache of the processor in order to be accessed. Attached to each processor is the ALLCACHE Engine (ACE), which is the distributed mechanism responsible for finding a valid copy of a data item, copying it to a processor's local cache when it is referenced by that processor, and maintaining sequential consistency between caches.
One processor is the designated owner of each data item, but this ownership is not bound to any particular processor. When a processor writes a data item, it first obtains ownership of the data item in its local cache. At the time of writing to an item, all other copies of the item in other processor caches are marked as invalid, but the memory space for that data item may remain allocated. If a processor reads a data item and a valid copy is not available in the local cache of that processor, then a read-only copy of the data item is obtained via the ALLCACHE Engine. Many processors may have a valid read-only copy of a data item in their local cache.
Each subsequent read of the same data item will be to the local copy and will require a minimal amount of time.
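The ownership and invalidation rules above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: it tracks a single data item, ignores the rings and all timing, and the class and method names are hypothetical rather than part of any KSR interface. The state names follow those used later in the paper.

```python
from dataclasses import dataclass, field

INVALID = "invalid"
READ_ONLY = "read-only"
EXCLUSIVE_OWNER = "exclusive owner"
NON_EXCLUSIVE_OWNER = "non-exclusive owner"


@dataclass
class DataItem:
    """Toy model of one shared data item: maps a cell id to its local copy state."""
    states: dict = field(default_factory=dict)

    def write(self, cell):
        # All other copies are marked invalid, but their space stays allocated
        # (the dictionary entry is kept). The writer becomes the owner.
        for other in self.states:
            self.states[other] = INVALID
        self.states[cell] = EXCLUSIVE_OWNER

    def read(self, cell):
        """Return True for a local hit, False when a copy had to be fetched."""
        if self.states.get(cell, INVALID) != INVALID:
            return True
        # Miss: a read-only copy is obtained; an exclusive owner
        # drops to the non-exclusive owner state.
        for other, state in self.states.items():
            if state == EXCLUSIVE_OWNER:
                self.states[other] = NON_EXCLUSIVE_OWNER
        self.states[cell] = READ_ONLY
        return False


item = DataItem()
item.write(0)               # cell 0 becomes the exclusive owner
first = item.read(1)        # miss: cell 1 fetches a read-only copy
second = item.read(1)       # hit: served from the local cache
print(first, second, item.states)
```

In this model only the first read after a write is a miss, which is the property that the techniques discussed next try to exploit.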
Repeated memory accesses to the same, shared, read-only data item require a fixed, minimal amount of time to access. Therefore, an important factor in keeping average memory access times small is to discover techniques for increasing the probability that a processor will find a copy of the shared item in its local cache. Three features of the memory architecture which move a read-only copy of a data item to a local cache prior to a request are the two programmer options, poststore and prefetch, and one architectural feature, automatic prefetching.
When a variable is updated by a write, a poststore by the writing thread will cause a valid read-only copy of the variable to be sent to all other processors which have a memory location allocated for that variable. The tradeoffs in using poststore have been studied [RSW+93]. In a lightly loaded system, poststore can be effective in reducing memory access times. In a heavily loaded system, the cost to perform the poststore can be larger than the cost for copying the variable into the local cache at the time of access. Also, when the read-only copy of the variable sent by the poststore command is received by a processing cell, the variable may not be updated if the cell is busy making other memory accesses [Ken91].
The prefetch command allows a thread to request a valid copy of a data item before it is actually required by the thread. The prefetch command has been used to study the data rate for multi-ring memory performance [DHT93].
Automatic prefetching occurs when a processor has space allocated for a particular data item, but does not have a current valid copy of it. If a different processor makes a request for this data item, it is possible for this processor to acquire (i.e., "snoop") a valid read-only copy of it as the response to the request passes by on the rings. Automatic prefetching can greatly reduce the average time for a thread to acquire data items, since it occurs before the processor requests the data items. It also allows several processors to acquire read-only copies of a data item in parallel. Because of the automatic prefetching, the time required to obtain a read-only copy of a data item does not depend simply on the distance from the owner of the data item, but also depends on the placement and number of other processing threads which share the same data item.
Since the KSR is a ring of rings, the method of communication between rings is an important factor to consider when a large number of threads share data. Because of the unique architecture, the time to communicate between rings does not depend only on the distance between rings, but also on the type and patterns of data access.
The focus of this paper is to examine the combined effects of automatic prefetching and thread placement in reducing average memory access times for shared data in a multi-threaded application. The results indicate that strategic thread placement across multiple rings of the KSR1 can substantially reduce memory access time for shared data items. Another observed, but not reported, behavior is that multiprogramming of the processing cells can reduce the measured access times for individual threads due to the overlap of fetch times with the time that a thread is context switched.
The goals of this paper are:
• to describe key features of the KSR memory architecture (Section 2),
• to design and execute a series of experiments which examine the effects of placement of threads on memory access time (Section 3),
• to identify programming techniques and thread placement options which can help to minimize memory access time (Section 4), and
• to summarize the findings and to outline how these results can assist applications programmers in optimizing their codes for the KSR multiprocessor (Section 5).
A memory request that is made for a subpage must include the state that the subpage will hold after it is acquired. The actions taken by any processors which hold a descriptor for that subpage depend on the requested state of the subpage. If a request is made to the owner processor for a read-only copy of a subpage which is currently held in the exclusive state, then the owner state changes to the non-exclusive state before the request is satisfied. The state of the subpage in other local caches which hold a descriptor for the subpage is not directly affected. It is possible for the state of the subpage in other local caches which hold a descriptor for the subpage to also change to read-only due to a copy being acquired through automatic prefetching.
Before a processor writes to a subpage, it must first obtain a copy of that subpage in exclusive owner state. At the same time, all other copies of the subpage in all other local caches change to the invalid state.
This state information specifies:
1. whether or not the owner of the subpage is on this ring,
2. whether or not there are valid read-only copies on this ring, and
3. whether or not there are valid read-only copies on other rings.
The state information is sufficient for an ARD to know whether or not a request can be satisfied on its local ACE:0 ring. If a request arrives at the ARD from its local ACE:0 ring, then it will be passed to the higher level ACE:1 ring if and only if it cannot be satisfied at its lower level ACE:0 ring. If a request arrives at the ARD from the higher level ACE:1 ring, then the ARD will extract the request from the higher level ring and pass it to the lower level ACE:0 ring if and only if it can be satisfied on its ACE:0 ring.
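The routing rule for the ARD can be written down as a short decision function. The sketch below is illustrative only: the field and function names are hypothetical, it covers read-only requests only, and it assumes that such a request can be satisfied by either the owner or any valid read-only copy on the local ring.

```python
from dataclasses import dataclass


@dataclass
class ArdState:
    """Per-subpage state kept by one ARD (field names are illustrative)."""
    owner_on_local_ring: bool
    read_copies_on_local_ring: bool
    read_copies_on_other_rings: bool


def route(state, arrived_from):
    """Decide what the ARD does with a read-only request.

    arrived_from is "ACE:0" (the local ring) or "ACE:1" (the higher level ring).
    """
    # Assumption: any valid copy on the local ring can answer a read-only request.
    local = state.owner_on_local_ring or state.read_copies_on_local_ring
    if arrived_from == "ACE:0":
        return "keep on ACE:0" if local else "pass up to ACE:1"
    if arrived_from == "ACE:1":
        return "extract to ACE:0" if local else "leave on ACE:1"
    raise ValueError(f"unknown ring: {arrived_from}")


remote_only = ArdState(False, False, True)
print(route(remote_only, "ACE:0"))   # request must leave the local ring
print(route(remote_only, "ACE:1"))   # request is not pulled down here
```

Requests for ownership would need a stricter local test, which is not modeled here.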
The total memory capacities and hardware latencies for data transfer specified by the man- 
Automatic Prefetching
When there are many shared subpages in the system, the ALLCACHE architecture allows subpages to be copied prior to a request through automatic prefetching. In automatic prefetching, when a copy of a subpage is sent through the search engine to satisfy a request, any processor whose cache has a descriptor for that subpage which is invalid may acquire (i.e., "snoop") a read-only copy as the subpage passes by on the ACE:0 ring. Automatic prefetching takes place as long as the processor is not "too busy" performing other memory accesses [Ken91]. Automatic prefetching is a powerful mechanism which reduces memory access time in applications with a high degree of read-only sharing.
As an example of how automatic prefetching can reduce access time, consider the KSR1 system as illustrated in Figure 1. Suppose that the owner of a subpage is located at cell 0, and that cell 1 and cell 2 require read-only access to the same data. Suppose that cells 1 and 2 both have a page allocated in their local cache for the data, and that the state of each requested subpage is invalid. Thus, a descriptor exists for every subpage to be requested, but a valid copy of the subpage does not exist on cells 1 and 2. Suppose that processing is such that it can be guaranteed that the thread in cell 2 will access the data before the thread in cell 1. As cell 2 makes each request on the ACE:0 ring, the owner (cell 0) will respond to it. When the response passes by cell 1, it will see it. Since cell 1 has a descriptor of the subpage allocated, it will make a copy of the subpage to its local cache. The response message will not be delayed and will be passed to cell 2. Cell 2 will acquire the subpage and remove the message from the ring. When cell 1 finally accesses the data, it will find a valid copy of the subpage in its local cache, and will not place a request message on the ACE:0 ring. The memory access time for cell 1 will be minimal.
If the placement of the threads is changed in this example, then the benefits of automatic prefetching will not be seen. For example, if the thread in cell 1 accesses the data before cell 2, then cell 2 will see the request message on the ACE:0 ring. The request will pass to cell 0, which will respond to it. However, cell 1 will see the response message and remove it from the ACE:0 ring. Cell 2 will not see the response. When cell 2 finally accesses the data, it will not have a valid copy and will have to request a copy of each subpage through the ACE:0 ring. The average memory access time for the two threads is much higher because automatic prefetching is not performed.
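The two orderings in this example can be replayed with a few lines of simulation. The sketch assumes a single subpage, a unidirectional ring on which messages travel in increasing cell order (0, 1, 2, ... and back to 0), a descriptor allocated on every cell, and cells that are never "too busy" to snoop. The function name and parameters are hypothetical.

```python
def count_requests(access_order, owner=0, ring_size=3):
    """Count the request messages placed on the ring for one subpage.

    access_order lists the reader cells in the order in which they access the data.
    """
    valid = {owner}            # cells that currently hold a valid copy
    requests = 0
    for cell in access_order:
        if cell in valid:
            continue           # local hit, nothing is placed on the ring
        requests += 1
        # The response travels from the owner to the requester in ring order.
        hop = (owner + 1) % ring_size
        while hop != cell:
            valid.add(hop)     # a cell passed on the way snoops a read-only copy
            hop = (hop + 1) % ring_size
        valid.add(cell)        # the requester removes the response from the ring
    return requests


print(count_requests([2, 1]))  # cell 2 first: cell 1 snoops the response
print(count_requests([1, 2]))  # cell 1 first: cell 2 never sees the response
```

With cell 2 reading first, one request serves both readers. With cell 1 reading first, each reader issues its own request.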
In general, the order of data access is not known a priori in a multiprocessor environment.
However, because of increased latency across different ACE:0 rings and delays introduced by the ordering of the cells on the ring, it is possible to increase the likelihood that automatic prefetching will occur through strategic placement of owner and reader threads across different ACE:0 rings. The combined effect of automatic prefetching and the placement of processing threads is the focus of the experiments in the following section.
Experiments
Workload Description
Four suites of experiments examine the effects of the number and placement of processing threads. A synthetic workload is constructed which is executed in each suite of experiments.
Two types of processing threads are used in the synthetic workload. An owner thread has the task of writing each subpage in its portion of the data set, so that it has the only valid copy in memory of the data set (i.e., is the owner of each subpage) at the start of each experiment.
A reader thread has a descriptor for every subpage of the data set, but these descriptors will be made invalid when the owner thread writes to the subpage. A reader thread requests a read-only copy of each subpage in its portion of the data set after the owner thread has written the entire data set.
A preliminary experiment, Experiment 0, illustrates the performance in the case that no reader threads share data. In Suite I through Suite IV, all reader threads share a single large data set. The synthetic workload is designed to systematically measure the average access time per subpage for the reader threads under a variety of owner and reader thread placements.
The workload performs the following steps:
• Initialization Phase (executed at the beginning of each suite of experiments)
1. A number of reader and owner threads are spawned, each of which binds to a unique processor for the duration of the experiments.
2. Each reader thread and owner thread reads a predetermined portion of the data set.
• Measurement Phase (executed for each experiment in the suite of experiments)
1. Each owner writes its portion of the data set.
2. A barrier synchronization is performed for all threads.
3. Timing begins for each reader thread.
4. Each reader thread sequentially reads its portion of the data set.
5. Timing ends for each reader thread.
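The control flow of the two phases can be sketched with ordinary Python threads. This is a skeleton only, not the code used in the experiments: it uses one owner and partitioned readers, it does not bind threads to processors, a Python list does not model the ALLCACHE memory, and `time.perf_counter` stands in for the hardware timer. The function name, the parameter defaults, and the figure of 16 words per subpage are assumptions made for illustration.

```python
import threading
import time


def run_experiment(num_readers, subpages_per_reader=1000, words_per_subpage=16):
    """Return the average read time per subpage (seconds) for each reader."""
    span = subpages_per_reader * words_per_subpage
    data = [0] * (num_readers * span)
    barrier = threading.Barrier(num_readers + 1)   # all readers plus one owner
    avg_read_time = [None] * num_readers

    def owner():
        barrier.wait()                      # wait until initialization is done
        for i in range(0, len(data), words_per_subpage):
            data[i] = i                     # Measurement step 1: write each subpage
        barrier.wait()                      # Measurement step 2: barrier

    def reader(rid):
        lo, hi = rid * span, (rid + 1) * span
        for i in range(lo, hi, words_per_subpage):
            _ = data[i]                     # Initialization step 2: touch the portion
        barrier.wait()                      # initialization complete
        barrier.wait()                      # Measurement step 2: barrier
        start = time.perf_counter()         # Measurement step 3: timing begins
        checksum = 0
        for i in range(lo, hi, words_per_subpage):
            checksum += data[i]             # Measurement step 4: one word per subpage
        stop = time.perf_counter()          # Measurement step 5: timing ends
        avg_read_time[rid] = (stop - start) / subpages_per_reader

    threads = [threading.Thread(target=owner)]
    threads += [threading.Thread(target=reader, args=(r,)) for r in range(num_readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return avg_read_time


times = run_experiment(4)
print([f"{t * 1e9:.0f} ns" for t in times])
```

The barrier is used twice so that no owner write can overlap the initialization reads, which mirrors the separation between the two phases.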
The Initialization Phase represents the overhead required for spawning threads, binding them to a processor, and allocating pages of local cache memory for the data set. Initialization Step 2 ensures that each local cache has a valid descriptor of every subpage in the data set so that all subsequent accesses to a subpage require only data movement, and do not require local cache memory allocation. The size of the data set for Experiment 0 depends on the number of reader threads, but is at most 16 MB. The size of the data set for Suite I through Suite IV is 50K subpages (6.4 MB). The entire data set fits into the local cache of the owner, so that no disk accesses are required during the measurement phase. Similar experiments on data sets of other sizes show similar results as long as the data set fits in the local cache and is significantly larger than the data subcache.
The Measurement Phase is repeated for each experiment in the suite of experiments. Measurement Step 1 sets the state of each subpage in each owner thread to exclusive owner and sets the state of each subpage in all other threads to invalid. The first access by a reader thread to a subpage will cause the state of the subpage to become non-exclusive owner in the owner thread, but the owner of the subpage does not change during the measurement phase. During the measurement phase, one word in each subpage is read, so that one entire subpage is copied for each read operation. This is the maximum rate of data copying possible, and emphasizes the effects of thread placement. Timing is done using the pmon library call from within the thread code.
A system library call is used for the barrier release mechanism in Measurement Step 2 which synchronizes the reader threads. In the barrier release mechanism, a master thread holds a lock for one or more slave threads. The master thread waits until all slave processes reach the barrier.
Then, the master releases all slave threads at the same time. The barrier release performs as follows:
1. While waiting on the barrier, all slaves spin on a shared location in memory, the "go signal".
2. At the time of release, the master updates the go signal. This sends an invalidate to each slave thread.
3. The go signal is invalidated at each slave thread. The next attempt by the slave to read the go signal causes a request to be issued on the ring.
The experiments and the synthetic workload are specifically designed to analyze how memory access times can be improved when only thread placement is varied. The memory access pattern is simple so that the behavior under different thread placements is emphasized. Even though this is a simple access pattern, many application codes contain similar patterns. For example, a database update followed by the execution of a number of client programs which read and use the latest value exhibits such an access pattern. Even when application codes contain more complicated memory access patterns, the results shown here illustrate that thread placement is a factor to consider for improving the performance of application codes.
The suites of experiments progressively illustrate the effects of placement of reader and owner threads on the KSR. The performance metric of interest in all experiments is the average read time per subpage. All experiments were run on the KSR1 system as illustrated in Figure 1. All results presented are averaged over at least 5 runs of the same synthetic workload.
Experiment 0
The goal of Experiment 0 is to identify the performance when no automatic prefetching or filtering by the ARD occurs, and all reader threads access the data from the same common owner.
The methodology of Experiment 0 is to eliminate the advantages of automatic prefetching by partitioning the data set among the reader threads, and measuring average access time per subpage as the number of reader threads varies from 1 up to 63. The owner thread is placed on Ring A in cell 31. The first 31 reader threads are placed on Ring A in processor order. The next 32 reader threads are placed on Ring B in processor order.
The data set is divided into disjoint subsets of size 2K subpages each, and each reader thread accesses a unique subset of the data set. Each reader thread reads the same number of subpages, irrespective of the total number of readers, in order to compare with the experiments in Suite I through Suite IV. The performance metric calculated is the average read time per subpage. The total size of the data set which is read is equal to the number of readers times 2K subpages. When 25 readers are executing, the size of the data set is 50K subpages, or 6.4MB.
When 63 readers are executing, the size of the data set is 126K subpages, or 16MB. This size is small enough to ensure that the entire data set fits into the local cache of the owner thread (32 MB), thus eliminating the effects of paging to disk. The results of Experiment 0 are shown in Figure 2. When the number of readers is larger than 32, then at least one reader is placed on Ring B. The graph in Figure 2 shows the average access time per subpage for readers on Ring A, the average access time per subpage for readers on Ring B, and the overall average access per subpage for all readers. The graph shows that the owner thread can satisfy requests in nearly constant time until the owner thread saturates at roughly 8 reader threads. As additional reader threads are added the average access time per subpage increases almost linearly.
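The data set sizes quoted for Experiment 0 follow from simple arithmetic, sketched below. The subpage size of 128 bytes is not stated in this section; it is inferred here from the statement that 50K subpages occupy 6.4 MB, with K taken as 1000 and MB as 10^6 bytes. The constant and function names are illustrative.

```python
SUBPAGE_BYTES = 128            # inferred: 6.4 MB / 50K subpages
SUBPAGES_PER_READER = 2_000    # each reader gets a disjoint 2K-subpage subset
OWNER_CACHE_MB = 32            # local cache size of the owner cell


def data_set_size(num_readers):
    """Return (subpages, megabytes) read in Experiment 0 for a given reader count."""
    subpages = num_readers * SUBPAGES_PER_READER
    return subpages, subpages * SUBPAGE_BYTES / 1e6


for n in (25, 63):
    subpages, mb = data_set_size(n)
    fits = mb < OWNER_CACHE_MB
    print(f"{n} readers: {subpages} subpages, {mb:.1f} MB, fits in owner cache: {fits}")
# 25 readers: 50000 subpages, 6.4 MB, fits in owner cache: True
# 63 readers: 126000 subpages, 16.1 MB, fits in owner cache: True
```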
Each reader thread accesses each subpage in its unique subset of the data set exactly one time, so that each reference to a subpage results in one request being sent to the owner. As the number of reader threads increases, the number of requests increases proportionally, the FIFO extract buffers used for extracting messages from the ring at the owner fill, and requests must be denied. The denied requests circulate around the ring for another try. Thus, queueing effects are seen, and the average access times per subpage increase proportionally to the number of reader threads. This is consistent with the behavior of an M/M/c model of the system as
The performance in this experiment is a worst case example. No data is shared among the reader threads, so no automatic prefetching occurs, and queueing effects are maximum. Also, since no two requests are for the same subpage, the ARD on Ring B cannot filter requests from Ring B destined for the owner on Ring A. When all readers share a global data set, the effects of automatic prefetching and the filtering by the ARD are introduced. Average read times per subpage reduce substantially, as shown in the experiments in Suite I through Suite IV.
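The shape of the Experiment 0 curve, flat up to a saturation point and then nearly linear, is what a simple asymptotic bound for a closed system with one shared server would predict. The sketch below is a back-of-the-envelope illustration and not the model used in the paper. It assumes each reader issues one request at a time, and both parameters are hypothetical values in arbitrary time units, chosen only so that the knee falls at 8 readers.

```python
def access_time_bound(num_readers, unloaded_time, owner_service_time):
    """Lower bound on the average access time per subpage.

    unloaded_time:      access time seen by a single reader with no contention
    owner_service_time: time the owner needs to handle one request
    """
    # Below saturation the unloaded time dominates; above it the owner is the
    # bottleneck and the time grows linearly with the number of readers.
    return max(unloaded_time, num_readers * owner_service_time)


def saturation_point(unloaded_time, owner_service_time):
    """Number of readers at which the owner becomes the bottleneck."""
    return unloaded_time / owner_service_time


# Hypothetical parameters (arbitrary units), not measured values.
UNLOADED, SERVICE = 8.0, 1.0
print(saturation_point(UNLOADED, SERVICE))
for n in (1, 4, 8, 16, 32):
    print(n, access_time_bound(n, UNLOADED, SERVICE))
```

When readers share data, automatic prefetching and ARD filtering reduce the number of requests that actually reach the owner, which pushes the effective saturation point to a larger number of readers.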
Suite I
In all experiments in Suite I through Suite IV, all readers share a single large data set. The size of this data set is 50K subpages (6.4 MB). All readers read the entire data set.
The goal of the experiments in Suite I is to examine memory access times as the location of the owner of the data set is varied on the same ACE:0 ring. The methodology of this experiment is to measure the average read time per subpage as the number of reader threads varies from 1 to 63. The owner thread is placed on Ring A. The first 31 reader threads are placed on Ring A in processor order. The second 32 reader threads are placed on Ring B in processor order.
The location of the owner thread is varied from cell 0 to cell 31 on Ring A.
Figures 3 and 4 illustrate the results of Suite I. Results were obtained for each possible location of the owner from cell 1 to cell 31, but are not presented for the sake of brevity. In Figure 3 the owner is on cell 1. This figure is characteristic of the performance obtained when the owner thread is placed on cells 1 through 11. In Figure 4, the owner is on cell 31. Figure 4 is characteristic of performance obtained when the owner thread is placed on cells 12 through 31. When the number of readers is less than 32, all readers are on the same ACE:0 ring, Ring A. When the number of reader threads is 32 or more, then at least one reader thread is placed on Ring B. When the owner thread is at cell 1 and at least one reader thread is placed on Ring B, then average access times are constant as the number of readers is increased beyond 32, as shown in Figure 3. Figure 3 shows the effect of the ARD filtering additional identical requests from Ring B. As the number of readers increases on Ring B, the number of requests for the same subpage also increases. However, the ARD passes to Ring A only a single request for all Ring B threads, so that the additional demand to the owner is only one request, irrespective of the number of reader threads on Ring B.
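The effect of ARD filtering on the demand seen by the owner can be expressed as a simple count. The sketch below gives an upper bound on the number of requests per subpage that reach an owner on Ring A, assuming that every Ring A reader issues its own request (that is, ignoring any automatic prefetching on Ring A) and that all Ring B requests for a subpage are combined into one. The names are illustrative.

```python
RING_A_READERS = 31   # the first 31 readers share Ring A with the owner


def owner_requests_per_subpage(num_readers):
    """Upper bound on requests per subpage arriving at the owner on Ring A."""
    on_ring_a = min(num_readers, RING_A_READERS)
    on_ring_b = num_readers - on_ring_a
    # The ARD passes a single request to Ring A for all Ring B readers.
    filtered_from_b = 1 if on_ring_b > 0 else 0
    return on_ring_a + filtered_from_b


for n in (16, 31, 32, 48, 63):
    print(n, owner_requests_per_subpage(n))
```

Under these assumptions the demand stops growing once the first Ring B reader is added, which matches the flat region of Figure 3 beyond 32 readers.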
When the location of the owner thread is varied around the ring, the average read times decrease for all threads on both Ring A and Ring B as the number of reader threads is increased beyond 31. This effect is observed in the experiments when the owner cell is on cells 13 through 31. Figure 4 shows average access times when the owner thread is on cell 31. At first, this seems paradoxical. When threads are added on a ring which is remote to the owner (Ring B) which make requests for data on Ring A, the average access times for both Ring A and Ring B threads decrease. The location of the ARD, the direction of ring traffic (which is also the order of the barrier release mechanism), and relative placement of the owner thread to the placement of reader threads on Ring B combine to increase the amount of automatic prefetching for threads on Ring A, and decrease queueing effects at the owner thread. The automatic prefetching increases the number of valid subpages which are found in the local cache of each reader thread in Ring A. This causes a decrease in the average read time per subpage. Further, as automatic prefetching causes valid copies of the subpages to be available for reader threads, the number of requests is reduced. Since the number of requests is decreased, demand at the owner thread drops, and queueing effects are reduced. Average read times per subpage are reduced for all reader threads. The amount of automatic prefetching is probabilistic, and depends on the relative rate of execution of each of the reader threads.
In Suite I, the owner is on Ring A. Reader threads on Ring A are on a local ring with respect to the owner of the data. Reader threads on Ring B are on a remote ring with respect to the owner of the data. It was found that placing some number of reader threads on a ring which is remote to the owner improves the performance of reader threads on both Ring A and Ring B.
An interesting observation can be made for the graphs in Figure 5 and the left half of the graph in Figure 4. In both experiments the owner is on Ring A, and the total number of reader threads increases from 1 up to 31. In Figure 4 all readers are on Ring A, which is local to the owner. In Figure 5 all readers are on Ring B, which is remote to the owner. Figure 7 shows these two curves on the same graph. There is a crossover point in the graph. When more than 20 reader threads are executing, better performance is observed when all reader threads are on the remote ring as compared to when all readers are on the same ring as the owner. This effect is due to the behavior of the ARD, which filters requests from Ring B, and the effects of automatic prefetching by the reader threads on Ring B.
Suite III
The goal of the experiments in Suite III is to fix the location of the owner thread, and examine the effects of various placements of reader threads. Cell 31 on Ring A is chosen as the location shown in Figure 9. It is clear that the placement of the reader threads across the two rings can affect the overall average subpage access time, especially when the number of reader threads is less than the full complement of processing cells.
Suite IV
The goal of this suite of experiments is to demonstrate that further performance improvements can be achieved by increasing the amount of automatic prefetching using a simple programming technique. The performance improvement due to automatic prefetching is more pronounced if this technique is combined with a good placement strategy for the owner threads.
The methodology used in this experiment is that the reader threads do not access the same subpage simultaneously. Each reader thread starts reading at a different point in the data set.
With this staggered readers technique, each reader thread has a different access pattern from every other reader thread and the effect of automatic prefetching is more noticeable.
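One way to realize the staggered readers technique is to give each reader its own starting offset into the shared data set and let it wrap around at the end. The sketch below is an illustration only: the text states that each reader starts at a different point, while the even spacing of the starting points and the wraparound order are assumptions, and the function name is hypothetical.

```python
def staggered_order(reader_id, num_readers, num_subpages):
    """Return the order in which one reader visits the shared subpages."""
    # Assumption: starting points are spread evenly across the data set.
    start = (reader_id * num_subpages) // num_readers
    return [(start + k) % num_subpages for k in range(num_subpages)]


# Four readers sharing a data set of eight subpages.
for rid in range(4):
    print(rid, staggered_order(rid, num_readers=4, num_subpages=8))
# 0 [0, 1, 2, 3, 4, 5, 6, 7]
# 1 [2, 3, 4, 5, 6, 7, 0, 1]
# 2 [4, 5, 6, 7, 0, 1, 2, 3]
# 3 [6, 7, 0, 1, 2, 3, 4, 5]
```

Every reader still reads the entire data set, but at any moment the readers are requesting different subpages, so a response to one reader is more likely to pass cells that have not yet requested that subpage.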
Interpretation of Results
The results of the experiments show that with strategic thread placement and coding that takes advantage of the architecture, memory access times on the KSR1 can remain fairly constant as the number of threads that share a data set increases. A number of implications for applications programmers can be identified:
• Suite I also shows that additional threads that share a data set can actually improve performance, as long as the threads are placed strategically to take advantage of automatic prefetching and ARD filtering.
• The placement of the owner thread of a data set particularly affects the performance of reader threads that are placed on a remote ring, as shown in Suite II.
• Code changes that allow a shared data set to be distributed among several owners, or stagger the access pattern among readers, can also substantially improve the performance, as shown in Suite IV. This effect is less noticeable as the number of reader threads increases.
Conclusions and Future Work
The key issue addressed in this paper is the impact of thread placement in a multiprocessor system. A measurement based approach is taken. The specific system considered is the Kendall Square Research KSR1. Even though all processors are physically identical, the specific thread placement affects performance due to unique architectural features. A series of controlled workloads is constructed and placed on the system. Various thread placement experiments are conducted and the results reported.
The primary contributions of this paper include: a description of the key features of the KSR architecture with emphasis on the ALLCACHE memory structure; the design and execution of a series of experiments which illustrate the unique memory access behaviors on the KSR1; and the identification of programming techniques and thread placement strategies which improve the performance of the system.
The experiments illustrate several interesting and unexpected results. These findings indicate that several extensions to this work are appropriate. Such extensions include:
• The testing of actual application codes. The controlled workloads considered in this paper are synthetically generated. Although designed to mimic certain application codes, it is necessary to identify and test actual codes to determine the effects of thread placement.
• The testing on KSR1 systems with multiple (i.e., more than two) ACE:0 rings. Although it is expected that the results reported here generalize to more rings, experimental verification is appropriate. Also, similar experimentation on a KSR2, with an extra level in the ring hierarchy, is needed.
• Analytic modeling and prediction of the KSR1/KSR2. This work represents a preliminary study of the effects of thread placement. Several graphs which represent various particular aspects of the system are given. These measurement figures should form the basis for the construction and validation of appropriate analytic models.
• The combined analysis of explicit prefetch (i.e., the programmer option), poststore, and automatic prefetch. This paper concentrates on automatic prefetch. Other papers have concentrated on explicit prefetch and still others have concentrated on poststore. Understanding when each feature is most beneficial would be worthwhile.
• The testing and analysis of the effects of multiprogramming. In this work, a single multithreaded workload is considered. Understanding the effects of multiple multi-threaded workloads with respect to the overall thread placement strategy is desired.
