Abstract
1. Introduction
1.1 Distributed Shared Memory [Li]
A shared memory multiprocessor offers a favourable programming paradigm: a global address space in which concurrently executing program components communicate through shared variables. However, building an efficient network connecting all the processing elements can be expensive. Furthermore, such a system has poor scalability.
On the other hand, a distributed memory multiprocessor is more scalable, but programming it is clumsy and difficult because communication between processor nodes must be explicitly coded using message passing (Figure 1). Therefore, the concept of the distributed shared memory (DSM) multiprocessor was developed to combine the advantages of both and eliminate some of their pitfalls.
The hardware architecture of a DSM multiprocessor is the same as, or very similar to, that of a distributed memory multiprocessor, with the addition of a software or hardware layer (a DSM abstraction) to enable non-uniform memory access (NUMA). When a memory access is issued, it will be faster if the information requested is located at the memory module of the local processor node. However, it will take longer if the information requested is located at the memory module of a remote processor node, though the [...] for ensuring data consistency over the global address space, and the [...].

Unlike a parallel program executing on a distributed memory multiprocessor, which may be scheduled and partitioned to match the underlying architecture so as to minimize the time spent in remote memory accesses, DSM assumes a shared memory, so application programs running on top of it have no knowledge of whether a memory access is local or remote. Consequently, the number of remote memory accesses may be higher than expected. Several latency hiding techniques [...] have been developed to cater for this situation: prefetching, coherent caches, relaxed memory consistency, and multiple contexts.

Prefetching hides the latency of memory accesses by issuing them in advance, expecting the data to be available when the executing program needs them [Sa]. Coherent caches reduce cache misses in hardware, and hence the frequency of remote memory accesses. Relaxed memory consistency models hide latency by pipelining and buffering memory accesses. The multiple-contexts technique attempts to hide latency by switching between the contexts of different program execution components when a latency (remote memory access or synchronization) is encountered.
Processes versus Threads
To justify the technique of latency hiding using multiple contexts, context switching time is a determining factor. The context of a process can be divided into two parts: system resources and execution state. Typical system resources associated with a process are the addressable memory space, opened files, allocated communication ports, access control information, etc. They are often the larger part of a process context, especially the memory address space, whose mapping relies on a large buffer called the translation lookaside buffer (TLB). On the other hand, the execution state consists mainly of the processor registers, stack pointer, and program counter. It is often a small part of a process context.
In a traditional operating system, a process (Figure 4a) supports only a single flow of execution (also called a thread). Therefore, switching between different flows of execution requires saving and restoring the corresponding process contexts, which may take a very long time because of the significant system resources involved. In a modern operating system, a process [...]

In order to compare and analyze the effectiveness of multithreaded self-scheduling (MSS) in latency hiding, a series of simulation experiments was performed in comparison with guided self-scheduling (GSS). The simulation results suggest the boundary conditions under which MSS obtains its best performance, which may be useful as criteria for improving both the working mechanism of threads and the algorithmic approach of MSS.
Organization of the Paper
This paper is organized into the following sections. Section 2 revisits some well-known loop scheduling schemes for shared memory multiprocessors. The multithreaded self-scheduling scheme is then introduced in Section 3, with an explanation of its working principles and its suitability for DSM multiprocessors. In Section 4, a simulation model is developed, and some simulation cases are studied in Section 5 to compare the characteristics of the multithreaded self-scheduling and guided self-scheduling schemes under different simulation conditions. Lastly, the simulation results and their implications are discussed in Section 6, followed by a conclusion in Section 7.

[...] Figure 5, a doall loop of 1000 iterations is scheduled on a four-processor shared memory multiprocessor by the GSS scheme. For each job requested by a processor after completing a previous job, a chunk of c iterations is scheduled, and c can be calculated by the following equation. [...]

There are also other scheduling schemes derived from GSS that further improve the load balancing (Factoring) [Hu] or the scheduling overhead (Trapezoid self-scheduling) [Tz]. One common feature of all these dynamic loop scheduling schemes is that they schedule many loop iterations in a chunk at a time. This common feature not only reduces the scheduling overhead but also allows the multiple-contexts (multithreading) latency hiding technique to be applied.
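As a minimal sketch of the chunk calculation described above, the following assumes the rule usually given for GSS in the literature, c = ceil(R / P), where R is the number of iterations still unscheduled and P is the number of processors. The function name `gss_chunks` is hypothetical and is used only for illustration.

```python
def gss_chunks(total_iterations, processors):
    """Return the successive chunk sizes handed out by guided self-scheduling.

    Assumes the commonly cited GSS rule c = ceil(R / P), with R the
    remaining iterations and P the number of processors.
    """
    remaining = total_iterations
    chunks = []
    while remaining > 0:
        # Integer ceiling division avoids floating-point rounding.
        c = (remaining + processors - 1) // processors
        chunks.append(c)
        remaining -= c
    return chunks


# The doall loop of 1000 iterations on four processors used in the text.
chunks = gss_chunks(1000, 4)
print(chunks[:5])   # first five chunk sizes
print(sum(chunks))  # every iteration is scheduled exactly once
```

Under this rule the chunk sizes shrink as the loop drains, so the early chunks contain many iterations and are the ones that offer the most threads for multithreading.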
Multithreaded Self-Scheduling
Although the above self-scheduling schemes are well known as appropriate methods for scheduling loops on shared memory multiprocessors, using the same technique on a DSM multiprocessor may result in substantial performance degradation due to the large number of remote memory accesses. Consequently, the multithreaded self-scheduling (MSS) scheme is proposed in this paper to address this issue.
In executing doall loops with a large number of iterations on a multiprocessor system, the loops are often divided into chunks. Each chunk contains a number of iterations and executes on an allocated processor node as a process. For example, using the GSS scheme in Figure 6, a doall loop with a loop count of 1000 may be divided into a number of chunks. These chunks can then be scheduled on processor nodes as smaller processes (sub-tasks).
A sub-task can further be divided into smaller processes such that each iteration is itself a process. Furthermore, these one-iteration processes may share the same system resources in execution. Hence, it is appropriate to define them as threads and to encapsulate them into a process sharing the same system resources instead. This configuration of multithreaded sub-task supports multiple contexts of threads with efficient thread management operations. It is also the basic chunk defined in the scheme of MSS.
When a sub-task is executing on a processor node, latency may be introduced. Two kinds of latency are common in a DSM multiprocessor, namely remote memory access latency and synchronization latency. Executing an instruction may involve some operands, and these operands may be located at different processor nodes. For example, as described in Figure 7, the operands Ma and Mb are located at processor nodes A and B respectively. If a processor node, say A, is not the node where the instruction is executing, its operand needs to be requested from the remote processor node A. Thus, a remote memory access latency would be expected. Moreover, the two operands, Ma and Mb, are likely to become available at different times, t1 and t2 respectively. Remote memory access latency may be considered substantial. Synchronization latency, however, is difficult to forecast: it may be small or large, and it varies with program behaviour and execution environment. In MSS, only the remote memory access latency is intended to be hidden.

The working principle of MSS (Figure 8) is based on the cooperative work of the DSM server and the sub-task's thread scheduler. When a memory access is issued, the DSM server determines whether the requested information is available in the local memory module. If it is (a hit), a local memory access is performed and the currently executing thread continues. If it is not (a miss), the DSM server resolves the memory access by requesting the information from a remote processor node. There are various methods of performing this resolution in different DSM schemes [Li]. In MSS, the DSM server needs to notify the thread scheduler on a miss, and the scheduler can use this information to block the currently executing thread and allocate the processor to another runnable thread.

When the remote access is completed and the information has been transferred from the remote memory module to the local memory module, the thread scheduler is notified again so that the previously blocked thread becomes runnable and can be reallocated. If the time cost of managing the threads is small and there is a sufficient number of threads to switch between, the remote memory access latency may be effectively hidden.

Although only GSS is multithreaded in our simulation study, MSS is a general technique which may be applied to most self-scheduling schemes with a reasonable chunk size. Chunk self-scheduling, factoring, and trapezoid self-scheduling can all be multithreaded. Furthermore, prescheduling can also be multithreaded as long as the chunk size is large enough.
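The cooperation between the DSM server and the thread scheduler can be sketched with a small single-processor model. This is an illustrative sketch only, not the simulator used in this paper. It assumes that each thread is a list of operations, either `("exe", t)` or `("mem", is_hit)`, that one context switch is charged every time a thread is dispatched, and that time units are arbitrary. The names `run_gss` and `run_mss` are hypothetical.

```python
import heapq
from collections import deque


def run_gss(threads, local, remote):
    """Single-threaded chunk: every remote access stalls the processor."""
    time = 0
    for ops in threads:
        for kind, value in ops:
            if kind == "exe":
                time += value      # pure execution time
            elif value:
                time += local      # hit: local memory access
            else:
                time += remote     # miss: wait for the remote node
    return time


def run_mss(threads, local, remote, switch):
    """Multithreaded chunk: a miss blocks the thread and another is dispatched."""
    time = 0
    ready = deque((tid, 0) for tid in range(len(threads)))  # (thread id, next op)
    blocked = []                                            # heap of (wake time, thread id, next op)
    while ready or blocked:
        if not ready:
            # Nothing is runnable: idle until the earliest remote access completes.
            wake, tid, pc = heapq.heappop(blocked)
            time = max(time, wake)
            ready.append((tid, pc))
        # Threads whose remote data has arrived become runnable again.
        while blocked and blocked[0][0] <= time:
            _, tid, pc = heapq.heappop(blocked)
            ready.append((tid, pc))
        tid, pc = ready.popleft()
        time += switch                                      # assumed cost of one dispatch
        ops = threads[tid]
        while pc < len(ops):
            kind, value = ops[pc]
            pc += 1
            if kind == "exe":
                time += value
            elif value:
                time += local
            else:
                # Miss: block this thread until the remote access completes.
                heapq.heappush(blocked, (time + remote, tid, pc))
                break
    return time


# Toy chunk: four iterations, each with one remote access in the middle.
chunk = [[("exe", 10), ("mem", False), ("exe", 10)] for _ in range(4)]
print(run_gss(chunk, local=1, remote=100))
print(run_mss(chunk, local=1, remote=100, switch=5))
print(run_mss(chunk, local=1, remote=100, switch=200))
```

In this toy setting the multithreaded chunk finishes sooner than the single-threaded one when the switch cost is small relative to the remote latency, and later when the switch cost exceeds it.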

Simulation model
In order to compare the performance of the multithreaded self-scheduling scheme against the traditional self-scheduling scheme, a simulation model was built and tested based on both MSS and GSS. The chunks scheduled in MSS are the same size as those of GSS. The only difference is in the behaviour of the chunks. In MSS, each chunk is a multithreaded process, while in GSS it is a single-thread process, as depicted in Figure 9.

For the sake of simplicity in the simulation, the overheads of creating and destroying threads are ignored, on the assumption that the number of context switching operations between threads is sufficiently large compared with the number of thread creation and destruction operations; hence the total time of context switching operations dominates the share of a chunk's execution time that is contributed by thread management. This is often the case because thread creation and destruction consist mostly of the allocation and deallocation of a small amount of storage for the execution state (processor registers, stack pointer, program counter). These overheads are generally small, though they depend on the specific system and thread implementation method. In addition, the number of context switches is substantial in MSS, as remote memory accesses are common in a DSM multiprocessor and each remote memory access triggers at least one context switch (one for blocking this thread, and maybe another one for dispatching this thread later when the remote memory data is available). Furthermore, synchronizations between threads are also ignored, because there is no dependency between threads in doall loops.

Figure 10. State diagram of a thread in MSS.
Intuitively, the trade-off between MSS and GSS is between the context switching time and the remote memory access latency. Therefore, several simulation cases were investigated and simulated with varying simulation parameters in order to study this intuition in detail.
Simulation Cases
The doall loop used in the simulation is shown in Figure 11. It contains no interdependency between different iterations within the loop, and each iteration is characterized by the portion of ExeOnly (execution time units without memory access) and the portion of [...] in an iteration is assumed to be random because the memory access distribution in a set of instructions is determined by the specific application, the coding method of the programmer, and the code generation method of the compiler. With such complex factors affecting the memory access pattern, it is almost impossible to forecast when a memory access will be issued. Therefore, the use of a random memory access pattern seems appropriate in this simulation.

[...] shown. Hence, the processor efficiency improvement is relatively small. On the other hand, a very low hit ratio results in a large number of remote memory accesses and relatively few available threads for effective multiple-contexts latency hiding. Therefore, the processor efficiency improvement is relatively small too. At the optimal point, the hit ratio is neither too high nor too low, so that the number of threads matches the number of remote memory accesses for the best latency hiding effect. In short, the number of iterations (threads) in a chunk and the number of remote memory accesses may need to match in order to obtain the best processor efficiency improvement from MSS.
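A random memory access pattern governed by a hit ratio can be sketched as follows. This is an assumption-laden illustration, not the generator used in the simulation: the function name `make_iteration` and the parameter names `exe_only`, `mem_accesses`, and `hit_ratio` are hypothetical, and each operation is taken to be one unit long.

```python
import random


def make_iteration(exe_only, mem_accesses, hit_ratio, rng):
    """Build one loop iteration as a randomly ordered list of operations.

    ("exe", 1)       one execution time unit without memory access
    ("mem", is_hit)  one memory access, local (True) or remote (False)
    """
    ops = [("exe", 1)] * exe_only
    # Each access hits the local memory module with probability hit_ratio.
    ops += [("mem", rng.random() < hit_ratio) for _ in range(mem_accesses)]
    # Shuffling places the memory accesses at random points in the iteration.
    rng.shuffle(ops)
    return ops


rng = random.Random(0)  # fixed seed so a run can be repeated
iteration = make_iteration(exe_only=20, mem_accesses=5, hit_ratio=0.97, rng=rng)
misses = sum(1 for kind, value in iteration if kind == "mem" and not value)
print(len(iteration), misses)
```

Passing an explicit `random.Random` instance keeps the pattern reproducible, which makes it easier to compare two scheduling schemes on exactly the same sequence of accesses.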
Effect of Varying Thread Context Switching Time
Referring to Figure 17, we can observe that varying the local memory access latency has no significant effect on the overall execution time difference between MSS and GSS. This can be interpreted to mean that the improvement in latency hiding is not affected by this parameter. This is logical, as the switching of threads is determined solely by the signal from the DSM server to the thread scheduler, which in turn is decided by the nature of the memory access (local or remote). If the local memory access latency is large compared with the context switching time, the performance of MSS could be improved further by switching threads on encountering any kind of memory access.
Since the number of local memory accesses is large for the default hit ratio (97%), the improvement in processor efficiency becomes insignificant when the local memory access latency [...]

For the case with fewer processors, the average chunk size is relatively large, and multithreading on a larger chunk may result in a better latency hiding effect. However, as the number of processor nodes increases, the performance of GSS and MSS converges. At one extreme, the chunk size reduces to one when the number of processors increases to 1000. In this situation, MSS and GSS do not differ, as the processor nodes in both schemes have only one thread to execute. Of course, the task completes faster with an increasing number of processors, but the processor efficiency is poorer. In practice, scheduling one iteration at a time is not a good idea because the scheduling overhead is large; a slight modification of GSS that defines a minimum chunk size has been suggested in [Po], which may also be beneficial to MSS.
Different Level of Threads
In the above simulation study, we have not assumed any specific thread implementation, nor the details of the DSM. Since their implementation decisions may be closely related, threads (like DSM) can also be implemented at different levels. As we have discussed the implementation choices of DSM before, let us now look at the issues concerning threads. In practice, threads can be classified as user-level threads and kernel-level threads. A user-level thread implementation can result in fast thread management operations because thread scheduling is done in user space. However, the scheduler has no way to access information in the kernel, and hence only non-blocking kernel system calls may be used, which are usually slower. In contrast, a kernel-level thread implementation does not have this problem, as the scheduler can obtain kernel information for thread management, but it suffers a serious performance drawback because thread management needs to be performed through system calls. Furthermore, the crash of a kernel thread may corrupt the kernel, so protection checking has to be performed on kernel thread operations, which is time consuming.
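The idea of user-level thread management, where switching happens entirely in user space without a system call, can be illustrated with Python generators acting as cooperative threads. This is only an analogy under stated assumptions: the names `worker` and `run_round_robin` are hypothetical, and a real user-level thread package would save and restore processor registers rather than rely on generators.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative thread: does one step of work, then yields the processor."""
    for i in range(steps):
        log.append((name, i))
        yield  # voluntarily hand control back to the user-level scheduler


def run_round_robin(tasks):
    """User-space scheduler: dispatch runnable threads in round-robin order."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the thread until its next yield
            ready.append(task)  # still runnable, so requeue it
        except StopIteration:
            pass                # the thread has finished


log = []
run_round_robin([worker("a", 2, log), worker("b", 2, log)])
print(log)
```

The scheduler here never enters the kernel, which is why such switches are cheap. It also cannot see kernel events, which corresponds to the limitation of user-level threads noted above.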
A hybrid kernel/user-level thread management system based on scheduler activations has been suggested, which retains most of the benefits of both user-level and kernel-level threads [An] [...]. The idea of this management system is that user-level threads are built on top of a kernel entity called a scheduler activation. A scheduler activation supports communication between user-level threads and the kernel by notifying the user-level threads of kernel events and vice versa. Therefore, such threads perform well because they execute at user level, while retaining the functionality of kernel-level threads.
Conclusion
From the above performance analysis of MSS and GSS, we conclude that MSS is an efficient loop scheduling scheme for DSM multiprocessor. However, several considerations may be desirable in order to obtain the best performance from MSS.
The context switching time between threads needs to be small compared with the remote memory access latency. For our default simulation parameters, a context switching time of 30% or less of the remote memory access latency (i.e. 300 time units or fewer) can result in a 1.8 to 2.6 times improvement in processor efficiency, or a 1.7 to 2.4 times improvement in execution time, by using MSS instead of GSS. When the context switching time is comparable to or larger than the remote access latency, MSS performs poorly.

The number of remote memory accesses and the number of threads in a multithreaded sub-task need to be matched for the best processor efficiency. Therefore, the calculation of the chunk size (number of threads) can be modified from the method of GSS, with consideration given to the number of remote memory accesses.

Local memory accesses may also be hidden if the context switching time is sufficiently small compared with the local memory access latency.

MSS is more efficient when the number of processors in the multiprocessor is small and the loop (or total number of threads) is relatively large.
Several issues related to this simulation can be investigated further. A more sophisticated simulation model, which considers both the memory locality effect of DSM and some real problem loops, can be studied. A machine realization of MSS is now under investigation, which would reflect the real execution environment more accurately. Further investigation into applying the concept of MSS to other scheduling schemes is in progress, and may eventually cover doacross loops and loops with other dependencies.
