Abstract-The static interconnection network topologies used in distributed shared memory (DSM) systems have several limitations. Reconfigurable interconnection networks may reduce network congestion and latency and improve overall performance. However, it is necessary to know when to perform the reconfiguration and how to implement it. In this paper, we present our study of the communication patterns of parallel scientific and commercial benchmark applications on a simulated but realistic DSM machine and their relation to context switching in the operating system. We also propose a reconfiguration scheme that is triggered by context switches.
I. INTRODUCTION
Most distributed shared-memory multiprocessor systems today have a fixed interconnection network topology connecting the different nodes. These static topologies, such as tori and hypercubes, still have several connectivity limitations when the systems are scaled up. Developing parallel algorithms whose communication patterns closely match the interconnection topology of a target parallel system is often difficult and inconvenient. And although one topology may have the ideal node distribution for data interchange for one set of algorithms, it may not be well suited to another set, introducing large network latencies and hence unacceptable communication delays for those algorithms [1].
Reconfigurable interconnection networks for multiprocessor systems have attracted a lot of research interest [1] [2] [3] [4] [5]. Almost all of the published work on reconfigurable interconnects considered only abstract models of reconfigurable networks [1, 2, 4] or addressed the implementation of programmable interconnects for processor arrays using FPGAs [3, 5]. With the advent of new photonic devices, such as wavelength-tunable Vertical Cavity Surface Emitting Lasers (VCSELs), tunable optical filters and resonant cavity detectors, reconfigurable interconnects can be implemented with optical technologies, with all the extra benefits commonly associated with optical networks in comparison to electrical technologies.
Although several models of reconfigurable interconnection networks for parallel computer architectures have already been proposed, very few of them have been built, and an actual performance gain is yet to be demonstrated.
In our research, we investigate through extensive simulations whether reconfigurable interconnects can improve the performance of a distributed shared-memory machine, and we propose a possible implementation that makes use of available low-cost optical technologies.
In this paper, we focus on the communication patterns of parallel scientific and commercial benchmark applications on a simulated but realistic DSM machine and their relation to context switching in the operating system. The paper is organized as follows. Section II describes the architecture of the simulated distributed shared-memory machine and the reconfigurable interconnection network that were chosen for this study. The simulation environment used to run the benchmarks and track the communication patterns is described in section III. The simulation results and further discussion are presented in section IV. Finally, the conclusions are summarized in section V.
II. SYSTEM ARCHITECTURE
A. Distributed shared-memory multiprocessors
In the distributed shared memory (DSM) architecture, the shared main memory is physically distributed among the processors as local memory units, and all processor nodes are connected by an interconnection network. A memory access can therefore be either local or remote, depending on where the data resides. The multiprocessor interconnects found in DSMs can be classified according to the topology connecting the different nodes in the system, each of which contains a processing unit and a part of the distributed global memory.
In bus-based architectures, the interconnection network is a shared bus located between the processors' private caches and the shared main memory. This architecture is also referred to as a symmetric multiprocessor (SMP) and, owing to its simplicity, has been widely used for small and medium-scale multiprocessors consisting of 2 to 32 processors.
The proposed interconnection network architecture for the DSM system in this paper consists of a fixed base network, arranged in a torus topology, connecting all the nodes (processors and local memories). In addition, we place a certain number of free reconfigurable links (see Figure 1) between pairs of nodes that are expected to have a temporarily heavy traffic load. These extra links can be employed as direct point-to-point connections to route the traffic between the busiest processor node pairs when congestion occurs, while the base network still carries all other communications. Network congestion and latency can therefore be reduced significantly. The direct connections stay alive for a certain interval of time, after which the extra links are reassigned according to new traffic measurements.
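The effect of a direct extra link on path length can be illustrated with a minimal Python sketch. It assumes a 4x4 torus whose nodes are numbered 0 to 15 row by row, treats extra links as bidirectional, and lets an extra link carry only the traffic between its two endpoints; the numbering and the function names are illustrative assumptions, not part of the proposed architecture.

```python
from typing import Iterable, Tuple

WIDTH, HEIGHT = 4, 4  # assumed 4x4 torus, nodes numbered 0..15 row by row


def torus_hops(a: int, b: int, width: int = WIDTH, height: int = HEIGHT) -> int:
    """Minimal hop count between nodes a and b on the base torus."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    dx = abs(ax - bx)
    dy = abs(ay - by)
    # Wrap-around links let a packet take the shorter way around each ring.
    return min(dx, width - dx) + min(dy, height - dy)


def hops_with_extra_links(
    a: int, b: int, extra_links: Iterable[Tuple[int, int]]
) -> int:
    """Hop count when a direct extra link, if present, replaces the torus path."""
    links = {frozenset(pair) for pair in extra_links}
    if a != b and frozenset((a, b)) in links:
        return 1  # direct point-to-point connection
    return torus_hops(a, b)  # all other traffic stays on the base network


print(torus_hops(0, 10))                        # 4 hops on the base torus
print(hops_with_extra_links(0, 10, [(0, 10)]))  # 1 hop over the extra link
```

In this simplified model, the farthest node pairs on the torus gain the most from an extra link, while traffic between all other pairs is unaffected.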
Compared with topologies in which all links of the network take part in the reconfiguration, this network architecture has a number of advantages. Because the base network is always available, no part of the network can become disconnected while a program is running, and the routing and reconfiguration algorithms are greatly simplified. However, to make optimal use of the reconfigurable links, it is necessary to know when to perform the reconfiguration and where to place the extra links. We focus our research on the trigger conditions that can lead to a reconfiguration. One such trigger could be the context switches performed by the operating system. A context or task switch can be expected to impose high traffic demands on the interconnection network during short intervals of time, and may thus produce a communication pattern that helps answer the question of when to reconfigure.
C. Context switching in the operating system
The operating system (OS) is a critical part of a multiprocessor system, because it controls the execution of the running applications and the switches between them. As a processor can execute only one task at a time, an OS must create structures that enable it to run multiple tasks concurrently. The basic structure for managing the execution of an application is the process; it provides the resources needed to make each task an isolated entity. During normal execution, only one process per processor can be executed. After a certain time interval, the processor can switch to another process; this procedure is known as a context switch.
During a context switch, the kernel saves the state of the currently running process or thread and then loads the state of the next one to be executed. It needs to save enough information about the current execution so that it can be resumed later. Just after the context switch, the processor works with a completely different set of code and data. The data in the cache is therefore invalidated, and a communication peak towards this processor occurs as the caches are refilled while new read and write instructions are executed.
Before the OS assigns another process to the running state, it must carry out several steps, some of which involve moving data from different nodes through the interconnection network to the memory. These steps include saving the context of the processor (the program counter and other registers), moving the process control block of the current process to the appropriate queue in the scheduler, updating the process control block of the selected process, updating the memory-management structures, and restoring the context of the processor by loading the saved data. All these operations generate a sudden burst of traffic on the network as these structures are moved between caches and memories.
This suggests that there will be high traffic loads after every context switch, making it a suitable trigger for adapting the network. It motivated us to investigate the communication patterns of different parallel applications on the simulated DSM machine during process execution.
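The cache-refill burst described above can be illustrated with a toy model. The sketch below is a simplified assumption rather than the simulated machine: it uses a small, fully associative LRU cache and two hypothetical processes with disjoint working sets, and it counts the misses in each pass over the working set.

```python
from collections import OrderedDict
from typing import Iterable, List


class LRUCache:
    """Tiny fully associative LRU cache that counts misses (illustrative only)."""

    def __init__(self, capacity_lines: int) -> None:
        self.capacity = capacity_lines
        self.lines: "OrderedDict[int, bool]" = OrderedDict()
        self.misses = 0

    def access(self, address: int) -> None:
        if address in self.lines:
            self.lines.move_to_end(address)  # hit: mark as most recently used
            return
        self.misses += 1  # miss: the line must be fetched from memory
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def misses_per_pass(
    cache: LRUCache, working_set: Iterable[int], passes: int
) -> List[int]:
    """Return the number of misses in each pass over the working set."""
    addresses = list(working_set)
    counts = []
    for _ in range(passes):
        before = cache.misses
        for address in addresses:
            cache.access(address)
        counts.append(cache.misses - before)
    return counts


cache = LRUCache(capacity_lines=64)
process_a = range(0, 64)       # hypothetical working set of process A
process_b = range(1000, 1064)  # disjoint working set of process B

print(misses_per_pass(cache, process_a, 3))  # [64, 0, 0]: A warms the cache
print(misses_per_pass(cache, process_b, 3))  # [64, 0, 0]: burst after the switch
```

In this model the misses cluster in the first pass after each switch, which corresponds to the short traffic peak expected right after a context switch.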
III. SIMULATION ENVIRONMENT
We have established a full-system simulation environment based on Simics [7], a commercially available execution-driven multiprocessor simulator. Simics is able to simulate complete computing systems, including their operating system and realistic workloads. The simulator was configured to model a multiprocessor machine based on the Sun Fire 6800 server, with 16 UltraSPARC III processors at 1 GHz running the Solaris 10 operating system. We extended the Simics simulator with an interconnection network module, in which we modeled a 4x4 torus network with contention and cut-through routing.
The SPLASH-2 scientific parallel benchmark suite [8], a standard benchmark for parallel multiprocessors, and the Apache web server 2.0 together with the Apache benchmark application were chosen as the workloads for stressing the system under test. A statistics module was developed and added to Simics, allowing us to collect information about network traffic, context switches, memory and processors during the executions.
IV. SIMULATION RESULTS
A. Communication patterns and context switches
We have run a set of benchmark programs (FFT, LU, Cholesky, Barnes, Raytrace, etc.) from the SPLASH-2 application suite with default problem sizes, and also installed the Apache web server 2.0 on the simulated machine. In order to measure the traffic pattern of the benchmark applications, we added instructions to the simulator source code so that, every time a packet is injected into the interconnection network, we record the sender node, the receiver node, the size of the packet and a time stamp, and save this information in a log file.
After running the benchmark programs on the simulated system, we obtain information about all the packets that travelled through the network, including their source and destination processors, their delay and their size, for every application. Based on this data, we can calculate the total traffic of every processor, including the incoming and outgoing traffic, over measurement intervals of 100 µs. The evolution of the traffic over the execution time gives us the traffic pattern, or communication pattern, of an application. For example, Fig. 2 shows the traffic pattern of processor number 1 running the Cholesky algorithm during 20 ms. The total traffic of node 1 is drawn as a thick line; the traffic from processor 1 to processor 4 and to processor 6 is drawn as dotted lines, while context switches are indicated by vertical lines.
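The per-interval traffic calculation described above can be sketched as follows. The record layout, the field names and the use of integer nanosecond time stamps are assumptions made for illustration; the actual log format of the simulator module is not specified here, and the interval length is a parameter.

```python
from typing import Iterable, List, NamedTuple


class PacketRecord(NamedTuple):
    """One log entry; field names are hypothetical, not the simulator's."""
    timestamp_ns: int  # injection time in nanoseconds
    src: int           # sender node
    dst: int           # receiver node
    size_bytes: int    # packet size


def traffic_per_interval(
    records: Iterable[PacketRecord],
    node: int,
    interval_ns: int,
    n_intervals: int,
) -> List[float]:
    """Total (incoming + outgoing) traffic of `node` in bit/s per interval."""
    bits = [0] * n_intervals
    for rec in records:
        if node not in (rec.src, rec.dst):
            continue  # packet does not involve this node
        index = rec.timestamp_ns // interval_ns
        if 0 <= index < n_intervals:
            bits[index] += rec.size_bytes * 8
    # Convert bits per interval into bits per second.
    return [b * 1e9 / interval_ns for b in bits]


log = [
    PacketRecord(10_000, 1, 4, 12_500),  # outgoing from node 1, interval 0
    PacketRecord(120_000, 6, 1, 6_250),  # incoming to node 1, interval 1
    PacketRecord(130_000, 2, 3, 64),     # does not involve node 1
]
# Two intervals of 100 microseconds each (an illustrative choice).
print(traffic_per_interval(log, node=1, interval_ns=100_000, n_intervals=2))
```

Plotting the returned list against time would give a traffic pattern of the kind shown for a single processor; restricting the filter to one destination would give the per-destination curves.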
We can see that the traffic at processor 1 rises above 1 Gbit/s just after a context switch, due to the load/store operations from memory, and that other traffic bursts occur during the context. Moreover, the duration of a context is not constant and depends heavily on the scheduling and the interrupts occurring in the system. Our simulation results also indicate that the traffic from one processor is not divided equally among all other processors. There were always one or two destinations that received the majority of the generated traffic, both during one context and over the whole program execution. Those destinations can differ from application to application and between simulation runs, depending on the algorithm and the scheduling policy of the operating system. For the Apache simulation, processor 2 and processor 4 received the majority of the traffic from processor 1, and the total traffic at processor 1 went up to 2 Gbit/s (see Fig. 3). We analyzed the contribution of the two busiest nodes to the total traffic at every processor for Cholesky and Apache 2.0. We found that those two nodes accounted for 80% to more than 90% of the total traffic of one processor node (see Fig. 4), and the busiest node alone could receive from 50% to 76% of the total traffic (see Table 1).
Table 2). A distributed algorithm running on every node computes the routes that packets follow and automatically recalculates the routing tables to incorporate or remove the extra links whenever the topology changes.
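The traffic share of the busiest destinations, and a simple way of turning it into candidates for extra links, can be sketched as follows. This is a minimal illustration under stated assumptions: it considers only the bytes sent by one source node, the tuple layout and function names are hypothetical, and the sample numbers are invented for the example rather than taken from the measurements.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Packet = Tuple[int, int, int]  # (src, dst, size_bytes), hypothetical layout


def destination_shares(packets: Iterable[Packet], source: int) -> Dict[int, float]:
    """Fraction of the bytes sent by `source` that goes to each destination."""
    sent: Counter = Counter()
    for src, dst, size_bytes in packets:
        if src == source:
            sent[dst] += size_bytes
    total = sum(sent.values())
    if total == 0:
        return {}
    return {dst: size / total for dst, size in sent.items()}


def pick_extra_links(
    packets: Iterable[Packet], source: int, max_links: int = 2
) -> List[Tuple[int, float]]:
    """Busiest destinations of `source`, as (destination, share) pairs."""
    shares = destination_shares(packets, source)
    # Sort by descending share; break ties by node number for determinism.
    ranked = sorted(shares.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_links]


# Invented sample: node 1 sends mostly to nodes 4 and 6.
sample = [(1, 4, 600), (1, 6, 250), (1, 2, 100), (1, 9, 50), (3, 1, 999)]
top_two = pick_extra_links(sample, source=1)
print(top_two)                              # [(4, 0.6), (6, 0.25)]
print(sum(share for _, share in top_two))   # combined share of the two busiest
```

Running such a selection on the traffic measured during one interval would yield, for each node, the destinations to which the extra links could be assigned for the next interval.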
To benefit from reconfiguration, the measurement duration Δt and the time for extra link selection and for switching (t_se + t_sw, as shown in Fig. 5) should be much smaller than the average context length; for our study, we set this value to 1 ms.
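The timing condition above can be expressed as a simple check. The sketch below is illustrative only: the function names are hypothetical, the threshold that stands for "much smaller" (10% here) is an assumption, and the sample durations are invented for the example.

```python
def reconfiguration_fraction(
    measure_us: int, select_us: int, switch_us: int, context_us: int
) -> float:
    """Fraction of a context that passes before the extra links are usable."""
    if context_us <= 0:
        raise ValueError("context length must be positive")
    return (measure_us + select_us + switch_us) / context_us


def is_worthwhile(
    measure_us: int,
    select_us: int,
    switch_us: int,
    context_us: int,
    max_fraction: float = 0.1,  # assumed meaning of "much smaller"
) -> bool:
    """True if measuring, selecting and switching fit within the budget."""
    fraction = reconfiguration_fraction(measure_us, select_us, switch_us, context_us)
    return fraction <= max_fraction


# Invented example: 100 us measurement, 50 us selection, 50 us switching,
# and a 10 ms context.
print(reconfiguration_fraction(100, 50, 50, 10_000))  # 0.02
print(is_worthwhile(100, 50, 50, 10_000))             # True
print(is_worthwhile(100, 50, 1_000, 10_000))          # False
```

The remaining fraction of the context is the time during which the extra links actually carry traffic, so a slower switching technology directly reduces the benefit.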
V. CONCLUSIONS
We have shown that there is indeed intensive communication between pairs of nodes after a context switch in a distributed shared memory system, and that only one or two destination nodes receive the majority of that traffic. These results indicate that a reconfigurable network that can add as many as two extra links at each node should be enough to improve communication and to remove congestion and large latencies, resulting in an overall reduction in execution time for parallel applications. The switching time of the components required to implement such a reconfigurable network would need to be smaller than 1 ms, a value that is currently achievable with low-cost optical technologies.
