Abstract: This paper presents an investigation into inter-processor and inter-process communication for real-time computing in multiprocessing systems. A finite difference simulation algorithm of a flexible beam is considered to demonstrate critical inter-processor and inter-process communication issues in real-time computing. Accordingly, issues such as synchronization, sharing of process resources, granularity, scheduling and mapping are explored within the framework of real-time parallel processing and parallel multithreading. The algorithm is also analysed in detail to explore the inherent data dependencies that affect inter-processor and inter-process communication.
INTRODUCTION
Inter-processor and inter-process communications play vital roles in real-time high performance parallel processing and parallel multithreading in multiprocessing systems. Therefore, it is essential to explore the real-time performance of multiprocessing architectures and their suitability for implementing an algorithm. The amount of data, the frequency with which the data is transmitted, the speed of data transmission, latency, and the data transmission route all significantly affect the inter-processor and inter-process communications within the architecture. The first two factors depend on the algorithm itself and how well it has been partitioned. The remaining factors are functions of the hardware. These depend on the inter-connection strategy, whether tightly coupled or loosely coupled. Any evaluation of the performance of the interconnection must be, to a certain extent, quantitative. However, once a few candidate networks have been tentatively selected, detailed (and expensive) evaluation including simulation can be carried out and the best one selected for a proposed application (Hossain, 1995; Tokhi et al., 1997a).
In multithreaded computing, threads reduce overhead by sharing fundamental parts of their process. Because these parts are shared, switching between threads can happen much more frequently and efficiently. Although sharing information is not difficult, sharing with stronger dependencies among threads can degrade the overall computing performance.
Typically, applications that express concurrency requirements with threads need not take into account the number of available processors. The performance of the application improves transparently with additional processors. Numerical algorithms and applications with a high degree of parallelism, such as matrix multiplications, can run faster when implemented with threads on a multiprocessor (Nichols et al., 1996).
When two concurrent procedures communicate, one writing data and one reading data, they must adopt some type of synchronization so that the reader knows when the writer has completed and the writer knows that the reader is ready for more data. Some programming environments provide explicit communication mechanisms such as message passing. The parallel multithreaded concurrent programming environment in this investigation provides a more implicit mechanism. Threads share all global variables. This offers thread programmers plenty of opportunities for synchronization. Multiple processes can use any of the many other UNIX inter-process communication mechanisms, for instance, sockets, shared memory, and messages (Gray, 1998).
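The writer/reader handshake described above can be sketched with threads that share global variables. This is only an illustrative sketch in Python, not the environment used in this investigation; the names `slot`, `done`, `writer`, `reader` and `run_handshake` are hypothetical, and a single-item buffer guarded by a condition variable is an assumed design.

```python
import threading

slot = None                      # shared global: one item, or None when empty
done = False                     # shared global: set after the last item
cond = threading.Condition()     # guards slot and done
received = []

def writer(items):
    """Write items one at a time, waiting until the reader is ready."""
    global slot, done
    for item in items:           # items must not contain None (None means "empty")
        with cond:
            cond.wait_for(lambda: slot is None)   # reader has taken the last item
            slot = item
            cond.notify_all()
    with cond:
        cond.wait_for(lambda: slot is None)
        done = True
        cond.notify_all()

def reader():
    """Read items, waiting until the writer has completed each one."""
    global slot
    while True:
        with cond:
            cond.wait_for(lambda: slot is not None or done)
            if slot is None:     # writer finished and nothing is left to read
                return
            received.append(slot)
            slot = None          # signal that the reader is ready for more data
            cond.notify_all()

def run_handshake(items):
    """Run one writer and one reader and return the items in arrival order."""
    global slot, done
    slot, done = None, False
    received.clear()
    w = threading.Thread(target=writer, args=(items,))
    r = threading.Thread(target=reader)
    r.start()
    w.start()
    w.join()
    r.join()
    return list(received)
```

Each side blocks until the other has acted, so the waiting time of both threads grows with the number of items exchanged. This is the synchronization cost that the text associates with frequent communication.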
Performance of the synchronization mechanism of a multi-processor determines the granularity of parallelism that can be exploited on that machine. Synchronization on a multiprocessor carries a high cost due to the hardware levels at which synchronization and communication must occur (Tullsen et al., 1999). This paper presents an investigation into inter-processor and inter-process communication for real-time high performance computing. A finite difference simulation algorithm of a flexible beam is considered to demonstrate critical communication issues in real-time parallel computing. Inter-processor and inter-process communication issues, for instance, synchronization, sharing of process resources, and hardware and task granularity are explored. The algorithm is also analysed in detail to explore the inherent data dependencies that affect the inter-processor and inter-process communications. Two homogeneous networks comprising (i) nine transputers and (ii) dual Pentium III processors are considered. Finally, the results of the implementations are compared on the basis of real-time parallel processing and multithreading performance, to identify the merits of system designs incorporating fast processing techniques for real-time applications.
INTER-CONNECTION ISSUES
In practice, the most common inter-processor communication techniques for general purpose and special purpose parallel multi-processing are as shown in Fig. 1. These are described below.
Shared memory communication:
This is the most widely used inter-processor communication method. Most commercial parallel computers utilise this type of communication method due to its simplicity in terms of hardware and software. In addition, as most general-purpose microprocessors (CISC or RISC) do not have serial or parallel communication links, these devices can only communicate with each other via shared memory. However, this is one of the slowest inter-processor communication techniques. A major drawback of this technique is the handling of reads and writes to the shared memory. In particular, while one processor (say, P1) reads from or writes to the shared memory, the other processor (say, P2) has to wait until P1 finishes its read/write job. This mechanism causes extremely high communication overhead. In particular, performance degradation increases with higher data and control dependencies due to higher communication overhead.
In real-time parallel computing, synchronization is used to make sure that one event in one thread happens before another event in another thread. In general, co-operation between concurrent procedures leads to the sharing of data, files, and communication channels. This sharing, in turn, leads to the need for synchronization. For instance, consider a program that contains three routines. Two routines write to variables and the third reads them. For the final routine to read the right values, one must add some synchronization. It is to be noted that, by using finer synchronization techniques, processes or threads can spend less time waiting on each other and more time accomplishing the tasks for which they were designed (Nichols et al., 1996).
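The three-routine example can be sketched as follows. This is a minimal illustration in Python, assuming one event per written variable; the names `run_three_routines`, `write_a`, `write_b` and `read_sum`, and the values 3 and 4, are hypothetical.

```python
import threading

def run_three_routines():
    """Two routines write variables; a third reads them only after both writes."""
    shared = {"a": None, "b": None}      # data visible to all three threads
    a_written = threading.Event()        # one event per variable: fine-grained
    b_written = threading.Event()
    out = []

    def write_a():
        shared["a"] = 3
        a_written.set()                  # publish: a is now valid

    def write_b():
        shared["b"] = 4
        b_written.set()                  # publish: b is now valid

    def read_sum():
        a_written.wait()                 # block until a has been written
        b_written.wait()                 # block until b has been written
        out.append(shared["a"] + shared["b"])

    # The reader is started first on purpose: the events, not the start
    # order, guarantee that it sees the right values.
    threads = [threading.Thread(target=f) for f in (read_sum, write_a, write_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]
```

Using one event per variable, rather than one barrier for everything, lets the reader proceed as soon as exactly the data it needs is available.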
From a programming point of view, the major difference between the multi-process and multithreaded concurrency models is that, by default, all threads share the resources of the process in which they exist. Independent processes share nothing. Threads share such process resources as global variables and file descriptors. If one thread changes the value of any such resource, the change will be evident to any other thread in the process that reads it. The sharing of process resources among threads is one of the multithreaded programming model's major performance advantages, as well as one of its most difficult programming aspects. Having this context available to all the threads in the same memory facilitates communication between threads. However, at the same time, it makes it easy to introduce errors in which one thread changes the value of a variable used by another thread in ways the other thread did not expect.
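A common way to guard against such unexpected updates is to protect each shared variable with a lock. The sketch below is illustrative only; `counter`, `counter_lock`, `add_many` and `run_counter` are hypothetical names, and a single mutual-exclusion lock is an assumed choice.

```python
import threading

counter = 0                        # process-wide global, visible to every thread
counter_lock = threading.Lock()    # serialises updates to counter

def add_many(n):
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with counter_lock:         # makes the read-modify-write step atomic
            counter += 1

def run_counter(num_threads, n):
    """Start num_threads threads that each add n to the shared counter."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The lock keeps the result correct, but every increment now waits for the lock, so heavier sharing among threads translates directly into more time spent on synchronization.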
In the case of general-purpose multi-processing machines, the operating system uses its scheduler to select, from the pool of ready and runnable tasks, the one that will run. In a sense, the scheduler synchronizes the tasks' access to a shared resource: the system's CPUs. Neither the multithreaded version of a program nor the multi-process version imposes any specific scheduling requirements on its tasks.
Hardware and task granularity are important issues, which play vital roles in achieving higher performance in real-time parallel processing and multi-processing. Hardware granularity (HG) can be defined as the ratio of the computational performance to the communication performance of each processor within the architecture. Thus, HG = R/C, where R denotes the computational term and C the communication term of this ratio.
When R/C is very low, it is unprofitable to use parallelism. When R/C is very high, parallelism is potentially profitable. A characteristic of fine-grain processors is that they have fast inter-processor communication, and can therefore tolerate small task sizes and still maintain a satisfactorily high R/C ratio. However, medium-grain or coarse-grain processors with slower inter-processor communication will produce correspondingly lower R/C ratios if their task sizes are also small (Nocetti and Flemming, 1991).
In general, fine-grain architectures can perform algorithmic parallelism efficiently, while coarse-grain architectures are more suited to strategies such as functional parallelism, where task sizes are large and inter-processor communication is relatively infrequent.
Task granularity is similar to hardware granularity and can be defined as the ratio of the computational demand to the communication demand of the task. Typically a high compute/communication ratio is desirable. The concept of task granularity can also be viewed in terms of compute time per task. When this is large, it is a coarse-grain task implementation. When it is small, it is a fine-grain task implementation. Although large grains may ignore potential parallelism, partitioning a problem into the finest possible granularity does not necessarily lead to the fastest solution, as maximum parallelism also has maximum overhead, particularly due to increased communication requirements. Therefore, when partitioning the algorithm into subtasks and distributing these across processing elements, it is essential to choose an algorithm granularity that balances useful parallel computation against communication and other overheads (Hossain, 1995; Tokhi et al., 1997b).
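The ratio can be illustrated with a small helper. This is a sketch under the assumption that both demands are measured as times in the same unit; the function name `task_granularity` and the numbers in the usage note are hypothetical and are not results from this paper.

```python
def task_granularity(compute_time, communication_time):
    """Return the compute/communication ratio of a task.

    Both arguments are assumed to be non-negative times in the same unit.
    A task with no communication at all is reported as infinitely coarse.
    """
    if compute_time < 0 or communication_time < 0:
        raise ValueError("times must be non-negative")
    if communication_time == 0:
        return float("inf")
    return compute_time / communication_time
```

For instance, `task_granularity(8.0, 2.0)` gives 4.0, whereas halving the grain to 4.0 units of computation for the same 2.0 units of communication gives 2.0. The finer partition exposes more parallelism but spends a larger share of its time communicating.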
THE BEAM SIMULATION ALGORITHM
Consider a cantilever beam of length L, fixed at one end and free at the other, with a force U(x, t) applied at a distance x from its fixed (clamped) end at time t, resulting in a deflection y(x, t) of the beam from its stationary (unmoved) position at the point where the force has been applied. The motion of the beam in transverse vibration is thus governed by the well-known fourth-order partial differential equation (PDE) (Virk and Kourmoulis, 1988), referred to here as equation (1). A solution of equation (1) can be achieved by using the finite difference (FD) method. This involves a discretisation of the beam into a finite number of equal-length sections (segments), each of length ∆x, and considering the beam motion (deflection) for the end of each section at equally-spaced time steps of duration ∆t. Thus, using first-order central FD methods to approximate the partial derivative terms in equations (1) and (2) yields a discrete relation, equation (3), in which Y_k is a 1 × n matrix representing the deflection of grid-points 1 to n of the beam at time step k, S is a matrix given in terms of characteristics of the beam and the discretisation steps ∆t and ∆x, and λ² is the parameter of the scheme that appears in the stability condition below.
Equation (3) is the required relation for the simulation algorithm, characterising the behaviour of the cantilever beam system, which can be implemented on a digital computer easily. For the algorithm to be stable it is required that the iterative scheme described in equation (3), for each grid point, converges to a solution. It has been shown that a necessary and sufficient condition for stability satisfying this convergence requirement is given by 0 < λ² ≤ 0.25.
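Equations (1) to (3) are not legible in this text. For orientation only, a commonly used form of the beam equation and the explicit scheme that central differences produce for an interior grid point is sketched below. The symbols μ (a beam constant) and m (the beam mass), and the pointwise rather than matrix form, are assumptions and are not taken from the source.

```latex
% Assumed form of the PDE for transverse vibration of a uniform beam
\mu^2 \frac{\partial^4 y(x,t)}{\partial x^4}
  + \frac{\partial^2 y(x,t)}{\partial t^2} = \frac{1}{m}\, U(x,t)

% Central differences at grid point i and time step k
\frac{\partial^2 y}{\partial t^2} \approx
  \frac{y_{i,k+1} - 2y_{i,k} + y_{i,k-1}}{(\Delta t)^2},
\qquad
\frac{\partial^4 y}{\partial x^4} \approx
  \frac{y_{i-2,k} - 4y_{i-1,k} + 6y_{i,k} - 4y_{i+1,k} + y_{i+2,k}}{(\Delta x)^4}

% Resulting explicit update for an interior point
y_{i,k+1} = 2y_{i,k} - y_{i,k-1}
  - \lambda^2 \left( y_{i-2,k} - 4y_{i-1,k} + 6y_{i,k} - 4y_{i+1,k} + y_{i+2,k} \right)
  + \frac{(\Delta t)^2}{m}\, U_{i,k},
\qquad
\lambda^2 = \mu^2 \frac{(\Delta t)^2}{(\Delta x)^4}
```

Under this assumed form, each update reads five values at time step k and one value at time step k-1, and the bound 0 < λ² ≤ 0.25 is the usual stability limit for this explicit scheme.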
To explore the data dependencies of the algorithm, consider the simulation algorithm in equation (3). This can be rewritten for computing the deflection of individual segments, for instance segments 8 and 16, assuming no external force applied at these points. It is noted that computation of the deflection of a particular segment is dependent on the deflection of six other segments. These heavy dependencies could be major causes of performance degradation in real-time parallel computing or multiprocessing due to inter-processor or inter-process communication overheads.
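The effect of these dependencies on communication can be sketched by counting, for each partition of the beam, how many deflection values it must obtain from other partitions at every time step. The sketch assumes the standard five-point central-difference stencil (points i-2 to i+2 at the current step), contiguous partitions, and a stencil that is simply truncated at the beam ends; the function names are hypothetical.

```python
def dependencies(i, n):
    """Grid points (1..n) read at the current step when updating point i."""
    return [j for j in range(i - 2, i + 3) if 1 <= j <= n]

def split(n, parts):
    """Split points 1..n into contiguous blocks of near-equal size."""
    base, extra = divmod(n, parts)
    blocks, start = [], 1
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def remote_reads(n, parts):
    """Per block, the number of values owned by other blocks needed each step."""
    counts = []
    for block in split(n, parts):
        owned = set(block)
        needed = set()
        for i in block:
            needed.update(dependencies(i, n))
        counts.append(len(needed - owned))   # values that must be communicated
    return counts
```

With 20 segments, `remote_reads(20, 2)` gives `[2, 2]`, while one segment per task, `remote_reads(20, 20)`, requires up to four remote values for a single segment update. The computation per task shrinks as the partition becomes finer, but the communication per task does not.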
HARDWARE AND SOFTWARE RESOURCES
To demonstrate the inter-connection issues in multiprocessing environments, two homogeneous computing networks are considered. These are (i) a homogeneous network of nine T805 (T8) transputers and (ii) a dual Pentium III (PIII) processor system.
IMPLEMENTATION AND RESULTS
To explore the real-time performance of parallel architectures, investigations into inter-processor communication are carried out. The performance of the inter-processor communication links is evaluated by utilising a similar strategy for exactly the same data block without any computation during the communication time, i.e. blocking communications. It has been reported earlier that the parallel-to-parallel communication link is the best among the links. The serial-to-serial communication link is better than the shared memory communication link, whereas the serial-to-parallel communication link is the worst among all the interconnections discussed in section 2 (Tokhi et al., 1997a).
To explore the impact of communication overhead and task granularity, the simulation algorithm (20 segments for 20,000 iterations) was implemented on a homogeneous transputer network of nine T8s. The performance of the architecture was investigated by breaking the algorithm into fine grains (one segment as one grain). Considering the computation for one segment as a base, the grains were then computed on a single T8, increasing the number of grains from one to nineteen. Figure 2 shows the real-time performance (i.e. computation with communication overhead) and the actual computation time with one to nine processing elements. The difference between the real-time performance and the actual computation time is the communication overhead. It is noted in Fig. 2 that, due to communication overheads, the computing performance does not increase linearly with the number of processing elements. The performance remains nearly flat with a network of more than six processing elements. Note that the increase in communication overhead is more pronounced at the beginning, with fewer than three processing elements, and that the overhead remains at nearly the same level with more than five processing elements. This is due to the communication overheads among the processing elements, which occur in parallel.
Table 1 shows the communication overhead and the corresponding task granularity of the algorithm for the architecture with respect to the number of processors. It is noted that the overall task granularity of the algorithm for this architecture is not impressive due to significant communication overhead. The task granularity for two processors is the best among all. However, the task granularity continually decreases with an increase in the number of processors up to 7 processors. It is also noted that the T8 architecture achieves an almost similar level of task granularity for more than 6 processors due to similar levels of communication overhead and computing time.
For parallel multithreading, the simulation algorithm was initially implemented for 20 segments and 60,000 iterations. More iterations, and in turn more computing, were considered because the Pentium III (PIII) processor has extremely high processing power. Thus, it was found difficult to measure the computing time precisely for 20,000 iterations. However, the overall communication has not been changed for the higher number of iterations. Figure 3 shows the performance of the dual PIII computing domain in implementing the simulation algorithm using multithreading. It is noted that the impact of inter-process communication overhead (since the synchronization, scheduling and other operating system overheads are due to data transmission among the processors) is significantly higher as compared to the actual computing time. The corresponding communication overhead and task granularity for the architecture with different numbers of threads are given in Table 2. It is noticed that the multithreading performance of the dual PIII processors is even worse than that of sequential computing within a single processor. Fig. 4 shows the execution time for 2 threads with respect to the number of segments.
CONCLUDING REMARKS
An investigation into inter-processor and inter-process communication issues in parallel computing for real-time high performance computing, emphasising thread synchronization, data dependency of the algorithm, scheduling and mapping, has been presented. An investigation into the task granularity due to inter-processor and inter-process communication overhead of the algorithm has also been carried out. This investigation has demonstrated that data dependency is one of the critical issues in multi-processing for real-time high performance computing. It has also been demonstrated that fine-grain task distribution or threading with a higher level of data dependency could cause a significant level of performance degradation rather than enhancement. It has also been shown that the number of processors in a computing domain for the application limits the number of threads.
Although the issues for the two different architectures are not directly comparable due to the variation in their inter-connection strategies, their impact is, however, similar. Thus, it is necessary to explore the suitability of the hardware for an algorithm with heavy data dependencies.
In general, it is noted that synchronization, data dependencies and granularity of the algorithm are critical issues for inter-processor and inter-process communication, and should always be explored beforehand for real-time high performance computing.
