Abstract: Monitoring is fundamental to both debugging and performance analysis. It can provide dynamic execution information for displaying execution states, and statistical data for evaluating the performance of a program. In monitoring parallel programs, a major difficulty arises from the intrusive nature of monitoring activities. This paper describes a new approach, the logical clock approach, which aims to minimize the amount of intrusion in monitoring parallel programs, thus achieving a high transparency.
Introduction
A parallel program usually contains multiple threads which interact with each other in cooperatively accomplishing a common goal, either by message-passing or by shared memory read/write. In the development of parallel programs, debugging and performance analysis tools are indispensable, since:
Parallel programming is considered to be far more difficult than its sequential counterpart. The specification of parallelism, communication, and synchronization is inevitably error-prone. Consequently, deadlocks and contention between processes can easily result.
The arrangement of communication and synchronization, and the process-processor mapping, can have a great impact on performance. Thus, to obtain an efficient parallel program, implementation alternatives must be explored.
Basically, debugging consists of locating, analyzing, and correcting suspected faults. Information about a program's execution is required for either run-time or post-mortem analysis. Performance analysis consists of the evaluation of a program's performance on a specific target machine. Statistics regarding various program performance indices have to be gathered. Therefore, the collection of execution information from a program, referred to as monitoring, is fundamental to both debugging and performance analysis.
Unlike the monitoring of a sequential program, where the program behaviour is generally not affected by the amount of time between any two successive instructions, monitoring a parallel program will unavoidably have effects on the execution of that program. This is referred to as the probe-effect: because of multiple threads of control and the existence of contention or nondeterminism, any attempt to observe the behaviour of the system may modify the condition of such contention or nondeterminism and change that behaviour. Obviously, the effect on the program's execution should be kept as small as possible, so that the information obtained by the monitor does not confuse the programmer's analysis. Otherwise, the very act of monitoring could mask the execution behaviour that should be observed. Thus, we define transparency as being achieved by a monitor when the events that constitute a monitored program's execution are identical both in the presence and absence of the monitor. In this paper, a highly transparent approach to parallel program monitoring is presented.
The degree to which the probe-effect is intrusive will depend on the monitoring approach adopted. There are some monitoring methods, relying on extensive hardware support, which do not noticeably affect the behaviour of the programs being observed (Aspnäs and Långbacka 1991, Wybranietz and Haban 1988). However, such hardware support can be an expensive addition to the machine and is not always feasible. The alternative is to use a software approach, which obtains information using software-only hooks. To reduce the intrusiveness of this method, a fast call is normally necessary: that is, as few operations as possible must be executed for each action which extracts information (Gait 1985, West 1987). Consequently, for both hardware and software approaches, the analysis of a program is usually two-fold: a monitoring phase followed by a post-mortem analysis. During monitoring, information about the program's execution is only recorded. In the post-mortem phase, debugging or performance analysis can be conducted either by analyzing the information collected (Snodgrass 1988) or by replaying the program's execution according to this information (LeBlanc and Mellor-Crummey 1987). However, none of the current hardware or software approaches can easily support the run-time and interactive analysis frequently employed by sequential debugging techniques. Such tools that do exist usually fail to preserve the scheduling semantics of the program.
To support the run-time and interactive analysis of parallel programs, a novel monitoring approach, the logical clock approach, has been developed. It differs from other approaches to parallel program monitoring in three key aspects:
The probe-effect is eliminated by introducing the techniques of logical clock management and communication control.
The high degree of transparency achieved does not rely on the fast operation of monitoring activities.
The proposed approach can keep a high degree of transparency, even though it is implemented by a pure software method.
In the next section, the theoretical background to this approach is introduced, based on Lamport's temporal ordering of distributed systems (Lamport 1978), together with the interpretation of this theory in transparently monitoring parallel programs. Sections 3 to 5 then describe respectively the logical clock approach, its implementation, and an operational study. Finally, conclusions are given in section 6.
Transparent Monitoring
A universal principle is that the future cannot influence the past. This implies that if an event has some effect on another event, then the former can never occur after the latter, i.e., the cause must always precede the effect. This can be very easily captured in a sequential program: because of the single time reference, events can be ordered according to the time of their occurrence. However, it is not an easy task to capture the cause and effect of a parallel program executed on a parallel computer which has no global reference clock. Since the local clock of each processor may start at a different time and advance at a different pace, it is possible that the local time at which a cause event occurs is behind the local time at which the effect event happens!
In the discussion throughout this paper, a distributed memory MIMD architecture is assumed, in which each processor has its own clock. It is also assumed that a parallel program running on this kind of architecture is a collection of processes, each of which consists of a sequence of events and which can only communicate with other processes by message-passing mechanisms. An event is defined as the execution of a single instruction or a set of instructions. The communication can be synchronous or asynchronous. A communication is said to be completed after the message has been transferred from one process to another.
In his paper, Lamport (1978) defines the Happened Before relation, written $E_1 \rightarrow E_2$, on the events of a distributed system. So, if two events are concurrent events, they must occur independently of each other: that is, there is no Happened Before relation between them and the order as to which event occurs first does not matter. Therefore, a partial ordering of all events in a parallel program can be obtained by using the Happened Before relation rather than physical time and is defined as follows:
For each process, there is a counter $C$, which assigns an integer to each event in such a way that:
If $E_1 \rightarrow E_2$ then $C(E_1) < C(E_2)$ (if $E_1$ and $E_2$ are in different processes, e.g., $P_i$ and $P_j$ with clocks $C_i$ and $C_j$, then $C_i(E_1) < C_j(E_2)$).
If $C(E_1) = C(E_2)$, $E_1$ and $E_2$ must be concurrent events (we can simply ignore the ordering of their occurrence).
Generally, the execution of a parallel program can be pictorially represented by the space-time diagram. The execution of a process is represented by a vertical line where the occurrence of events is shown in time order from the bottom to the top (see figure 1). A directed line between the vertical lines represents a communication (from the sending event to the receiving event). Thus, for any events with a cause and effect relation, there must be a directed path in the space-time diagram. Therefore, we can make the following statement:
Assertion 1. If event $E_1$ is the cause and event $E_2$ is the corresponding effect, then $C(E_1) < C(E_2)$.
This assertion says that if we can keep the partial ordering (obtained by $C$) unchanged, the cause and effect relations will not be changed.
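The clock condition above can be illustrated with a small sketch. It is only a minimal illustration of Lamport-style counters, not the monitor described in this paper; the class name `Process` and its methods are hypothetical.

```python
class Process:
    """A process with a Lamport-style counter C (illustrative names only)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        # Each new event gets a larger counter value than the previous one.
        self.clock += 1
        return self.clock

    def send(self):
        # A send is an event; the message carries the sender's counter value.
        self.clock += 1
        return self.clock

    def receive(self, timestamp):
        # The receive is ordered after the send: C(send) < C(receive).
        self.clock = max(self.clock, timestamp) + 1
        return self.clock


p, q = Process("P"), Process("Q")
p1 = p.local_event()   # first event of P
p2 = p.send()          # P sends a message to Q
q1 = q.local_event()   # first event of Q, concurrent with p1
q2 = q.receive(p2)     # Q receives the message sent at p2
```

Here p1 and q1 receive the same counter value and are concurrent, while the cause p2 always has a smaller value than its effect q2.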
In cause and effect terms, the definition of transparency given in section 1 can then be interpreted as: if the cause and effect inherent in the program is kept unchanged, then transparency in monitoring is achieved. From the above discussion, we have:
Assertion 2. Transparent monitoring is obtained by preserving the partial ordering of events.
So, different executions of the same parallel program will have the same behaviour provided that they present the same partial ordering of events. Now the basic building block of a parallel program is the sequential process, and it is the interprocess communication that distinguishes the execution of processes in a parallel program from that of a sequential program. Thus, if we can correctly preserve the ordering of communication events (i.e., sending and receiving of messages) for each process, cause and effect in the entire program will be kept unchanged (assuming there is no internal nondeterminism existing in the program). This can be seen from the space-time diagram: if the ordering of sending and receiving of messages is kept unchanged for each vertical line, the directed lines which connect the vertical lines must also be kept unchanged and the diagrams will have the same structure. So, assertion 2 can be specified in a relatively weaker condition:
Assertion 3. Transparent monitoring is obtained by preserving the ordering of communication events on each process.
The goal of our monitoring approach is therefore to keep the communication ordering on each process the same as that which exists without monitoring.
The Logical Clock Approach
If the execution of a process in a parallel program is delayed by the monitoring activities (for instance, a communication that is delayed because it is probed by the monitor), the communication sequence may be changed, and thus the cause and effect relation in the original program may be lost. For example, in figure 1, originally process Q receives a message from process P first, then from process R, illustrated as arrows with dashed lines in the figure. Thus, event p1 could be the cause of event q2. Because of the delay introduced by the monitor, process P might become ready to communicate at a later time $T'_{p2}$ (see footnote 1) than that without monitoring, $T_{p2}$, so that $T_{p2} < T_{r2} < T'_{p2}$. The monitoring delay on P is $D = T'_{p2} - T_{p2}$. Therefore, because $T_{r2} < T'_{p2}$, process Q will receive the message from process R first, instead of process P. This is illustrated as arrows with solid lines in figure 1. The original cause and effect is lost: event r1 rather than event p1 may become the cause of event q2.
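The reordering described above can be reproduced with a few lines of arithmetic. The following is a sketch under assumed numbers: the ready times and the delay are invented for illustration, and the helper `receive_order` is hypothetical.

```python
def receive_order(ready_times):
    """Return sender names in the order in which a receiver that always
    serves the earliest-ready sender would accept their messages."""
    return [name for name, _ in sorted(ready_times.items(), key=lambda kv: kv[1])]


T_p2, T_r2 = 10.0, 12.0      # assumed ready times of P and R without monitoring
D = 5.0                      # assumed monitoring delay on P
T_p2_monitored = T_p2 + D    # T'_p2 = T_p2 + D, so T_p2 < T_r2 < T'_p2

unmonitored = receive_order({"P": T_p2, "R": T_r2})
monitored = receive_order({"P": T_p2_monitored, "R": T_r2})
```

With these values Q accepts P before R without monitoring, and R before P once P has been delayed by D.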
In order to achieve transparency, it is necessary to maintain the same communication sequence as that which exists without monitoring (assertion 3). It is clear that in order to compensate for the delay on process P, the vertical lines of processes Q and R must also be stretched so that Q can once again receive the message from process P prior to that from R. In other words, the execution of processes Q and R must also be artificially delayed by the amount of time D. This is shown in figure 2. However, it is clear that the artificial delay introduced depends on the amount of monitoring delay incurred on a particular process. Thus, it may seem that postponing the execution of processes cannot be easily managed without a central control mechanism. Schiffenbauer (1981) exploited this idea in the design of a debugger.
An alternative way of hiding the effect of the monitoring delay is to introduce the concept of a logical clock which stops when the monitoring operation is being conducted. The motivation for introducing logical clocks into monitoring is to have each process believe that it is executing in real time. In other words, the process will perceive that all events, both internal and external, are occurring at the same logical time with monitoring as they would do in real time without monitoring, so that all events in a process can happen in the same order whether the process is running without or with the monitor. As a result, transparency in monitoring is achieved.
1 For convenience of description, process Q's real time clock RT is used as the reference for the times of event occurrences.
Both Schiffenbauer's method and the logical clock approach attempt to capture and preserve the temporal relations between events. In order to achieve transparency, both monitoring strategies are based on the partial ordering theory developed from Lamport's Happened Before relation (assertion 3). The advantage of the logical clock approach is that it avoids the need for a central controller: instead a distributed monitoring system may be adopted with a logical clock monitor on each processor of the parallel architecture.
Intuitive Description of the Logical Clock Approach
The monitor is designed as a program that interacts with the monitored processes in order to get information about process state and inter-process communication. This interaction can be implemented either by hardware (e.g., hardware trap and interrupt) or by software (e.g., code insertion) hooks, which are tightly integrated with certain events in the monitored processes. Thus, the interaction provides the monitor with the capability of inspecting interesting events and extracting the information about what is happening in the monitored parallel program.
During monitoring, a logical clock is maintained for each process, which reflects the real-time behaviour of that process when it runs without monitoring. Moreover, the logical clock renders the monitoring delay invisible in that it stops whenever the process's execution is interrupted by the monitoring activities, for instance, upon breakpoint or during information extraction. Therefore, although the real-time execution of a process is slowed down by the monitoring activities, it is not changed as measured using logical time. Each process is now executing according to logical time, reading its own logical clock that is unrelated to the logical clock of any other process.
Figure 2: Suspension of the Execution
In order to keep the communication ordering unchanged, it is not enough only to introduce the logical clocks. The monitor also has to control the occurrence of inter-process communication so as to preserve the original communication ordering. To achieve this, the monitored process has to inform the monitor (by monitoring hooks), with a timestamped message, every time it is ready to communicate, then wait for the monitor to permit that communication. The monitor makes the decision as to which inter-process communication should happen next based on logical time, rather than real time. It delays the occurrence of a communication if it is aware that there is another possible candidate process for that communication which is running in an earlier logical time. In this way, the communication is prevented from occurring either too early in logical time or too late in real time. The monitor uses the following principle in permitting the communication:
For each process, what should happen at real time T without the monitor now happens at logical time T with the monitor.
The logical clock approach is illustrated in figure 3, where $LT_p$, $LT_q$ and $LT_r$ are the logical times of processes P, Q and R respectively (see footnote 2), and $RT_p$, $RT_q$ and $RT_r$ are the real times. As with the situation in figure 1, process P is delayed by a period of time, $D = T'_{p2} - T_{p2}$, before it executes p2. Since the logical clock is stopped during the monitoring delay, the logical time at which p2 occurs is the same as that without monitoring, that is, $T_{p2}$. Now the monitor controls process Q so that it selects its communication partner according to logical time rather than real time. Considering logical time, Q thinks process P is ready to send it a message prior to process R. Therefore, although R is ready to send a message to Q earlier than P in real time, Q still receives the message from process P rather than R.
2 Here process Q's logical time $LT_q$ is used as the reference for the time of event occurrence.
Since the logical clock approach relies on communication control to preserve the partial ordering, and the logical clocks hide the timing effect of monitoring activities, transparency can be achieved no matter how seriously the monitor slows down the program's execution. Therefore, the logical clock approach can be used for implementing a run-time, interactive, visual debugger or performance analyzer. Although interactive monitoring may slow down the program's execution by many orders of magnitude, there is very little disturbance of the execution time if it is measured using logical time. In this way, logical clock monitoring can easily be used to collect performance information about parallel program characteristics, such as speedup and channel utilization.
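The selection rule of this section, choosing the communication partner by logical rather than real time, can be sketched as follows. The function names and the numeric values are assumptions made for illustration; logical time is modelled simply as real time minus the monitoring delay accumulated so far.

```python
def logical_time(real_time, accumulated_delay):
    # The logical clock stops during monitoring, so it lags real time
    # by exactly the accumulated monitoring delay.
    return real_time - accumulated_delay


def select_sender(candidates):
    """candidates maps a sender name to (real ready time, accumulated delay).
    The sender with the earliest logical ready time is selected."""
    return min(candidates, key=lambda name: logical_time(*candidates[name]))


# P was delayed by 5 time units and is ready at real time 15;
# R was not delayed and is ready at real time 12.
candidates = {"P": (15.0, 5.0), "R": (12.0, 0.0)}
chosen = select_sender(candidates)
```

Although R is ready earlier in real time, P is ready at logical time 10 and is therefore selected, as in figure 3.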
Algorithm
A monitoring hook behaves like a breakpoint, which can be inserted into the program wherever the user attempts to extract information. An event cannot occur until the monitoring hook which reports the occurrence of that event completes its action. The action of a monitoring hook is said to be completed when the monitor resumes the execution of the process from which information is extracted. Without loss of generality, the monitoring hook which extracts information for an event E is defined as:
send request and extract information;
wait permission;
E
According to assertion 3, to achieve transparency, we only need to preserve the communication ordering on each process. Thus, for non-communication events, the monitor will simply adjust the corresponding logical clock in order to hide the monitoring delay. However, for communication events, the monitor also needs to impose control on their occurrence, as described below.
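The three steps of the hook (send request and extract information, wait for permission, then E) can be sketched with two threads. This is an illustrative sketch only: it uses Python threads rather than occam processes, and the names `hook`, `monitor` and `process_p` are hypothetical. The monitor here grants permission immediately, which corresponds to the non-communication case.

```python
import queue
import threading

requests = queue.Queue()   # channel from monitored processes to the monitor


def hook(process_name, event_name, info):
    """Send a request with the extracted information, then block
    until the monitor grants permission."""
    permission = threading.Event()
    requests.put((process_name, event_name, info, permission))
    permission.wait()


def monitor(n_requests, log):
    """Record each request and grant permission immediately."""
    for _ in range(n_requests):
        process_name, event_name, info, permission = requests.get()
        log.append((process_name, event_name, info))
        permission.set()


def process_p(trace):
    hook("P", "E", {"x": 1})   # monitoring hook for event E
    trace.append("E")          # event E itself, after permission


log, trace = [], []
m = threading.Thread(target=monitor, args=(1, log))
p = threading.Thread(target=process_p, args=(trace,))
m.start()
p.start()
p.join()
m.join()
```

Event E is appended to `trace` only after the monitor has logged the request and released the process.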
When a parallel process is running together with the monitor, the logical clock of the process $P_i$ is defined as $t = L_i(T)$, where $t$ is the logical time and $T$ is the real time. In the following, $Hstart(E)$ is used to refer to the real time at which the information extraction for event E starts, and $Hend(E)$ is used to refer to the real time at which the occurrence of the event E is permitted. The monitoring algorithm, therefore, can be informally described as follows:
1 Each process $P_i$ has a logical clock $L_i$ which is initially set equal to the real time $T_0$ at which $P_i$ starts to execute, i.e., $L_i(T_0) = T_0$.
2 If event E of process $P_i$ is not a communication event, then permit the occurrence of E immediately after completing the information extraction. The logical clock of $P_i$ is reset [...]
3 [...] $E'$ is the sending event, and the logical arrival time of the message is $LA_j$, then $P_i$'s logical clock is reset to $\max(LA_j, L_i(Hstart(E)))$ and $P_j$'s logical clock is reset to $L_j(Hstart(E'))$.
4 If event E of process $P_i$ is a receiving event, and its occurrence depends on the execution of other processes $P_1, P_2, \ldots, P_k$, in which $P_{r1}, P_{r2}, \ldots, P_{rm}$ are ready to send messages to $P_i$, while $P_{b1}, P_{b2}, \ldots, P_{bn}$ are not (where $m + n = k$, and for asynchronous communication, it is assumed there is no monitoring delay introduced on message transmission, and that $LA_{ri}$ ($i = 1, \ldots, m$) corresponds to the logical arrival time of the message from process $P_{ri}$) [...]
In the logical clock approach, the monitor not only acts as a collector of information from the monitored parallel program, but also needs to have control over its execution. It requires the following:
the ability to keep a logical clock for each parallel process;
the ability to get an accurate timing of any monitoring activity;
the ability to trap any inter-process communication;
the ability to control the occurrence of an inter-process communication.
These capabilities could be directly provided by modifying the machine architecture and integrating it with special monitoring hardware. However, the provision of hardware support is expensive and heavily depends on a particular machine architecture. Thus, a software approach is adopted in our implementation, that is, the program is modified by inserting the necessary hooks for the monitor to carry out the above functions at run-time. This avoids the need for special modification of the underlying machine architecture, and therefore enhances the portability of the monitor.
The occam language concerns itself with the time dimension in a far more profound way than do most conventional programming languages, with the issues of concurrency and synchronization tackled in its deep structure (Pountain and May 1987). This makes it a real challenge to apply the logical clock approach. We have therefore implemented the logical clock approach in monitoring occam programs on transputer networks (Inmos 1988b). As the implementation is one which relies entirely on software monitoring hooks, rather than on any special hardware support, the approach is applicable to parallel architectures other than transputers.
Logical Clocks, Monitoring Hooks, and Monitors
Logical clocks are implemented by maintaining a logical clock "register" for each process which accumulates the delay caused by the monitor. (The register is simply a software counter, i.e., a 32-bit integer variable for T4 and T8 transputers.) Therefore, the logical time of the occurrence of an event in process P is the real time at that moment minus the value of the logical clock register for P. Every time an interaction occurs between the monitor and the monitored processes, the logical clock registers are updated. They are also recalculated when a process starts to execute or terminates. These actions are referred to as logical clock updates. As discussed earlier, the monitor and the monitored processes are coordinated using software hooks inserted in the source of the monitored program. A monitoring hook announces the occurrence of an event and contains a wait mechanism to stop the currently running process until it is woken up by the monitor. It has the following general form:
...
event, E
...
is changed to:
...
send request and extract information
wait for permission
event, E
...
The monitored process first sends a request and the extracted information for the event to the monitor, and then stops executing and waits for permission from the monitor. The monitor permits the process to continue to execute according to the type of event reported (this is referred to as permission control).
Sending of requests and waiting for permissions could be implemented either by occam channel communication (high level) or by transputer instructions (low level). The low level implementation (Cai 1991) makes it possible to insert monitoring hooks either into the object code by the compiler (e.g., as in Harter and Heimbigner (1985) ), or into the executable code by the monitor when it is running together with the monitored program (e.g., as in d'Acierno et al. (1990)). The low-level method avoids source code transformation and recompilation, and can also make the monitor more language-independent.
Logical clock update and permission control are conducted by the individual monitors that make up the monitoring system. There is one Logical Clock Monitor (LCM) for each transputer (see figure 4), which is responsible for monitoring internal activities (e.g., channel communication). All monitors that comprise the monitoring system have equal status: no monitor dominates another. The inter-processor (link) communication is monitored by cooperation between neighbouring monitors. The whole monitoring system works in a distributed fashion.
The functional structure of a monitor is illustrated in figure 5.
Logical Clock Update and Permission Control
In logical clock monitoring, transparency relies on the correct control of inter-process communication in the monitored parallel program and the accurate maintenance of logical clocks. The monitor aims to let the monitored parallel program simulate the execution behaviour it would have had when running without interruption by the monitoring activities: every decision made by the monitor is based on the time read from logical clocks. In this section, we apply the general algorithm given in section 3.2 to the monitoring of occam programs.
Initially, the logical clock register for each process is zero (case 1 of the algorithm). For non-communication events, permission to continue can be given immediately after the monitor receives the request (case 2). The logical clock update is illustrated in figure 6, where it is assumed that non-communication event E of process P occurs at time $t$ without monitoring. In logical clock monitoring, if $T_e$ is the real time of requesting the occurrence of event E, the logical clock register of P at this moment is $D_P = T_e - t$. After the monitor permits the execution of event E at real time $T_p$, $D_P$ is updated to:
$D_P = D_P + (T_p - T_e) = T_p - t$
When event E actually happens, the logical time is:
$T_p - D_P = T_p - (T_p - t) = t$
With deterministic communication, there is only one sender and one receiver, which are destined to communicate with each other. So, deterministic communication cannot change the event ordering. The monitor will resume the execution of both sending and receiving processes whenever it finds that both of them are waiting for permission (case 3 of the algorithm).
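The register update for a non-communication event can be checked numerically. The class below is a minimal sketch, assuming integer time values; the class name `LogicalClockRegister`, its methods and the chosen numbers are hypothetical and are not taken from the occam implementation.

```python
class LogicalClockRegister:
    """Accumulates the monitoring delay D_P of one process."""

    def __init__(self):
        self.delay = 0   # initially zero (case 1 of the algorithm)

    def logical_time(self, real_time):
        # Logical time is real time minus the accumulated delay.
        return real_time - self.delay

    def hook(self, request_time, permit_time):
        # Add the time spent in the hook (T_p - T_e) to the register and
        # return the logical time at which the event then occurs.
        self.delay += permit_time - request_time
        return self.logical_time(permit_time)


reg = LogicalClockRegister()
reg.delay = 7    # D_P = T_e - t, with assumed T_e = 107 and t = 100
t_event = reg.hook(request_time=107, permit_time=112)   # assumed T_p = 112
```

After the hook, the register holds $T_p - t = 12$ and the event occurs at logical time 100, the time at which it would have occurred without monitoring.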
In occam, communication is implemented using a synchronous protocol, in which either the receiver or sender must wait for the other to become ready to communicate. This wait time, 
Permitting ALT Communication
In permitting non-deterministic communication, in which a receiving process may expect messages from more than one sending process, the monitor has to wait until it is safe to allow the communication to occur (see case 4 of the algorithm), that is, after the process has been granted permission to receive a message, there will be no other message for that process with an earlier logical time.
Non-deterministic communication is provided in occam by the ALT construct, which is composed of a number of guarded processes, for example:
A guarded process is a guard followed by an accompanying indented process. A guard consists of an input process (written with a ? symbol, or SKIP if no input is required), or the conjunction of this simple guard with a conditional expression. A guard is ready if the process on the other end of the channel is ready to output and the conditional expression is TRUE. For convenience of description, we define:
The ALT Process is the process containing the ALT construct currently being considered.
An ALT Sending Process is a process which outputs on a channel in an ALT guard. An ALT sending process is called a ready sending process if it has already issued a sending request for that channel, and an ALT sending process is called a busy sending process if it has not yet issued the sending request.
In executing an ALT, first the guards in the ALT have to be evaluated. Then, if there is at least one guard that is ready, input on the selected guard will be conducted and the guarded process will be executed. Otherwise, the ALT process is descheduled. It will be woken up and rescheduled to the end of the active process queue when there is a process ready to output on a channel of one of the input guards. When the rescheduled ALT process reaches the front of the active process queue, the guards are re-evaluated, input on the selected guard will be conducted and the corresponding guarded process will be executed. Both Mitchel et al. (1990) and Inmos (1988a) describe in detail how the ALT construct is implemented on transputers.
According to the ALT implementation on transputers, case 4 of the algorithm can be interpreted as follows: the monitor first decides whether it is safe to select a ready sending process (i.e., there are no busy sending processes which can possibly be selected by the ALT construct). If it is safe, the ALT process is then rescheduled onto the active process queue. Otherwise, the ALT process is descheduled until it is able to make the decision safely (the alt-start step). When the rescheduled ALT process gains the CPU, a ready sending process will be selected for the ALT communication (the alt-end step). So, an ALT construct is transformed to:
alt-start
alt-end
ALT construct
(Details about the alt-start and alt-end steps can be found in Cai (1991).) After granting permission, the monitor will update the logical clocks of both the ALT process and the selected ready sending process according to the algorithm given in the last section.
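The safety test made in the alt-start step can be sketched as follows. This is an illustrative reading of the rule, not the transputer implementation: the function name `safe_to_select` is hypothetical, and treating a busy sender whose logical time equals that of the candidate as safe is an assumption.

```python
def safe_to_select(ready, busy):
    """ready maps each ready sending process to the logical arrival time of
    its message; busy maps each busy sending process to its current logical
    time. Returns the sender to select, or None if it is not yet safe."""
    if not ready:
        return None
    candidate = min(ready, key=ready.get)
    # A busy sender whose logical time is earlier than the candidate's could
    # still produce a message with an earlier logical time, so wait for it.
    if all(busy_time >= ready[candidate] for busy_time in busy.values()):
        return candidate
    return None


blocked = safe_to_select({"P1": 20}, {"P2": 15})
permitted = safe_to_select({"P1": 20, "P4": 18}, {"P2": 25})
```

In the first call the busy sender P2 is behind in logical time, so the ALT process must stay descheduled; in the second call P4 can be selected safely.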
The permitting of ALT communication relies on the logical times of the ALT sending processes. In the case where the logical time of a busy sending process is less than that of the ready sending processes, the monitor cannot permit the communication. Furthermore, the logical clock of a process can only be advanced after the monitor permits the occurrence of the requested event. Therefore, deadlock may occur. Consider the example shown in figure 9, in which process $P_3$ contains an ALT construct which is waiting for input from processes $P_1$ and $P_2$. $P_1$ is ready to output to $P_3$, but $P_2$ is not. It is waiting to receive a message from $P_0$, while $P_0$ is waiting to send a message to $P_1$. It is assumed that $P_2$'s logical time ($L_2$) is less than the logical time of $P_1$ ($L_1$). Thus, according to the logical clock approach, the monitor cannot permit $P_3$ to input from $P_1$, even though both are ready to communicate. Deadlock occurs: $P_3$ is waiting for $P_2$ to advance its logical time, while $P_2$ is waiting to receive from $P_0$, $P_0$ is waiting to send to $P_1$, and $P_1$ is waiting to send to $P_3$. The cause of this deadlock is very similar to that which arises in conservative parallel discrete-event simulation (PDES), with logical time having the role of simulation time. Most deadlock avoidance algorithms developed for PDES (such as Chandy and Misra (1979) or Cai and Turner (1990)) can therefore be easily adopted to avoid this deadlock situation in logical clock monitoring.
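The circular wait in this example can be made explicit with a small wait-for graph. The sketch below only detects the cycle; it is not one of the PDES deadlock avoidance algorithms cited above, and the helper `find_cycle` is hypothetical.

```python
def find_cycle(waits_for, start):
    """Follow single wait-for edges from start and return the list of
    processes forming a cycle, or None if the chain ends."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None
    return path[path.index(node):]


# The situation of figure 9: P3 waits for P2 to advance its logical time,
# P2 waits to receive from P0, P0 waits to send to P1, P1 waits to send to P3.
waits_for = {"P3": "P2", "P2": "P0", "P0": "P1", "P1": "P3"}
cycle = find_cycle(waits_for, "P3")
```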
Monitoring Real-Time occam Programs
The correctness of a real-time system depends on its behaviour with respect to timing. It is either closely coupled to a process existing in the real world or restricted by real-time constraints (such as timeouts). In occam, access to the real-time clock is provided and delay can be introduced by waiting for the real-time clock to reach a stated value. For example, a simple timer process is constructed as follows:
The logical clock approach can be used to monitor real-time distributed systems. In monitoring such systems, neither the inter-process relationships inherent in the system nor the real-time constraints imposed by the real world or by the distributed system itself should be violated. The main problem in applying the logical clock approach in monitoring real-time systems is to coordinate the logical time with the real time in such a way that these constraints are not broken. Since the logical time of each process now plays the same role as that of the real time when running without the monitor, it is possible to replace the real-time constraints with logical time constraints, that is, whenever time is needed, it is read from the logical clocks instead of real-time clocks. So, for the simple timer process, it is necessary to read the logical time of that process at the moment when the real-time clock is actually read. The simple process is therefore transformed into:
As for applications which depend on a real-time target to provide real-time data, the pace of execution of the application program is bound to that imposed by the real-time target. Any slowing down of the execution of this program may also affect data acquisition in the real-time target, resulting in the required data being lost or corrupted. The time at which acquisition of a data item is requested when the system runs with the monitor may fall behind the actual acquisition time, as the real-time target cannot be monitored! There are two possible ways to overcome this problem. One solution is to isolate the application program from the real-time target and set up a simulation environment for artificially supplying test data to the application. The test data supply can be slowed down to couple properly with the execution of the monitored application.
An alternative is to introduce a virtual target (Plattner 1984), which can lag behind the real-time target by inserting buffers between them. Data acquisition is then achieved by the virtual target and the monitored application.
A disadvantage of the first solution is that building such a simulation environment is not an easy task and may even be impossible in some cases. Moreover, software errors may sometimes manifest themselves only when running with real-time data. The second approach is therefore adopted in the implementation (Cai and Turner 1989).
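The buffering idea behind a virtual target can be illustrated with a short Python sketch. It is not the implementation of Cai and Turner (1989): the class and method names (`VirtualTarget`, `acquire`, `read`, `lag`) are hypothetical, and the buffer is assumed to be unbounded.

```python
from collections import deque


class VirtualTarget:
    """Hypothetical FIFO buffer between a real-time target and a slowed application."""

    def __init__(self):
        self._buffer = deque()

    def acquire(self, item):
        # Called at the real-time target's own pace; it never waits for the application.
        self._buffer.append(item)

    def read(self):
        # Called by the monitored application whenever it is ready.
        # Returns None if nothing has been acquired yet.
        return self._buffer.popleft() if self._buffer else None

    def lag(self):
        # Number of items by which the application trails the real-time target.
        return len(self._buffer)
```

Because items leave the buffer in the order in which they were acquired, the monitored application sees the same data sequence as the unmonitored one, only later in real time.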
Operational Study
In this study, an analysis of transparency is conducted by comparing the logical clock approach with the probe approach (West 1987), a software approach for monitoring occam programs on transputers. The general idea of the probe approach is to insert a probe process between any two communicating processes in order to trap any inter-process communication. This is illustrated by figure 10, in which the program in (a) becomes the one in (b) when the communication by channel C is probed.
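The structure of the probe approach can be sketched in Python as follows. This is an illustration only, not West's implementation: the function names are hypothetical, and Python's buffered `queue.Queue` stands in for occam's synchronous channels, so the timing behaviour is not that of a transputer.

```python
import queue
import threading


def probe(chan_in, chan_out, trace, stop):
    """Forward every message from chan_in to chan_out, recording it in trace."""
    while True:
        msg = chan_in.get()
        if msg is stop:
            chan_out.put(stop)
            return
        trace.append(msg)   # monitoring work done between the two processes
        chan_out.put(msg)


def run_probed(messages):
    stop = object()                      # sentinel marking the end of the stream
    c_in, c_out = queue.Queue(), queue.Queue()
    trace = []
    t = threading.Thread(target=probe, args=(c_in, c_out, trace, stop))
    t.start()
    for m in messages:                   # sending process writes to the probe
        c_in.put(m)
    c_in.put(stop)
    received = []
    while True:                          # receiving process reads from the probe
        m = c_out.get()
        if m is stop:
            break
        received.append(m)
    t.join()
    return received, trace
```

Every message now passes through an extra process, so any time the probe spends recording or displaying a message directly delays the communication it observes.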
Results are presented on monitoring a multiple producers and consumer program and a real-time distributed system. The rationale for this selection is:
A communication race can very easily be disturbed by the monitoring activities. Thus, in the presence of such races, it is very difficult to ensure the transparency of monitoring;
It is generally believed that it is very difficult to monitor a real-time distributed system because of the intrusive nature of monitoring and the real-time system's time-critical requirements.
Monitoring a Communication Race System
The multiple producers and consumer (MPC) program (Kerridge 1987) is shown in figure 11. In this program, each producer may send a message to the consumer after a period of time. The consumer accumulates the number of messages sent by producer 1 in the time interval between receiving two successive messages from producer 0. So, the results for this system consist of a sequence of numbers recorded by the consumer. For example, if the consumer records n1, n2, ..., it means that it has received n1 messages from producer 1 before getting the first message from producer 0, n2 messages from producer 1 between getting the first and the second message from producer 0, and so on.
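The consumer's bookkeeping can be made concrete with a small Python sketch. The function name and the encoding of an arrival sequence as a list of producer numbers are assumptions made for illustration; the original program is written in occam.

```python
def consumer_counts(arrivals):
    """Return n1, n2, ...: messages from producer 1 seen before each message from producer 0.

    `arrivals` lists the producer number (0 or 1) of each message in the
    order in which the consumer accepted it.
    """
    counts = []
    pending = 0
    for producer in arrivals:
        if producer == 1:
            pending += 1
        elif producer == 0:
            counts.append(pending)   # close the current interval
            pending = 0
    return counts
```

For the arrival order 1, 1, 0, 1, 0, 0 the function returns `[2, 1, 0]`. Messages from producer 1 that arrive after the last message from producer 0 are not reported. Two runs are transparent with respect to each other in this sense when they yield the same sequence.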
Since processes in a parallel program can be executed concurrently, either on different processors, or on a single processor in a time-shared manner, the situation in which the producers compete with each other in order to communicate with the consumer will occur. The resolution of the race between producers may depend on a processor's load and the amount of network traffic. It is obvious that the monitoring activities will change the processor's load and delay the execution of the monitored processes, and therefore alter the condition of races. However, the figures in table 1 demonstrate that the logical clock approach can maintain a relatively high transparency in monitoring the program, whether it is executed on a single processor or a network of processors.
The results shown in table 2 are obtained when a deliberate delay is introduced into the processing speed of the monitor. The postpone times shown represent a range of delay times that might be caused by processing or displaying the monitoring information. These results indicate that the high transparency achieved by the logical clock approach does not rely on the fast operation of monitoring activities. By contrast, the transparency of the probe approach decreases dramatically as the time of operation required by the monitor increases. Thus, without decreasing transparency, the logical clock approach can be used to display graphically the execution information at run-time.
Monitoring a Real-Time Application
This system (Dowsing 1988), which consists of multiple readers and writers (MRW), is shown in figure 12. Several writers compete for sending their books (or whatever) to the readers, and several readers also compete for receiving these books. Whenever a writer finishes writing a book, he will pass the book to the readers through a set of buffers in a FIFO pipeline. Readers also have to get books from the pipeline buffer. The pipeline buffer is the only medium between writers and readers, and is thus shared by all writers. The information recorded is: from which writer the reader gets the book. For example, if the result for reader 2 (R2) is W1, it means that he gets the book from writer 1.
The results calculated by the consumer are used to measure the competition between producer 0 and producer 1. The logical clock and probe approaches are compared in monitoring this system. The display in the table means that the monitor displays the monitoring information during execution, while non-display means that the monitor only records the information during execution without displaying it at run-time.
Suppose the speed of a writer in completing a book depends on the real time, that is, after a fixed time period the writer will put a book into the shared pipeline buffer:
    clock ? now
    clock ? AFTER now PLUS delay   -- clock delay
    to.M ! book
If the monitor runs together with the processes of this system, it will change the race conditions for both writers and readers as well as the real-time constraints, and therefore could affect the final results. However, in the logical clock approach, all real-time constraints are replaced by logical time constraints. The results in table 3 show that logical clock monitoring neither violates the real-time constraints inherent in the system nor, as a consequence, disturbs the race conditions; transparency is therefore achieved.
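The effect of replacing a real-time delay by a logical-time delay can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the paper: the `LogicalClock` class, its methods and `writer_send_times` are hypothetical names, the clock is a plain integer tick counter, and the wrap-around arithmetic of occam timers is ignored.

```python
class LogicalClock:
    """Hypothetical per-process logical clock, counted in ticks."""

    def __init__(self, start=0):
        self.time = start

    def read(self):
        # Stands in for reading the real-time clock (clock ? now).
        return self.time

    def wait_until(self, deadline):
        # Stands in for a timed wait (clock ? AFTER ...): logical time moves
        # to the deadline if it has not been reached yet. Time spent in the
        # monitor is never added to the logical clock.
        self.time = max(self.time, deadline)


def writer_send_times(clock, delay, books):
    """Logical times at which a writer would output each of its books."""
    times = []
    for _ in range(books):
        now = clock.read()
        clock.wait_until(now + delay)
        times.append(clock.read())
    return times
```

With a delay of 100 ticks, `writer_send_times(LogicalClock(), 100, 3)` gives `[100, 200, 300]` however long the monitor takes in real time, which is why the relative order of competing writers can be preserved.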
Performance Discussion
When a program runs together with the monitor, its execution is slowed down in real time, especially if execution information is displayed graphically at run-time. For example, the multiple producers and consumer program, with a loop delay on producer 0 of 50000 and on producer 1 of 25000, takes only 24374 transputer ticks to finish when running without the monitor. When it is monitored by the logical clock approach, even without displaying the information, it takes 81767 transputer ticks to complete (here, one transputer tick is 64 microseconds).
However, since the logical time can reflect the real-time behaviour when the program runs without monitoring, the logical time can be used rather than real time to measure the program's execution. Again the same multiple producers and consumer system is monitored by the logical clock approach, both with and without the display of information. The execution takes 24393 transputer ticks to complete in logical time (in both cases), which is only a few transputer ticks (19) different from the value obtained when the program runs without monitoring. So, from the logical time point of view, the performance of the program is affected only very marginally by logical clock monitoring.
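The figures quoted above can be checked with a few lines of Python. The variable names are chosen for illustration; the tick counts and the tick length of 64 microseconds are those given in the text.

```python
TICK_US = 64  # one transputer tick in microseconds, as stated in the text

unmonitored_ticks = 24374         # real time, no monitor
monitored_real_ticks = 81767      # real time, logical clock monitor, no display
monitored_logical_ticks = 24393   # logical time under the monitor


def ticks_to_seconds(ticks):
    return ticks * TICK_US / 1_000_000


slowdown = monitored_real_ticks / unmonitored_ticks
logical_error_ticks = monitored_logical_ticks - unmonitored_ticks
logical_error_pct = 100 * logical_error_ticks / unmonitored_ticks
```

The unmonitored run thus lasts about 1.56 s and the monitored run about 5.23 s of real time, a slowdown by a factor of roughly 3.35, whereas the run length measured in logical time deviates by 19 ticks, or about 0.08%.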
Conclusions
The development of parallel architectures places large demands on the tools for parallel program debugging and performance analysis. This paper has contributed a new approach, the logical clock approach, to the monitoring of parallel programs.
Based on Lamport's partial ordering theory, the algorithm has been designed independently of any particular implementation. Its feasibility has been demonstrated by monitoring occam programs on transputers. In practice, total transparency can never be achieved; it is only a theoretical concept. However, the operational study has indicated that the logical clock approach can have a very high degree of transparency, even when monitoring parallel programs with communication races, or real-time distributed systems.
Since the logical clock approach relies on communication control to preserve the partial ordering, and the use of logical clocks to hide the timing effect of monitoring activities, transparency can be achieved no matter how seriously the monitor slows down the program's execution. Therefore, the logical clock approach can be used to construct a run-time and interactive debugger or performance analyzer. As a result, sequential debugging techniques such as breakpoints can be used, and program behaviour can be graphically displayed during monitoring.
The logical clock approach can support performance analysis in the following two ways. First, although the execution of the monitored program is slowed down (perhaps significantly), there is very little disturbance of the execution time as measured using logical time. Thus, statistics about speedup, processor and channel utilization can be correctly obtained. Second, obtaining improvements in performance often requires an understanding of the behaviour of the system. For example, the graphical information displayed at run-time can help the programmer to discover quickly any bottlenecks which occurred during execution. Thus, by eliminating the bottlenecks, performance improvement can be achieved.
As we have demonstrated, the logical clock approach can even be implemented by a purely software method. This increases the applicability of the approach and reduces the cost of the implementation. However, experience has also shown that in order to achieve higher transparency, the design of the logical clock communication control policies should be consistent with the implementation of the programming language in which the monitored programs are written (e.g., as in the monitoring of ALT communication in occam).
The transputer and occam are representative of many message-passing systems, and the logical clock approach described in this paper can be similarly applied to other distributed memory architectures. This approach is based on the partial ordering theory proposed by Lamport, as described in section 2. Since this partial ordering theory can be easily extended to shared memory systems (e.g., the communication ordering could be the ordering of access to the locks or semaphores for shared objects), the logical clock approach can also be employed in monitoring shared memory systems.
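How the ordering of lock accesses could carry logical time in a shared memory system can be sketched with the standard Lamport clock rules. This is an assumption-laden illustration rather than the algorithm of section 2: `LamportClock` and `TimestampedLock` are hypothetical names, and the lock models only the timestamp bookkeeping, not mutual exclusion itself.

```python
class LamportClock:
    """Standard Lamport logical clock for one process or thread."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def merge(self, other_time):
        # Receive rule: move past the timestamp carried by a message or lock.
        self.time = max(self.time, other_time) + 1
        return self.time


class TimestampedLock:
    """Hypothetical lock that remembers the logical time of its last release."""

    def __init__(self):
        self.last_release = 0

    def acquire(self, clock):
        # Acquiring plays the role of receiving a message from the releaser.
        return clock.merge(self.last_release)

    def release(self, clock):
        # Releasing plays the role of sending a message to the next holder.
        self.last_release = clock.tick()
```

If a thread whose clock reads 3 acquires and then releases the lock, the release is stamped 5; a second thread whose clock reads 0 then acquires the lock at logical time 6, so the acquisition is ordered after the release it depends on.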
In conclusion, the logical clock approach can achieve a relatively high transparency in monitoring parallel programs. It can be implemented by a software as well as a hardware method, and can be used to construct a run-time, interactive and visual debugger or performance analyzer.
