A MIMD processor system is described which allows forward and backward hardware error recovery (FHER and BHER) for data transfers. Fault-tolerant operation is provided by a multiple bus system and distributed error correction controller (ECC) hardware. Addresses and data are transferred through the bus system together with an error correction code, which is generated and checked by the ECC hardware. FHER is possible for all single-bit errors. Overlapping bus protocols allow these distributed ECC operations to be executed and synchronised concurrently with the transfer. Double-bit and multi-bit transfer errors, as well as arbitration errors in the multiple bus system, are detected and corrected by automatic rerouteing of the transfer path (BHER). These features improve the reliability of large multiprocessor systems significantly.
INTRODUCTION
An important aspect of a computer system is its reliability. It is determined partly by the technology employed, i.e. the usage of VLSI components versus discrete ones, of soldered connections versus mechanical ones, etc. However, the reliability of a system depends strongly on the system size. If, for instance, a multiprocessor system is assembled from identical modules, then the probability of a totally error-free operation decreases exponentially with the number of modules used. Hardware errors are thus a usual operating condition in large multiprocessor systems. For this reason, such systems have to be designed so that errors can be corrected or bypassed in some way. If treated in this way, the system performance is only degraded gracefully. The frequency of hardware errors and therefore the need to cope with them becomes even more serious, if a relatively unreliable technology has to be used. This is the case for multiprocessor interconnection networks, which usually consist of a very high number of mechanical connectors. Here it is essential that interprocessor communication should still be possible even in the presence of hardware errors.
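The exponential dependence on system size can be made concrete with a small sketch. It assumes that modules fail independently and that each module operates error-free over a given interval with the same probability; the function name and the value 0.99 are illustrative assumptions, not figures measured on any system described here.

```python
def error_free_probability(module_reliability: float, n_modules: int) -> float:
    """Probability that all n independent, identical modules run error-free.

    Assumption: module failures are statistically independent, so the
    individual probabilities multiply.
    """
    if not 0.0 <= module_reliability <= 1.0:
        raise ValueError("module_reliability must lie in [0, 1]")
    if n_modules < 0:
        raise ValueError("n_modules must be non-negative")
    return module_reliability ** n_modules


# Illustrative only: a per-module reliability of 0.99 is an assumed value.
for n in (1, 10, 30, 60):
    print(n, round(error_free_probability(0.99, n), 3))
```

Under these assumptions the probability of totally error-free operation shrinks geometrically as modules are added, which is why error correction or bypassing becomes a necessity rather than an option in large systems.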
Fault tolerance in multiprocessor systems, and particularly in their interconnection networks, has been investigated theoretically in the last few years. It has been shown how redundant hardware (extra stages) can be added to multistage interconnection networks to provide alternative paths in case of errors. 1 The importance of localising the faulty component precisely has been pointed out. This has been discussed for centralised and distributed routeing control mechanisms. 2 Multistage interconnection networks with graceful degradation have also been analysed using mathematical models. Under realistic assumptions and approximations, formulas have been derived which allow the estimation of a total error rate depending on the reliability of processors, memories, and the interconnection network. 3 A large class of rearrangeable multistage interconnection networks has been investigated. Their performance in terms of speed, cost and fault tolerance was compared. 4,5 A small number of papers discussed reliability aspects and fault-tolerant features of multiple bus systems. Taking graceful degradation into account, the reliability of multiple bus systems was compared to that of a crossbar switch. It was shown that the reliability of a 16-processor system is increased nearly linearly from 20% to 100% by using from 2 to 12 or more buses. 6 All these investigations strongly suggest the use of a fault-tolerant interconnection network with redundant paths.
• Now at: Lehrstuhl für Informatik V, Universität Mannheim, D-6800 Mannheim 1, Germany (email: manner^de.uni-mannheim informatik, mp-sunl).
At the Physical Institute of the University of Heidelberg such a multiprocessor system, called Polyp, 7 has been developed. It is currently used in nuclear physics, 8 digital image analysis, 9 medical 10 and biological 11 applications. Due to the system size, hardware errors of various kinds were expected with non-negligible probability. These errors can be classified as being of temporary or of permanent nature. Typical examples for the first class are soft errors in dynamic memories, electrical interferences on transmission lines, bad mechanical connections, etc. The second class covers errors due to defective components of any kind, e.g. faulty bus drivers. To increase the MTBF (mean time between failures), fault tolerance against the most probable errors has been implemented. This includes correction of temporary errors and bypassing of permanent errors through system reconfiguration. Error detection and correction is handled by multiple co-operating ECC units. To enable permanent errors to be bypassed by system reconfiguration, hardware redundancy has been built into the system architecture. In such a case errors will cause only a graceful degradation of the multiprocessor performance.
Error correction can be done in two ways. If an ECC is only capable of detecting errors, it will request a repetition of the action which caused the error (BHER). Temporary errors can be corrected in this way. For permanent errors, however, a reconfiguration of the system is necessary. If an ECC is also capable of correcting errors, information redundancy is used to reconstruct the correct information (FHER).
THE COMPUTER JOURNAL, VOL. 35, NO. 4, 1992 361
The current version of the ECC in the Polyp system protects address and data transfers over the multiple bus system, as well as data storage in memory, with the help of check bits derived from a modified Hamming code. These check bits are sufficient to allow for the reconstruction of all single-bit errors and detection of all double-bit errors and many multi-bit errors. The error correction unit therefore combines FHER by reconstruction of single-bit errors and BHER for double- and many multi-bit errors. BHER becomes possible because repeatability of transfers, as well as an automatic reconfiguration of the multiple bus system, are included in the system architecture. This kind of error handling increases the MTBF for the most error-sensitive system components, the multiple bus system and the dynamic memories, to the level of all other components.
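The single-error-correcting, double-error-detecting behaviour can be illustrated with a textbook extended Hamming code. This is a minimal sketch and not the modified Hamming code actually used in the Polyp ECC, whose check-bit equations are not given here; the function names and the bit layout (overall parity at index 0, check bits at power-of-two indices) are assumptions made for illustration. For 32 data bits this layout needs 7 check bits.

```python
def hamming_encode(data):
    """Encode a list of 0/1 data bits as an extended Hamming codeword.

    Index 0 holds an overall parity bit, power-of-two indices hold the
    Hamming check bits, and all remaining indices hold the data bits.
    """
    m = len(data)
    r = 0
    while (1 << r) < m + r + 1:          # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        # Check bit p makes the parity of all positions containing bit p even.
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) & 1
    code[0] = sum(code) & 1              # make the total number of ones even
    return code


def hamming_decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    odd_parity = sum(code) & 1
    word = list(code)
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome <= n:
        word[syndrome] ^= 1              # single-bit error: flip it back (FHER case)
        status = "corrected"
    else:
        return "uncorrectable", None     # e.g. double-bit error: repeat transfer (BHER case)
    return status, [word[pos] for pos in range(1, n + 1) if pos & (pos - 1)]


word = [(0xDEADBEEF >> i) & 1 for i in range(32)]
codeword = hamming_encode(word)
codeword[5] ^= 1                         # inject a single-bit error
print(hamming_decode(codeword)[0])
codeword[9] ^= 1                         # inject a second error
print(hamming_decode(codeword)[0])
```

In this sketch a "corrected" result corresponds to the situation handled by FHER, while "uncorrectable" corresponds to an error that can only be handled by repeating the operation.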
Multiprocessor systems based on standard buses like VME bus, Multibus II, Fastbus and Futurebus cannot be operated with the same degree of fault tolerance. These buses protect data by parity only and therefore do not allow FHER. In addition, BHER can be implemented in this case only for temporary errors, because the usage of multiple buses is not supported. Therefore, permanently defective components in the bus systems cannot be bypassed and an automatic hardware reconfiguration is not possible.
THE HEIDELBERG POLYP MULTIPROCESSOR
The Heidelberg Polyp multiprocessor system was designed as a modular, very powerful, and flexible MIMD system for a wide range of applications. Depending on the application, it can be assembled from a large number of modules such as general-purpose processors, I/O processors and a host computer (Fig. 1). Modules of the same kind are organised as a pool of identical resources. Decentralised hardware and software management allows the system load to be distributed over all available modules, independent of the actual number of modules used. This number can be chosen arbitrarily. All modules are interconnected by a multiple bus system, the Polybus. It can again be considered as a pool of independent buses, capable of concurrent transfers. Pool-size independence allows the choice of an arbitrary number of buses, and therefore the global bus bandwidth may be adapted to the actual needs. The Polybus system is also managed by decentralised arbitration hardware, which assigns available buses to requesting modules randomly. This is essential for fault-tolerant operation of the bus system, as described below. Current systems use 30 modules and two global buses, but could be upgraded to 60 modules and eight global buses by filling slots with more PC boards. Each bus is an address/data multiplexed 32-bit bus, comparable to Futurebus or VME bus.
All modules are in turn built up from units such as processors, memories, bus switches to the Polybus system, or controllers (Fig. 2). All units of a module are interconnected by a local bus. This bus is also multiplexed. Mechanically the local bus is a narrow backplane, several of which are mounted in one crate. The modules consist of PC boards plugged on to the backplanes. Current processor units use a 68000 CPU or a 68020/68881 CPU/floating-point processor combination. More than one processor can be used within one module. They operate on local memory units assembled from dynamic memory of 1 Mbyte size.
In general, there are two levels for a logical description of communication in the Polyp system. On the higher level, there exist a number of modules which are interconnected through the Polybus. Apart from local transfers, which are transfers inside one module, communication always takes place between two or more modules and consists of external transfers. The module initiating the transfer is named 'commander' and the answering one(s) the 'responder(s)'. The Polyp system provides different addressing modes, namely direct, broadcast and broadcast-select addressing. In direct addressing exactly one responder is accessed through its direct address. Direct addresses are unique within the whole system. Each module has a second, independently programmable address, the broadcast address. It does not have to be unique and can therefore be used to form pools of identical modules. Here a commander can access a whole pool (broadcast) or select one member of the addressed pool anonymously (broadcast select). A transfer between two modules will therefore always use two local buses and any one of the available global buses together with the corresponding two bus switches. The selection of the Polybus and the bus switches is done during arbitration. Looking at the lower level, each module can be built up as shown in Fig. 3. The unit responsible for the transfer in all addressing modes is called 'master', the unit being addressed in a transfer 'slave'. So a successful transfer will always include one master and one or more slaves. A controller unit can 'control' transfers by asserting and changing certain bus signals, e.g. for correcting errors that occur. 12 In the Polyp system it is then called an ECC.
[Fig. 1 labels: General-purpose processor pool, I/O processor pool, Global memory pool, Host computer, Global bus pool]
OPERATION MODES OF THE ECC
The ECC units of the Polyp system allow for hardware recovery of bit errors. They support FHER for all single-bit memory and transfer errors as well as BHER for all double-bit and many multi-bit transfer errors. The ECC units therefore have to perform three basic tasks: to generate redundant information for addresses and data, to check it, and to correct it by FHER, if possible. For simplicity, we begin with a description of BHER, which only uses the first two operation modes.
Due to the architecture of the Polyp system, especially the decentralisation of ECC and local memory units, it depends on the actual transfer mode chosen, whether generation, checking, or correction of information has to be done. The general philosophy for the current implementation was
• to generate check bits for unprotected addresses and data, if these bits can be checked by another ECC during the same transfer or by the same ECC during a later transfer,
• to check each piece of information protected with check bits,
• to correct single-bit errors (FHER) and to provide correct information for the destination of the information as well as for the source,
• to force repetitions of a failed operation for errors which are not directly correctable (FHER impossible), allowing for BHER.
Generation of check bits
A very simple situation arises with local transfers between a processor and its local memory (Fig. 4). Here a processor asserts unprotected addresses. However, there is no advantage in protecting them by check bits supplied by the ECC, because there would be no other unit to check them. In contrast, unprotected data asserted by the processor during a local write can be protected. They are stored together with the check bits in the memory and can be checked during later reads. If the memory supplies unprotected data during local reads, check bits are again not generated. The situation is more complicated for global transfers via the Polybus system (Fig. 5). Here, it is useful for the commander ECC to generate check bits for addresses and for data during write transfers, because both can be checked by the responder ECC. The responder ECC again does not generate check bits for unprotected addresses. It does, however, supply check bits for unprotected data during writes (because they can be checked by the responder ECC during later reads) and for unprotected data read out of the memory (because they can be checked by the commander ECC). In the write case, the check bits are written into the memory together with the data.
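The generation rules of this subsection can be summarised as a small decision function. This is only one reading of the text, written as a sketch: the function name, the string values and the `protected` flag (information that already carries check bits) are illustrative assumptions, not part of the Polyp hardware description.

```python
def generates_check_bits(scope, role, item, direction, protected=False):
    """Return True if, under this reading of the text, the ECC generates check bits.

    scope:     "local" or "global"
    role:      "commander" or "responder" (ignored for local transfers)
    item:      "address" or "data"
    direction: "read" or "write"
    protected: True if the information already carries check bits
    """
    if protected:
        return False                      # nothing to generate
    if scope == "local":
        # Only data written to the local memory can be checked later.
        return item == "data" and direction == "write"
    if role == "commander":
        # Addresses and written data can be checked by the responder ECC.
        return item == "address" or direction == "write"
    # Responder: never for addresses, but for unprotected data in both directions.
    return item == "data"


print(generates_check_bits("local", None, "address", "read"))
print(generates_check_bits("global", "commander", "address", "read"))
print(generates_check_bits("global", "responder", "data", "read"))
```

The common thread is the first item of the general philosophy stated earlier: check bits are generated only where some ECC can check them, either during the same transfer or during a later one.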
Checking of protected information
The ECC units check all protected information asserted on to their local bus. During local transfers, this can only be the case for data read out of the local memory as described above (Fig. 4). Protected addresses are additionally used during global transfers (Fig. 5). They are checked by the responder ECC. This unit also checks protected data coming from the Polybus system during writes, as well as protected data asserted by the local memory during reads. In addition, the commander ECC checks the data read (which have already been checked by the responder ECC for memory errors) for additional transfer errors. If an error is detected, FHER or BHER is triggered.

Backward error recovery
In the Polyp system, all single-bit errors can be corrected by FHER as described below. BHER, however, is used to correct multi-bit errors by repeating the failed operation until it is completed without errors. This is always possible for temporary errors. Permanent errors, however, can only be corrected by BHER if redundant hardware is available, i.e. if defective components can be bypassed. In the Polyp system, failed local transfers (i.e. reads from the local memory) cannot be corrected in this way. Failed global transfers, however, can be corrected by BHER, if the error occurred in the multiple bus system. Fault tolerance in the multiple bus system is particularly important simply due to the size and complexity of such a system. Additional errors may arise here due to faulty connectors, bus drivers, etc.; in current Polybus configurations, there are over 10^4 to 10^5 mechanical contacts and a comparable number of bus drivers. This system was therefore designed to provide redundant communication paths and to allow defective bus switches and interconnections to be bypassed.
The Polybus system consists of any number of buses, which interconnect any number of local buses like a double crossbar switch (Fig. 6). For reasons of speed, scalability, and fault tolerance, it is managed by a decentralised arbitration logic. 12 An identical arbitration circuit is used at each crossover of a local bus and a global bus. Arbitration is done for each global bus by a rotating token. Because all processors of the Polyp system operate asynchronously, individual modules can request a global bus at any time. Such a request is asserted on to all global buses. The arbitration ensures that the first token to arrive is stopped, making the particular bus unavailable to other modules. Due to the asynchronous operation of the Polyp system and the statistics inherent to multiple, independent instruction streams, the tokens are distributed randomly along the bus system. Global buses and bus switches are therefore assigned randomly to requesting modules.
Fig. 6. Structure of the Polybus system (figure label: Polybus switches)
As an example of BHER in the Polybus system, we assume that one of the bus switches permanently causes detectable multi-bit errors. If, by chance, a transfer is routed via this bus switch, the error will be detected by the responder ECC for addresses and data written or by the commander ECC for data read. Because the error cannot be corrected by FHER, as presumed, the transfer is aborted by the ECC. A re-try bus signal forces the master to release all buses allocated for the current transfer (i.e. its own local bus, the assigned global bus, and the local bus in the responder module) and to re-try the aborted transfer. Processors do this using the re-run feature of the 680x0 microprocessor family. 13 The re-try is thus done by microprogram, without the system or application software being involved. To repeat the transfer, the processor allocates all three buses again. The random global bus assignment now makes it very improbable that the same defective global bus is used again. Even if it is, the master executes another re-try (up to 16 times) before the system software is notified of a non-correctable transfer error. Experience with Polyp systems has shown that this case does not happen in current applications, even with only two global buses.
In the same way, more complicated errors can be corrected. If, for example, the distributed arbitration logic fails and a single bus is assigned to two processors at the same time, this can be detected and corrected by BHER as before.
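The re-try mechanism described in this subsection can be sketched as a small simulation. It rests on several simplifying assumptions that are not taken from the hardware: each arbitration picks a global bus uniformly at random, a transfer over a faulty bus always produces a detectable error, and "up to 16 times" is read as one initial attempt followed by at most 16 re-tries. All names are hypothetical.

```python
import random

MAX_RETRIES = 16  # re-try limit stated in the text


def transfer_with_bher(n_buses, faulty_buses, rng, max_retries=MAX_RETRIES):
    """Simulate one global transfer; return (success, number_of_attempts)."""
    for attempt in range(1, max_retries + 2):    # 1 initial attempt + re-tries
        bus = rng.randrange(n_buses)             # assumed: uniform random assignment
        if bus not in faulty_buses:              # assumed: faulty bus always detected
            return True, attempt
    return False, max_retries + 1                # software would now be notified


def exhaustion_probability(n_buses, n_faulty, max_retries=MAX_RETRIES):
    """Probability that every attempt lands on a faulty bus (same assumptions)."""
    return (n_faulty / n_buses) ** (max_retries + 1)


rng = random.Random(1992)
print(transfer_with_bher(2, {0}, rng))
print(exhaustion_probability(2, 1))
```

Under these assumptions, even with one of only two global buses permanently defective, the chance that all attempts hit the defective bus is 0.5 raised to the 17th power, which is in line with the observation that the software-visible failure does not occur in practice.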
Forward error recovery
In the Polyp system, all protected information asserted on to a local bus is checked. As described above, there are two units which may provide protected information. During local and global read transfers, a memory unit may assert data together with check bits on to the local bus. During global reads, the responder bus switch asserts protected addresses on to the responder local bus, whereas the commander bus switch asserts protected data on to the commander local bus. During global writes, the responder bus switch asserts protected addresses and data on to the responder local bus. In all these cases, FHER is performed if a single-bit error is found (Figs. 4 and 5). This is done by stopping the current transfer, correcting the faulty information, and continuing the transfer. Additional action is required if data which have been asserted by a memory unit are being corrected. To avoid accumulation of single-bit errors, the corrected data together with the check bits are written back into the memory concurrently with the transfer to the master. Such a read operation is therefore executed as a read by the master, but as a write by the slave. This requires special bus protocols.
OVERLAPPED BUS OPERATIONS
In a conventional processor system, two units are synchronised during a transfer: the master, which initiates the transfer, and the slave, which reacts. The same holds true for the Polyp system without ECC units. One exception is broadcast transfers, where multiple slaves are addressed at the same time. With ECC units, however, the situation is more complicated. The architecture of the Polyp system allows the decision whether to use an ECC to be made independently for each module. In local transfers, up to three units have to be synchronised by a bus protocol: the master, the slave, and possibly one ECC. In global non-broadcast transfers, the master, the slave, and possibly one commander ECC and one responder ECC participate. In global broadcast transfers, the master and any number of slaves may or may not use an ECC. Handling all these situations requires a much more complex bus protocol than usual.
For a better understanding of the protocol actually used, we first explain a naive implementation of such a bus protocol. As an example, we take a read transfer from another module. Here, the master in the commander module first asserts addresses on to the local bus. The ECC, which has to supply the check bits, now has to stop the transfer during generation time. As soon as the check bits are asserted, the transfer would be continued with arbitration for a global bus. After assignment, protected addresses are transmitted to all modules. One of them (or more for broadcasts) is selected, so that the protected address is copied on to the internal bus of the responder module(s). Once again the transfer would be stopped by the responder ECC for address checking. Only then is a read access to the memory started. Two more delays are introduced during the data cycle of the transfer, on the responder module for checking memory contents and on the commander module for checking the data transfer. However, such a protocol would slow down all transfers. Using current technology, these four delays needed to generate and check the Hamming code for 32-bit words take approximately as long as the actual transfer. This reduces the total bus bandwidth by a factor of two. Delays due to FHER are considerably longer, but occur much less frequently, and do not degrade the system performance.
Such a reduction of the bus bandwidth is, however, not acceptable for a multiprocessor system, because it would be equivalent to using only half the number of available global buses. Therefore a bus protocol is used which provides an overlapping operation of master, slave and ECC units. To explain the speed advantage of this protocol, we again use an external read transfer as an example (Fig. 7). The transfer is again started by the master asserting an unprotected address on to the commander module bus. In contrast to the former situation, the transfer is not stopped at this point. The commander ECC begins to generate check bits, while at the same time a global bus is allocated, the unprotected addresses are transmitted, a responder is selected, and a memory read cycle is started. This is possible without knowing whether the address was transferred correctly. Checking of the address on the responder module is done in parallel with the read operation, after the check bits for the address have been generated by the commander and transmitted to the responder. The memory now asserts protected data, which are transferred back to the commander via the global bus. At this time the responder ECC has usually finished checking the address, and checking the data for memory errors is started. After the signal transmission delay, the commander ECC is started again and checks the same data for transfer errors. Only after both ECC units signal correct data will the master accept it and finish the transfer. Using this protocol, essentially only one checking operation slows down the transfers. The other three generation/checking operations are executed in parallel with the normal transfer.
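The difference between the naive and the overlapped protocol can be expressed as simple arithmetic. The sketch below assumes, purely for illustration, that the four generation/checking delays are equal and together take as long as the bare transfer, and that in the overlapped protocol only the last operation (taken here to be the final data check) remains on the critical path. The time units and function names are hypothetical.

```python
def naive_transfer_time(transfer, ecc_delays):
    """Every ECC operation stops the transfer, so all delays add up."""
    return transfer + sum(ecc_delays)


def overlapped_transfer_time(transfer, ecc_delays):
    """Only the last ECC operation is assumed to stay on the critical path."""
    return transfer + ecc_delays[-1]


TRANSFER = 4                  # bare transfer time in arbitrary units (assumed)
ECC_DELAYS = [1, 1, 1, 1]     # address generate, address check, two data checks (assumed equal)

naive = naive_transfer_time(TRANSFER, ECC_DELAYS)
overlapped = overlapped_transfer_time(TRANSFER, ECC_DELAYS)
print(naive / TRANSFER)       # slowdown of the naive protocol: 2.0
print(overlapped / TRANSFER)  # slowdown of the overlapped protocol: 1.25
```

With these assumed numbers the naive protocol halves the bandwidth, whereas the overlapped protocol costs only the one exposed check. The real figures depend on the actual delays, which are not quantified individually in the text.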
IMPLEMENTATION OF THE ECC
To perform all the tasks mentioned above, the ECC has to interpret all bus protocols of the different transfers. Therefore the ECC was divided into two blocks (Fig. 8). One of them, the error corrector, handles all tasks concerning error recovery, i.e. the generation of check bits, the checking of protected information, and its correction. The other one, the bus sequencer, is responsible for interpreting all possible states of the local bus and of the error corrector block. Because the Polyp system uses asynchronous bus protocols, this part of the ECC could most easily be designed as a logic sequencer. The input signals to the bus sequencer block are the bus control lines necessary to control the bus transfers, the status signals of the error corrector logic, and certain status variables of the bus sequencer itself. In our case the bus was defined before the ECC was developed, so only the interface to the error corrector had to be defined. Due to the large number of possible states in all the different transfers and the large number of necessary input signals, the bus sequencer part of the ECC was designed with PROMs (programmable read-only memories), using all input signals as addresses. The signals controlling the error corrector and the bus protocol signals generated by the bus sequencer logic correspond to the data outputs of the PROMs, which are programmed for all possible states of the input signals. Even without the error corrector block of the ECC, the bus sequencer will react correctly during each bus transfer, asserting all necessary bus control lines as if the corrector had generated all necessary check bits or checked addresses or data. The development of the error corrector block was mainly influenced by the VLSI chips available. The first prototype of the Polyp ECC unit uses an AM2960 EDC chip. 14 This circuit is able to create check bits and to test and correct protected data, if the correct signals are asserted.
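The idea of a PROM-based sequencer, in which the input signals form the address and the stored word supplies the output signals, can be sketched as a lookup table. The signal names, output lines and programming rule below are hypothetical and far simpler than the real bus protocol; they only illustrate the addressing principle.

```python
INPUT_SIGNALS = ("strobe", "write", "single_err", "multi_err")   # hypothetical inputs
GENERATE, CHECK, CORRECT, RETRY = 1, 2, 4, 8                     # hypothetical output lines


def prom_address(signals):
    """Pack the input signals into a PROM address, one bit per signal."""
    addr = 0
    for bit, name in enumerate(INPUT_SIGNALS):
        if signals.get(name, 0):
            addr |= 1 << bit
    return addr


def program_word(signals):
    """Illustrative programming rule for one input combination."""
    if not signals["strobe"]:
        return 0                       # bus idle: no outputs asserted
    if signals["multi_err"]:
        return RETRY                   # not correctable: force a repetition
    if signals["single_err"]:
        return CORRECT                 # correctable: run the corrector
    return GENERATE if signals["write"] else CHECK


def build_prom():
    """Program every address, i.e. every possible input combination."""
    prom = []
    for addr in range(1 << len(INPUT_SIGNALS)):
        signals = {name: (addr >> bit) & 1 for bit, name in enumerate(INPUT_SIGNALS)}
        prom.append(program_word(signals))
    return prom


PROM = build_prom()


def sequencer_outputs(**signals):
    """One sequencer step is a single table lookup."""
    return PROM[prom_address(signals)]


print(sequencer_outputs(strobe=1, write=1) == GENERATE)
```

Because every input combination has a programmed entry, changing the behaviour means reprogramming the table rather than redesigning logic, which is the flexibility argument made for the PROM approach.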
The modular structure of the ECC allows a high degree of flexibility. Through the use of programmable parts in the bus sequencer logic, changes or enhancements can be easily implemented. This circuit could also be easily adapted to completely different bus systems by redefining the input/output signals of the bus sequencer logic and by reprogramming the PROM. The definition of a hardware interface between the error corrector and bus sequencer logic also allows the implementation of other EDC chips with minor changes only.
The current prototype version could be enhanced by hardware error logging. Two counters would count the number of FHER and BHER operations respectively. Both would be able to interrupt the current master on an overflow condition. With 8-bit counters, an overflow would occur after 256 errors; this is sufficient for the operating system to notice a permanent hardware problem and to take care of it, for example by disabling a defective global bus or isolating a faulty memory module.
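The overflow behaviour of such a counter can be modelled in a few lines. This is a sketch of the proposed logging only; the class and method names are hypothetical, and the interrupt is represented simply by a return value.

```python
class ErrorCounter:
    """Wrap-around counter that reports an overflow when it rolls over to zero."""

    def __init__(self, bits=8):
        self.mask = (1 << bits) - 1
        self.value = 0

    def record(self):
        """Count one recovery operation; return True if the counter overflowed."""
        self.value = (self.value + 1) & self.mask
        return self.value == 0          # wrapped around: would raise the interrupt


fher_counter = ErrorCounter()           # one counter per recovery type, as proposed
events = [fher_counter.record() for _ in range(256)]
print(events.index(True) + 1)           # number of errors at the first overflow: 256
```

A separate instance would be kept for BHER operations, so that the operating system can tell frequent single-bit corrections apart from frequent repeated transfers.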
CONCLUSIONS
In this paper, a fault-tolerant multiple-bus multiprocessor system has been presented. Those aspects of the system architecture which are relevant for fault-tolerant operation were discussed. These include the decentralised modular structure, the multiple bus system providing redundant interconnection paths, and distributed error correction controllers. These controllers, together with the bus system, allow forward and backward hardware error recovery for temporary and permanent errors. Address and data transfer errors are corrected without software intervention. The different operation modes of error correction controllers, i.e. the generation of check bits, the checking of protected information, and the correction of errors, have been discussed in detail. It was shown that such decentralised error correction hardware requires more complicated bus protocols for two reasons. First, up to four units participate in a non-broadcast transfer, i.e. one master, one slave, and up to two error correction controllers (in broadcast transfers, any number of error correction controllers have to be synchronised by a proper protocol). Secondly, a straightforward protocol design would result in a considerable slowdown of all transfers, roughly by a factor of two. It was shown how a special design of the bus protocols allows most of the error correction controller operations to be performed concurrently with the usual transfer cycles, so that transfers are essentially not slowed down.
Several multiprocessor systems with up to 30 processors and two global buses have been set up and operated continuously for two years. A first implementation of error correction controllers has been tested. The ECC's partitioning into a bus sequencer and an error-handling block offers a high degree of flexibility. It can be adapted easily to different bus protocols and/or new error correction VLSI components.
Experiences with small systems (< 10 processors) have shown that error correction for transfers is not necessary here; 30-processor systems, however, could only be operated reliably using the above FHER/BHER features. In this case the overall error rate, mainly due to electrical interferences on the large multiple bus system, could be decreased by a factor of 100.
In an ongoing research project 9,10 one of the 30-processor systems was upgraded to 50 processors. Although a new bus backplane design has improved the overall reliability of recently built systems considerably (by a factor of ~30), this larger system again showed the same error rates as before, because of the increased number of components used. This means that here too the FHER/BHER features discussed above play a key role in the reliability of the whole system.
