Fault-tolerant parallel applications using a network of workstations
PhD thesis
It is becoming common to employ a Network Of Workstations, often referred to as a NOW, for
general purpose computing since the allocation of an individual workstation offers good interactive
response. However, there may still be a need to perform very large scale computations which exceed
the resources of a single workstation. It may be that the amount of processing implies an inconveniently
long duration or that the data manipulated exceeds available storage. One possibility is to employ a
more powerful single machine for such computations. However, there is growing interest in seeking a
cheaper alternative by harnessing the significant idle time often observed in a NOW and also possibly
employing a number of workstations in parallel on a single problem. Parallelisation permits use of the
combined memories of all participating workstations, but also introduces a need for communication, and
success in any hardware environment depends on the amount of communication relative to the amount
of computation required. In the context of a NOW, much success is reported with applications which
have low communication requirements relative to computation requirements.
Here it is claimed that there is reason for investigation into the use of a NOW for parallel execution
of computations which are demanding in storage, potentially even exceeding the sum of memory in
all available workstations. Another consideration is that where a computation is of sufficient scale,
some provision for tolerating partial failures may be desirable. However, generic support for storage
management and fault-tolerance in computations of this scale for a NOW is not currently available and
the suitability of a NOW for solving such computations has not been investigated to any large extent.
The work described here is concerned with these issues.
The approach employed is to make use of an existing distributed system which supports nested
atomic actions (atomic transactions) to structure fault-tolerant computations with persistent objects.
This system is used to develop a fault-tolerant "bag of tasks" computation model, where the bag and
shared objects are located on secondary storage.
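
The "bag of tasks" structure described above can be illustrated with a minimal sketch. It is not the thesis implementation: it runs in a single process with the bag held in memory, whereas the thesis keeps the bag and shared objects on secondary storage and relies on nested atomic actions from an existing distributed system. The names `run_bag`, `TaskFailed` and `make_flaky_square` are hypothetical, chosen only for illustration. The point shown is the commit/abort discipline: a result is recorded only if the task completes, and a task whose worker fails goes back into the bag.

```python
import queue


class TaskFailed(Exception):
    """Hypothetical stand-in for a worker crashing part-way through a task."""


def run_bag(tasks, work, results):
    """Drain the bag; a task whose worker fails is returned to the bag, not lost."""
    bag = queue.Queue()
    for t in tasks:
        bag.put(t)
    while not bag.empty():
        task = bag.get()
        try:
            # Stand-in for an atomic action: compute first, publish only on success.
            value = work(task)
        except TaskFailed:
            bag.put(task)       # abort: the task goes back into the bag unchanged
            continue
        results[task] = value   # commit: the result becomes visible
    return results


def make_flaky_square(fail_once):
    """Build a worker that squares its input but fails once on the given tasks."""
    pending = set(fail_once)

    def work(n):
        if n in pending:
            pending.discard(n)
            raise TaskFailed(n)
        return n * n

    return work


results = run_bag(range(5), make_flaky_square({1, 3}), {})
```

Every task ends up with a result even though two of them fail on their first attempt, because an aborted task leaves no partial state behind and is simply retried.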
In order to understand the factors that affect the performance of large parallel computations on a
NOW, a number of specific applications are developed. The performance of these applications is analysed
using a semi-empirical model. The same measurements underlying these performance predictions
may be employed in estimation of the performance of alternative application structures. Using services
provided by the distributed system referred to above, each application is implemented. The implementation
allows verification of predicted performance and also permits identification of issues regarding
construction of components required to support the chosen application structuring technique. The work
demonstrates that a NOW certainly offers some potential for gain through parallelisation and that for
large grain computations, the cost of implementing fault tolerance is low.
Engineering and Physical Sciences Research Council