Responsive Fault-Tolerant Computing in the era of Terascale Integration – State of Art Report by Ezhilchelvan P
 
 
 
 
 
 
 
   
 
 
 
 
 
 
 
COMPUTING 
SCIENCE 
 
 
 
 
 
Responsive Fault-Tolerant Computing in the era of Terascale Integration 
– State of Art Report 
 
 
P. Ezhilchelvan 
 
 
 
 
 
 
 
 
 
TECHNICAL REPORT SERIES 
              
 
No. CS-TR-1066 February, 2008 
 
Abstract 
 
 
 
Scaling in the hardware integration process results in IC-process geometry reductions, 
lower operating voltages and increased clock speeds. This paper first surveys the 
reliability obstacles these developments give rise to and then points out that 
computing systems can no longer be safely assumed to fail only by crashing. Yet this 
assumption is at the core of primary-backup replication, which the literature presents 
as the appropriate, and hence the most widely used, strategy for time-critical 
fault-tolerant applications. The paper then observes that building computing nodes with 
an announced crash failure mode is a promising way forward to deal with the emerging 
reliability challenges. Work carried out to assure such a failure mode is also 
briefly surveyed.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© 2008 University of Newcastle upon Tyne. 
Printed and published by the University of Newcastle upon Tyne, 
Computing Science, Claremont Tower, Claremont Road, 
Newcastle upon Tyne, NE1 7RU, England. 
Bibliographical details 
 
EZHILCHELVAN, P. 
 
Responsive Fault-Tolerant Computing in the era of Terascale Integration – State of Art Report 
[By] P. Ezhilchelvan. 
 
Newcastle upon Tyne: University of Newcastle upon Tyne: Computing Science, 2008. 
 
(University of Newcastle upon Tyne, Computing Science, Technical Report Series, No. CS-TR-1066) 
 
Added entries 
 
UNIVERSITY OF NEWCASTLE UPON TYNE 
Computing Science. Technical Report Series.  CS-TR-1066 
 
 
 
About the author 
 
Paul Devadoss Ezhilchelvan received the Ph.D. degree in computer science in 1989 from the University of Newcastle 
upon Tyne, United Kingdom. He received the Bachelor of Engineering degree in 1981 from the University of 
Madras, India, and the Master of Engineering degree in 1983 from the Indian Institute of Science, Bangalore. He 
joined the School of Computing Science of the University of Newcastle upon Tyne in 1983, where he is currently 
a Reader in Distributed Computing.  
 
 
Suggested keywords 
 
MICRO-MINIATURISATION,  
SOFT ERRORS,  
SINGLE-EVENT TRANSIENTS,  
CRASH MODE,  
FAIL-ANNOUNCE MODE 
Responsive Fault-Tolerant Computing in the era of Terascale Integration – 
State of Art Report 
 
 
Paul Ezhilchelvan 
School of Computing Science, Newcastle University, Newcastle upon Tyne, UK 
Paul.Ezhilchelvan@ncl.ac.uk 
 
 
 
1. Introduction 
 
Concerns expressed in the literature indicate that 
scaling in hardware technologies can lead to 
application-level failures that were once considered 
unlikely. For example, soft errors (accidental 
flipping of data bits, typically due to cosmic radiation) 
were once considered to be safely containable at lower 
levels by means of error detecting and correcting 
codes. However, CMOS technology trends resulting 
in IC-process geometry reductions, lower operating 
voltages and increased clock speeds have changed the 
landscape: the soft error sensitivity of an SRAM cell has 
increased considerably; more disturbingly, single event 
transients, aided by faster clocks and the lower 
capacitance of smaller gates, can more easily propagate 
through logic gates and get latched as valid signals, 
causing logic circuits to operate incorrectly. Given also 
the exponential growth in the number of SRAM cells 
within a microprocessor, low-level correction codes alone 
cannot be relied upon to prevent soft errors and 
transients from causing a server system to generate 
erroneously computed results. That is, it is no longer 
realistic to hold on to the application-level failure 
model in which a computing system either operates 
correctly or simply crashes.  
The crash failure model is central to the primary-backup 
replication strategy. The literature points out that this 
strategy remains the best among the alternatives for 
achieving fast response times in fault-tolerant real-time 
applications. This raises the central question of how to 
preserve the attractions of primary-backup replication 
while dealing with the reliability concerns that 
terascale integration unintentionally gives rise to. We 
propose that building replication units with the features 
of self-checked behaviour and announced failing 
constitutes the way forward. We also provide a brief 
survey of efforts to build nodes that are 
assured to fail only in an ‘announced-crash’ manner.  
The paper is organised as follows. Section 2 
surveys the literature pointing out the emerging 
reliability obstacles of the terascale integration era, 
and Section 3 reviews attempts to mitigate them at 
low levels. Section 4 presents the primary-backup 
replication scheme, its attractions for responsive 
fault-tolerant computing, and some recent work that 
exploits the features of this scheme for time-critical 
applications. Section 5 motivates the notion of 
‘announced crashing’ and describes the work carried 
out in obtaining that fault abstraction. Section 6 
concludes the paper.  
 
2. Terascale Integration – from a Fault-
Tolerance Perspective 
 
As process scaling in hardware integration 
technologies continues, the number of transistors 
within a microprocessor will reach several billions 
within the next generation. Clock speeds, currently in 
the multi-GHz range, are on an attractive upward trend 
while operating voltages promise a desirable 
downward trend. These developments considerably 
increase the computational power available within 
single-user devices (such as mobile phones and iPods), 
embedded systems and sensor nodes, and the 
resulting benefits are manifestly obvious in today’s 
society. However, from a reliability perspective, 
advances in hardware integration technologies bring 
with them some major obstacles that need to be 
overcome: they leave the devices highly vulnerable to 
in-the-field faults. In particular, there is an alarming 
rate of increase in device sensitivity to a class of 
transient faults traditionally termed soft errors. 
These errors are caused by ionising radiation [1, 2, 3, 
4], which is ubiquitous at terrestrial altitudes and more 
so at high altitudes. 
Shipped chips can exhibit unreliability for four 
different reasons [3] that can be directly attributed to 
the driving forces of terascale integration.  
Ageing: It is known that transistor saturation current 
degrades over time, affecting performance. This 
degradation gets worse as transistor geometries decrease. 
According to [3], ageing of microminiaturised nodes 
will be so bad that its effects need to be catered for at 
the system level throughout the lifetime of a 
microprocessor, not just after some ‘good’ initial 
period has elapsed. 
Screening for defects becomes harder, making 
one-time factory testing ineffective. So, a new device can 
no longer be safely assumed to be defect-free. This 
undermines the traditional fault-repair hypothesis that 
a new module replacing a defective one remains free of 
faults during reconfiguration.  
Soft errors in storage devices. They occur when 
data bits switch states from 0 to 1 or from 1 to 0. The 
resulting bit corruption persists until new data are 
written. They are triggered when the electrical charge 
generated by ionising radiation accumulates beyond a 
critical threshold. The resulting bit-state errors are 
non-permanent and non-recurring in nature [5] (hence 
the qualification ‘soft’). In the literature, soft errors 
resulting in bit-flipping are also called single event 
upsets (SEUs).   
DRAM bit sensitivity to soft errors has decreased 
by 4 to 5 times per generation, thanks mainly to the 
development of 3D capacitor cells [1]. Consequently, 
as the DRAM density (bits per system) scales up, the 
robustness of DRAM systems against soft errors has 
remained constant over several generations and the 
trend is expected to continue. The trend with SRAM 
devices is, however, moving in the opposite direction. 
SRAM bit sensitivity remains constant as the geometry 
sinks deep into the sub-micron range and operating 
voltages scale down. This means that increasing memory 
density can only lead to an increase in the sensitivity of 
SRAM systems. Given that the amount of SRAM within a 
microprocessor is growing exponentially, 
soft error rates are likely only to increase, with 
no end in sight. According to Baumann [1], this is a 
great concern to chip manufacturers because SRAM 
constitutes a large part of all ICs today. 
Soft errors in logic devices. This fourth and final 
class of ramifications represents (by far) the most 
serious reliability obstacle, and one that is expected 
only to become more formidable in future systems [6]. 
In a combinational circuit, collection of sufficient 
radiation-induced charge can generate a short-lived 
transient in the output, traditionally called a 
single-event transient or SET. This phenomenon itself has 
long been known and has little to do with hardware 
process scaling. It can, however, give rise to logic errors 
due to the combined effects of geometry reduction (below 
65 nm), faster clocks and lower operating voltages. A 
SET can easily propagate through many logic gates 
because gate propagation delays shrink on reduced 
process geometries driven by fast clocks; aided by lower 
voltages, the probability that a SET is clocked in as a 
signal on the latching edge of the clock increases. In 
older technologies, a SET could not propagate because 
it would not be strong enough (relative to the operating 
voltage) to produce a full voltage swing, or it would be 
attenuated by large propagation delays; with new 
technologies, however, it stands a higher chance of 
producing an incorrect logic output. SET-induced soft 
errors can affect not just standard-cell logic but also 
programmable logic [2].   
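
The opposing DRAM and SRAM trends described above can 
be illustrated with a small calculation. The sketch below 
is not taken from [1]; it assumes a simple model in which 
the system-level soft error rate is the per-bit rate 
multiplied by the number of bits. The function names, the 
normalised starting values and the per-generation factors 
(a 4-fold DRAM density increase offset by a 4-fold drop in 
bit sensitivity, and a 2-fold SRAM density increase at 
constant bit sensitivity) are illustrative assumptions. 

```python
def system_ser(bit_ser, bits):
    """System soft error rate in the assumed model: per-bit rate times bit count."""
    return bit_ser * bits


def project(bit_ser, bits, bit_ser_factor, density_factor, generations):
    """Return the system rate for generations 0..generations.

    bit_ser_factor: multiplier applied to the per-bit rate each generation
    density_factor: multiplier applied to the bit count each generation
    """
    rates = []
    for _ in range(generations + 1):
        rates.append(system_ser(bit_ser, bits))
        bit_ser *= bit_ser_factor
        bits *= density_factor
    return rates


# Hypothetical, normalised starting point: per-bit rate 1.0, one unit of memory.
dram = project(1.0, 1.0, bit_ser_factor=1 / 4, density_factor=4, generations=3)
sram = project(1.0, 1.0, bit_ser_factor=1.0, density_factor=2, generations=3)
print("DRAM system rate per generation:", dram)
print("SRAM system rate per generation:", sram)
```

Under these assumptions the DRAM system rate stays flat 
while the SRAM system rate grows in step with density. 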
 
3. Low-Level Mitigation Attempts  
 
Error detection codes (EDC) and error correction 
codes (ECC), such as Hamming codes, have long been 
used to guard very effectively against SEUs involving a 
single bit within a memory word. Concerns remain, 
however, about the increased likelihood of SEUs 
affecting multiple bits due to the microminiaturisation 
of the new era. Nicolaidis of iRoC is quoted in [2] as 
saying: “A single 
neutron strike can emit multiple secondary particles, 
typically ions, that can strike multiple memory cells. If 
the affected cells belong to the same memory word, 
then corrective codes cannot fix it”.  
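
The limitation noted in the quotation can be seen on a toy 
example. The following sketch uses the textbook 
Hamming(7,4) single-error-correcting code as a stand-in; 
memory ECC in real devices typically uses wider codes, so 
the word size and the function names here are assumptions 
made only for illustration. A single flipped bit is 
corrected, whereas two flipped bits in the same word are 
silently miscorrected. 

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as a Hamming(7,4) codeword.

    Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Flip the bit named by the syndrome (if any) and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of the suspected bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *positions):
    """Return a copy of codeword with the given 0-based bit positions inverted."""
    c = list(codeword)
    for p in positions:
        c[p] ^= 1
    return c


word = [1, 0, 1, 1]
stored = encode(word)
print("single upset recovered:", decode(flip(stored, 4)) == word)
print("double upset recovered:", decode(flip(stored, 4, 5)) == word)
```

In the double-upset case the decoder reports nothing 
unusual, which is how an uncorrected error can reach the 
application level. 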
Baumann of Texas Instruments [1] cautions against 
any notion that ECC and EDC for SRAM within a 
microprocessor will continue to be cost-effective. 
The overhead of these protective codes has been 
acceptably small only when memory arrays were large.  
But when SRAM is spread in small blocks throughout 
the SOC, employing EDC/ECC can lead to a 50% increase 
in area and turns out to be a self-defeating option in the 
age of miniaturisation. Simulation studies by Baumann 
[1] indicate that a computational system (a typical 
commercial server in continuous operation) with a 
modest 100 chips, each with 25 Mbit of SRAM, can 
experience 5 SEUs per month. These SEUs are 
guaranteed not to lead to an application-level error only 
when (i) none of them affects more than a single bit in 
an SRAM memory word and (ii) ECC protection has been 
built in. The former amounts to taking a chance with a 
randomly striking neutron. Not surprisingly, research 
focus is shifting to a hybrid approach involving 
both hardware and software support [7, 8]. 
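
Condition (i) can be quantified under a simple assumption 
that is not made in [1]: suppose each SEU independently 
affects more than one bit of a word with some probability 
p_multi. The sketch below computes the chance that every 
SEU in a given period is a single-bit one. The value of 
p_multi is hypothetical, while the rate of 5 SEUs per 
month is the figure quoted above. 

```python
def prob_all_single_bit(p_multi, n_upsets):
    """Probability that none of n_upsets independent SEUs is a multi-bit upset."""
    return (1.0 - p_multi) ** n_upsets


SEUS_PER_MONTH = 5     # figure quoted from [1] in the text
P_MULTI = 0.01         # hypothetical chance that one SEU hits several bits of a word

for months in (1, 12, 60):
    upsets = SEUS_PER_MONTH * months
    chance = prob_all_single_bit(P_MULTI, upsets)
    print(f"{months:2d} months, {upsets:3d} SEUs: {chance:.3f}")
```

Even a small per-upset probability compounds over a long 
period of continuous operation, which is the sense in 
which relying on condition (i) amounts to taking a chance. 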
SET-induced logic errors were not a problem for 
feature sizes of 0.25 micron and above and are a 
concern for military and aerospace applications at 
the 0.18-0.13 micron level. They are poised to become a 
headache for logic designers as feature sizes go to 65 
nm.  “At 90-nm design rules, very small node charges 
can affect a 90-nm flip-flop; as the industry goes to 65 
nm, the flip-flop soft error rate will increase” [6].  
The current trend in military and aerospace 
applications is to take a ‘hardened-by-design’ approach, 
which involves incorporating redundant circuits (e.g., 
triple modular redundancy or TMR) so that the 
circuitry is inherently hardened against incoming 
radiation. The problems with this approach are two-fold.  
First, redundant circuits must be spaced sufficiently far 
apart for hardening-by-design to be effective, which 
leads to wiring difficulties and increased delays. 
Secondly, for commercial applications, it is not 
economically viable. Borkar et al. [3] propose a 
way out in the form of chip designs that support 
‘availability on demand’: the chip design supports 
dynamic configuration of circuit redundancy to suit 
the application-level availability requirements.  
Generally speaking, for commercial server-based 
applications, there is a growing school of opinion that 
higher-level fault-tolerant solutions will be 
necessary in spite of, and in addition to, low-level 
logic redundancies. For instance, following an 
extensive experimental study on characterising soft 
errors, Karnik et al. [5] conclude that the overhead of 
protecting each circuit will not be acceptable and that 
it will be necessary to seek a software-based approach to 
error detection and recovery. A similar conclusion is 
reached in [4], which advocates a dependability layer 
above the computational resource layer to deal with the 
inherent undependability that arises at the physical and 
transistor levels as we move to emerging technology 
nodes of 45 nm and beyond. 
In what follows, we focus on the requirements 
of application-level fault-tolerant mechanisms 
necessary to deal with the uncorrected manifestations 
of low-level SEUs and SET-induced logic errors. 
These requirements will be used to assess the 
(in)adequacies of popular fault-tolerant mechanisms 
used in responsive computing.  
 
4. Crash Assumption and Responsive 
Fault-Tolerance   
 
Application-level fault tolerance provisioning 
involves replicating the application on several 
networked computer systems and managing the 
redundant system to mask the failures of a finite 
number of replicas. One of the most common fault 
assumptions made is that a computer system fails only 
by crashing: it either executes the application correctly 
in a timely fashion or stops functioning until re-boot. 
An attractive aspect of the crash fault model from a 
fault-tolerance perspective is that a functioning system 
can be trusted to generate only correct responses. This 
feature allows the low-overhead primary-backup 
replication scheme [9] to be pursued for obtaining 
responsive fault-tolerant solutions at the application 
level. In this scheme, only the primary server performs 
the computation and the normal replication overhead is 
limited to the primary disseminating its checkpoints to 
the backup servers. Only when the primary server 
crashes is an exceptional overhead incurred for 
detecting the crash event and installing one of the 
backup servers as the new primary; the redundant 
system remains unavailable during this fail-over 
period. The replication scheme has the potential to 
closely match 1-server (unreplicated) performance 
when crashes are rare and fail-over is 
swift. It guarantees reliability so long as checkpoints 
from the primary server are not erroneous. Thus, the 
crash assumption is crucial to this scheme. 
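
The scheme just described can be sketched in a few lines. 
This is a minimal illustration under stated assumptions, 
not the protocol of [9]: the class names are hypothetical, 
the application state is a running total, checkpoints are 
delivered instantly, and a crash is modelled by a flag 
that is noticed without delay. 

```python
class Server:
    """A replica whose application state is a running total (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.crashed = False

    def execute(self, request):
        self.state += request
        return self.state


class PrimaryBackup:
    def __init__(self, servers):
        self.servers = list(servers)       # servers[0] acts as the primary
        self.failovers = 0

    def handle(self, request):
        # Fail-over: discard crashed primaries and promote the next backup.
        while self.servers and self.servers[0].crashed:
            self.servers.pop(0)
            self.failovers += 1
        if not self.servers:
            raise RuntimeError("all replicas have crashed")
        primary, backups = self.servers[0], self.servers[1:]
        result = primary.execute(request)  # only the primary computes
        for backup in backups:             # normal overhead: checkpointing
            if not backup.crashed:
                backup.state = primary.state
        return result


a, b = Server("a"), Server("b")
service = PrimaryBackup([a, b])
print(service.handle(5))    # computed by primary a
print(service.handle(2))
a.crashed = True            # the primary crashes
print(service.handle(1))    # b takes over from the last checkpoint
```

Note that the backups install whatever state the primary 
sends. Nothing in the sketch checks a checkpoint, which is 
why an erroneous but still functioning primary defeats the 
scheme. 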
The alternative to the primary-backup scheme is active 
replication, or state machine replication, whereby each 
replica executes the application and the results 
produced are subject to a vote or selection at 
the client end. (TMR is a classical example of active 
replication.) When several client requests may be 
submitted concurrently, replicas must agree on 
an identical processing order for incoming requests 
[10].  The overhead associated with ordering requests 
prior to their processing slows down the response to a 
request. For this reason, active replication is not widely 
used in real-time applications with stringent 
response-time requirements, even though it is not 
restricted to the crash assumption. 
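
The role of voting and of request ordering can be shown 
with a small sketch. It assumes a hypothetical 
deterministic state machine with two request types and 
models a faulty replica as one that returns an 
off-by-one result; neither detail comes from [10]. 

```python
from collections import Counter


def run_replica(requests, faulty=False):
    """Apply ('add', n) and ('mul', n) requests in order, starting from state 1."""
    state = 1
    for op, n in requests:
        state = state + n if op == "add" else state * n
    return state + 1 if faulty else state   # a faulty replica emits a wrong result


def vote(responses):
    """Return the strict-majority response, or None if there is none."""
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) // 2 else None


requests = [("add", 3), ("mul", 2)]
reordered = list(reversed(requests))

# Three replicas, identical order, one faulty: the vote masks the fault.
agreed = [run_replica(requests), run_replica(requests),
          run_replica(requests, faulty=True)]
print(agreed, "->", vote(agreed))

# One correct replica processes the requests in a different order: together
# with the faulty replica this leaves no majority.
diverged = [run_replica(requests), run_replica(reordered),
            run_replica(requests, faulty=True)]
print(diverged, "->", vote(diverged))
```

The second case shows why the replicas must run an 
ordering step before processing, which is the overhead 
that lengthens response times. 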
Primary-backup replication is the traditional 
approach taken in real-time fault-tolerant computing. 
For instance, Essame, Arlat and Powell [11] apply this 
approach to managing replicated traffic controllers in 
a railway traffic control system. Usually, only one 
backup is used and real-time systems, like [11], take 
advantage of this to minimise fail-over time. Zou and 
Jahanian [12] minimise the normal checkpointing 
overhead in primary-backup replication by allowing 
state divergence between replicas, subject to 
constraints tuneable at the application level.  
It should be observed that the crash fault model 
encapsulates an optimism that hardware-level errors 
(and also OS- and application-level bugs) are either 
captured and corrected or result only in a computer 
system being unresponsive. The underlying sources of 
system crashes induced by hardware errors are generally 
attributed to component ageing.  From the discussion 
presented earlier, this optimism needs to be revisited 
for future systems of the terascale era.  With the 
increasing likelihood of swift ageing, infant mortality, 
SEUs rendering corrective efforts ineffective and soft 
errors appearing in logic, future systems will crash more 
frequently or, even worse, be likely to generate 
erroneously computed results. The former dilutes 
primary-backup responsiveness (due to frequent 
fail-over) and the latter makes the scheme itself 
unreliable.  
A way forward appears to be to apply TMR at the 
computer system level. But this unfortunately 
introduces ordering overhead, the avoidance of which is 
the very reason responsive FT computing relies on 
primary-backup as the default replication scheme. An 
alternative is to adopt, as the de facto approach, the 
one advocated and pursued by some leading researchers 
for critical applications. This involves building the 
units of replication with appropriate failure modes out 
of (off-the-shelf) systems prone to frequent crashes or 
to emitting erroneous results. 
 
5. Engineering Announced Crashing 
 
Suppose that a replication unit, RU for short, within 
a primary-backup replication scheme is purpose-built to 
fail only by crashing and, additionally, to announce its 
imminent crash to interested destinations. While the 
former strengthens the applicability of the scheme, 
the latter reduces the overhead associated 
with fail-over and thus contributes to dealing with 
frequent crashes quite effectively. The dominant 
component in fail-over unavailability is the latency of 
detecting the primary crash. This in turn is determined 
by the frequency of ‘heart-beats’ sent by the primary. A 
high frequency will flood the network and a low one 
leaves the primary crash undetected for a longer 
period. When crashes are announced by the crashing 
primary RU itself, detection becomes as swift as 
message communication and heart-beats can be done 
away with. 
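
The trade-off can be made concrete with a simple timing 
model. The model is an assumption introduced here, not 
one taken from the cited work: message delays are bounded 
by a known maximum, and the backup suspects the primary 
once one heart-beat period plus that maximum delay has 
elapsed since the last heart-beat it received. The 
function names and the 5 ms delay bound are hypothetical. 

```python
def heartbeat_detection_bound_ms(period_ms, max_delay_ms):
    """Worst-case crash detection latency with heart-beats.

    Worst case in the assumed model: the primary crashes right after sending
    a heart-beat that takes max_delay_ms to arrive, and the backup then waits
    a full timeout of period_ms + max_delay_ms before suspecting it.
    """
    return max_delay_ms + (period_ms + max_delay_ms)


def announced_detection_bound_ms(max_delay_ms):
    """With an announced crash, detection costs one message delay."""
    return max_delay_ms


def heartbeats_per_second(period_ms):
    """Network load generated by one primary sending heart-beats."""
    return 1000 / period_ms


MAX_DELAY_MS = 5   # hypothetical bound on message delay
for period_ms in (10, 100, 1000):
    print(f"period {period_ms:4d} ms: "
          f"{heartbeats_per_second(period_ms):5.1f} msgs/s, "
          f"detection within {heartbeat_detection_bound_ms(period_ms, MAX_DELAY_MS)} ms")
print("announced crash: detection within",
      announced_detection_bound_ms(MAX_DELAY_MS), "ms, no heart-beats")
```

Shortening the period lowers the detection bound only at 
the price of more messages, whereas the announced crash 
reaches the single-message bound with no periodic traffic. 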
Construction of an RU with Announced Crash (AC) 
features is intuitively simple [13]: active replication for 
computation and the primary-backup route for ordering. 
A set of loosely synchronised computer systems perform 
the same computation (as in active replication) and 
compare each other’s responses. All of them stop 
functioning as soon as even one of them produces an 
incorrect response or no response. (k+1)-redundancy can 
ensure AC in the presence of k simultaneous system 
failures. Ordering of requests for active replication 
within an RU is accomplished by entrusting one member 
to decide and the others to follow, in a primary-backup 
fashion, without using any explicit order protocol.  
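
The construction can be sketched as follows. This is a 
minimal illustration under stated assumptions rather than 
the design of [13]: the class and callback names are 
hypothetical, members are modelled as functions that 
return None when silent, and the request order is taken 
to be already fixed by the deciding member. 

```python
class ReplicationUnit:
    """An announced-crash RU built from k+1 members that cross-check responses."""

    def __init__(self, members, announce):
        self.members = members      # callables: request -> response, None if silent
        self.announce = announce    # callback that delivers the crash announcement
        self.halted = False

    def process(self, request):
        if self.halted:
            return None
        responses = [member(request) for member in self.members]
        silent = any(r is None for r in responses)
        if silent or len(set(responses)) != 1:
            # A mismatch or a missing response: every member stops, and the
            # RU tells interested destinations that it is going down.
            self.halted = True
            self.announce("RU halting")
            return None
        return responses[0]


def correct(x):
    return 2 * x


def faulty(x):
    # Hypothetical member that computes a wrong result for one input.
    return 2 * x + 1 if x == 3 else 2 * x


announcements = []
ru = ReplicationUnit([correct, faulty], announcements.append)   # k = 1
print(ru.process(1))     # members agree, so the result is emitted
print(ru.process(3))     # mismatch: nothing is emitted and the RU halts
print(announcements)
```

With k+1 members and at most k of them faulty, at least 
one response is correct, so an erroneous response always 
shows up as a disagreement and is never emitted. 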
A fail-stop computer [14] is an example of an RU 
with the AC property. It additionally has a stable store 
which retains the process state prior to crashing. This 
additional feature, however, requires (2k+1)-redundancy 
and the use of an explicit order protocol. Given the 
performance degradation that such a protocol incurs, an 
RU with a stable store is unnecessary, since a 
distributed stable store can be built at the next level 
with a system of interconnected RUs.   
MARS [15] is a practical embedded system that uses 
RUs with the AC property. Each computer system is 
‘hardened-by-design’ and hence fails only by crashing. 
To incorporate the announcement feature, two computers 
are paired to form an RU (k=1) with one of the two 
designated as a shadow. When the shadow computer 
observes that its paired counterpart has been silent 
during its TDMA slot, it emits the result in its own 
TDMA slot. Emission from the shadow is an implicit 
announcement that the RU is preparing to halt. (In 
MARS, there is no explicit task ordering, even in 
primary-backup fashion; ordering is done implicitly by 
taking advantage of the time-triggered design approach.) 
HP’s NSAA [16] is an RU with k=2 and its computer 
systems are off-the-shelf and not assumed to be 
‘hardened-by-design’. Lately, MARS proposes TMR-style 
replication to improve RU availability (Section 
V in [17]). 
The AC feature considerably increases responsiveness 
when primary-backup is to be implemented over the 
Internet or any asynchronous network where message 
communication delays (between RUs) cannot be 
bounded with certainty. Over such networks, detecting 
the crash of an RU without the AC property becomes a 
performance bottleneck.  Indeed, the problem cannot be 
solved deterministically with guarantees on 
termination even when failures are restricted to 
crashes. This impossibility is known as the FLP 
impossibility [18]. It arises from the fact that a 
crashed RU cannot be distinguished with certainty from a 
slow one. Circumventing this impossibility requires 
complex protocols involving rounds of message exchange 
between (functioning) RUs. We refer the reader to [19] 
for a survey of these protocols, which are traditionally 
called consensus protocols. (Here, backup RUs need to 
reach a consensus on the operational state of the 
primary.) The AC property makes the FLP impossibility 
inapplicable, as crashes are announced by the source 
itself and the need to distinguish a crashed RU from 
a slow one does not arise.  
 
6. Concluding Remarks 
 
We have pointed out that engineering 
computational nodes with an announced crash failure 
mode is the appropriate way forward to achieve 
responsive fault-tolerance in future. The rationale 
comes from the reliability fallout of continuing 
developments in the hardware integration process. This 
fallout has been surveyed, leading to the 
conclusion that off-the-shelf computers can no longer 
be safely assumed to fail only by crashing and to fail 
rarely.  This undermines the use of the primary-backup 
replication scheme, which our survey establishes to be 
the widely used and highly favoured strategy for 
achieving responsive fault-tolerance.  
 
References 
 
[1]   Baumann, R., “Soft Errors in Advanced Computer 
Systems”, IEEE Design and Test of Computers, pp. 258-266, 
May-June 2005. 
[2]    Santarini, M., “Single Event Effects: Cosmic Radiation 
Comes to ASIC and SOC Design”, www.edn.com, May 
2005, pp. 46- 56.  
[3]   Borkar, S., Jouppi, N.P., and Stenström, P., 
“Microprocessors in the Era of Terascale Integration”, In 
Proc. IEEE/ACM Conf. Design Automation and Test in 
Europe (DATE07), pp. 237-242, 2007. 
[4]   Serpanos, D., and Henkel, J., “Dependability and 
Security Will Change Embedded Computing”, IEEE 
Computer, 41(1), January 2008, pp. 103-105. 
[5]   Karnik, T., Hazucha, P., and Patel, J., 
“Characterisation of Soft Errors Caused by Single Event 
Upsets in CMOS Processes”, IEEE TDSC, 1(2), April 2004, 
pp. 128-143. 
[6]  Lammers, D., and Wilson, R., “Soft Errors Become Hard 
Truth for Logic”, EETimes Online, March 2004. 
(www.eetimes.com/showArticle.jhtml?articleID=19400052) 
[7]   Reis, G.A., Chang, J., Vachharajani, N., Rangan, 
R., August, D.I., and Mukherjee, S.S., “Design and 
Evaluation of Hybrid Fault-Detection Systems”, 
International Symposium on Computer Architecture, 2005, 
pp. 148-159. 
[8]     Latif, M.M., R. Ramaseshan, and R. Mueller, “Soft 
Error Protection via Fault-Resilient Data Representations”, 
In Proc. 3rd IEEE Workshop on Silicon Errors in Logic – 
System Effects (SELSE), Austin, April 2007.  
[9]   Budhiraja, N., K. Marzullo, F. B. Schneider and S. 
Toueg, “The Primary-Backup Approach”, In Distributed 
Systems (Ed. S. Mullender), pp. 199–216, 1993. 
[10]   Schneider, F., “Replication management using the 
state-machine approach”, In Distributed Systems (Ed. 
S. Mullender), ACM Press, pp. 169–197, 1993. 
[11]   Essame D., J. Arlat and D. Powell, “PADRE: A 
Protocol for Asymmetric Duplex Redundancy”, in IFIP 7th 
Working Conference on Dependable Computing for Critical 
Applications (DCCA7), pp. 213-232, January 1999.  
[12]     Zou, H., and F. Jahanian, “A Real-Time Primary-
Backup Replication Service”, IEEE TPDS, 10(6), pp. 533-
548, June 1999. 
[13]      Shrivastava S K, Ezhilchelvan P D, Speirs N A, Tao 
S and Tully A: Principal Features of  the VOLTAN family of 
Reliable Node Architectures for Distributed systems. IEEE 
Transactions on Computers, C-41(5), pp.542-549, 1992. 
[14]   Schneider, F., "Byzantine Generals in Action: 
Implementing Fail-Stop Processors", ACM Transactions on 
Computer Systems, Vol. 2(2), pp. 145-154, May 1984. 
[15]       Kopetz, H., A. Damm, C. Koza, M. Mulazzani, W. 
Schwabl, C. Senft, and R. Zainlinger, "Distributed Fault-
Tolerant Real-Time Systems: The Mars Approach", IEEE 
Micro,  9(1), pp. 25-40, 1989. 
[16]   Bernick, D., B. Bruckert, P.D. Vigna, D. Garcia, R. 
Jardine, J. Klecka and J. Smullen, “Non-Stop Advanced 
Architecture”, In Proc. International Conf. on Dependable 
Systems and Networks (DSN05), pp. 12-21, 2005. 
[17]    Kopetz, H., and G. Bauer, "The Time-Triggered 
Architecture", Proceedings of the IEEE, 91(1), pp. 112-126, 
January 2003. 
[18]   Fischer, M.J., N.A. Lynch, and M.S. Paterson, 
“Impossibility of Distributed Consensus with one faulty 
Process”, Journal of the ACM, 32(2): 374-382, April 1985. 
[19]      Raynal, M. and M. Singhal, “Mastering Agreement 
Problems in Distributed Systems”, IEEE Software, 18(4), pp. 
40-47, July 2001.  
