A methodology to systematically identify and isolate bugs in floating point implementations in high-performance multiple-CPU computing systems is formulated. A validation suite is written and tested. Results show improper implementation. Proper implementation guidelines are suggested and prototyped.
Introduction
High-performance computers often sacrifice correctness in order to achieve throughput, particularly when it comes to floating point operations. "It is the TFLOPS that count, even if a few of the trillion operations are meaningless, thereby rendering the end result absurd" seems to be the attitude of most vendors of computing equipment that deals with floating point operations. This has been observed by the author [1], as well as by others [2]. The principles behind correct floating point arithmetic are universally understood and accepted [3]. Floating point units, which are implemented in hardware at the chip level in most modern microprocessors, provide a correct and well-meaning implementation [4, 5] of the universally accepted IEEE 754 [6] standard. That is because hardware is much more carefully engineered than system and application software, and it is not possible to sell faulty hardware and get away with it easily [7, 9, 8]. However, in system software, it is possible to support floating point operations very badly and succeed. A systematic apathy towards the correct implementation of floating point standards at all levels of system software in high-performance computers has resulted in alarming situations where such computers regularly produce completely absurd results without any kind of error or warning message.
And there exist many people (some of whom I know) who accept these results without suspicion. While it is nobody's case to condone such a lenient attitude on the part of the users, what we, the computer scientists and numerical analysts, can do is to stem the rot and force the vendors to produce computing environments that implement correct floating point arithmetic. We, at SERC, have taken the following approach:
• Make people aware of these bugs by speaking out in classrooms, conferences, etc., and by writing literature explaining floating point arithmetic [10] and high-

the floating point values go from one virtual address space to another disjoint virtual address space very easily, and it is difficult to locate the original bug that initiated the syndrome; the codes on these machines are often not easily serializable, as they depend strongly on the multicomputer configuration on which they are intended to be run; and, on serialization, many floating point anomalies are not reproducible, because they depend on the nominal (normalized) values of the variables in question, and such values in turn depend on the task-partitioning scheme.
Thus errors must be detected in the actual parallel computing configuration, as topologically close as possible to the point where the computation started going absurd. Therefore, our suite does not attempt to serialize and then validate floating point operations. It tracks down the implementation anomalies "in place", as they actually occur in a given multiprocessor/multicomputer configuration. It understands the architectures of parallel and distributed computers. This feature makes it unique among the floating point validation suites that exist as of now (e.g. the one from NAG [13]). The rest of the paper is organized as follows. In section 2, we discuss the organization of the validation suite. We tested a number of single-CPU, parallel and distributed computers using this suite. The results that might interest the high-performance computing community are presented in section 3. We conclude in section 4.
The suite
The suite tests whether the programming environment correctly handles the items described in the subsections that follow.
Abnormal IEEE 754 Values
The suite checks for the following:
Quiet NaNs: Are they produced when absurd operations are attempted? Does the system produce a quiet NaN when quiet NaNs are used as the input operands of a typical floating point operation? In a parallel computer, is the (nodeid, pid) pair embedded in, or pointed to by, the nonzero bit-field of the QNaN?
Signaling NaNs: Do they make the FPU/CPU trap?
Inexacts: How are they handled? Does the presence of one inexact set off a syndrome of a chain of inexacts, thereby rapidly losing precision?
Using this part of the suite is manpower-intensive and involves comprehensive documentation activities too.
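The quiet NaN checks listed above can be illustrated with a minimal sketch in C. It is not the code of the suite. It assumes a 64-bit IEEE 754 double whose quiet bit is the most significant fraction bit, and the payload layout (nodeid in bits 47..32, pid in bits 31..0) as well as all function names are hypothetical choices made only for this illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Quiet NaN with an empty payload: exponent all ones, quiet bit (bit 51) set.
 * Assumption: the platform marks quiet NaNs with the top fraction bit. */
#define QNAN_BASE 0x7FF8000000000000ULL

static uint64_t bits_of(double x)
{
    uint64_t u;
    assert(sizeof u == sizeof x);   /* the sketch assumes a 64-bit double */
    memcpy(&u, &x, sizeof u);
    return u;
}

static double double_of(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* An absurd operation, 0/0; volatile stops compile-time folding. */
double absurd_quotient(void)
{
    volatile double zero = 0.0;
    return zero / zero;
}

/* Does a quiet NaN operand yield a NaN result? */
int qnan_propagates(double q)
{
    volatile double operand = q;
    volatile double result = operand + 1.0;
    return isnan(result) != 0;
}

/* Hypothetical layout: nodeid in bits 47..32, pid in bits 31..0. */
double make_tagged_qnan(uint16_t nodeid, uint32_t pid)
{
    return double_of(QNAN_BASE | ((uint64_t)nodeid << 32) | (uint64_t)pid);
}

uint16_t qnan_nodeid(double x) { return (uint16_t)(bits_of(x) >> 32); }
uint32_t qnan_pid(double x)    { return (uint32_t)bits_of(x); }
```

A test driver would call absurd_quotient() and check the result with isnan(), then build a tagged NaN with make_tagged_qnan() and read the pair back. Whether the payload survives an arithmetic operation is platform-dependent, so the sketch checks only that the result is still a NaN.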
Machine Constants
The suite also computes:
Machine-epsilon: The minimum number that makes a significant difference in floating-point computation, i.e. the smallest ε for which 1 + ε is distinguishable from 1.
Dwarf: The minimum non-zero positive number that can be stored in an IEEE 754 floating point variable.
These depend on the floating-point format rather than on the programming language used. We recognize that, and the suite allows the user to examine the bit-fields of these two important constants. This is also the reason we did not write the suite in Fortran-90 or attempt to portray the properties of the machine constants as associated with any programming language. Our observation has been that compiler implementors are equally apathetic to floating point issues. Thus there exists no ideal programming language in which to code the suite. We used C because there is a C compiler on all machines and because we can set/test bit patterns in C.
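As an illustration of how the two machine constants can be computed and their bit-fields examined, here is a minimal sketch in C. It is not the code of the suite. It assumes a 64-bit IEEE 754 double, and the names machine_epsilon, dwarf, fields_t and split_fields are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Halve eps until 1 + eps/2 is no longer distinguishable from 1.
 * volatile forces every intermediate to be stored as a double. */
double machine_epsilon(void)
{
    volatile double eps = 1.0, sum;
    for (;;) {
        sum = 1.0 + eps / 2.0;
        if (sum == 1.0)
            break;
        eps = eps / 2.0;
    }
    return eps;
}

/* Halve d until the next halving gives zero; the last nonzero value is
 * the dwarf. With gradual underflow this is the smallest denormal,
 * with abrupt underflow it is the smallest normalized number. */
double dwarf(void)
{
    volatile double d = 1.0, half;
    for (;;) {
        half = d / 2.0;
        if (half == 0.0)
            break;
        d = half;
    }
    return d;
}

/* The three bit-fields of a 64-bit IEEE 754 double. */
typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, biased by 1023 */
    uint64_t fraction;  /* 52 bits */
} fields_t;

fields_t split_fields(double x)
{
    uint64_t u;
    fields_t f;
    assert(sizeof u == sizeof x);
    memcpy(&u, &x, sizeof u);
    f.sign = (unsigned)(u >> 63);
    f.exponent = (unsigned)((u >> 52) & 0x7FF);
    f.fraction = u & ((((uint64_t)1) << 52) - 1);
    return f;
}
```

On a conforming double-precision implementation, machine_epsilon() should agree with DBL_EPSILON from <float.h>, and split_fields() lets the user inspect where each constant sits in the format.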
Rounding Modes and Underflow
The suite also determines which rounding modes are usable on the FPU, and whether there is any provision for the user to set/test the mode. The suite can also determine whether underflow is gradual or abrupt, and whether the user can set/test that feature.
Parallel Computing Aspects
On distributed-memory machines, the suite tests whether the message-passing primitives unsuspectingly pass on any of the abnormal bit-patterns that are listed in section 2.1. On shared-memory machines, there should be a mechanism to trap these values when they cross the virtual address boundaries of processes/tasks, even if the abnormal IEEE 754 floating point value remains in the same physical storage location. I could not, until now, figure out a good way of doing that. Therefore, I could not design such a test in the suite. I just tested whether abnormals are able to cross task virtual address space boundaries with gay abandon on shared-memory machines.
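The kind of check that a message-passing layer could apply before letting a buffer leave a task's address space can be sketched as follows. This is an illustration only. The names abnormal_count_t, scan_for_abnormals and safe_to_send are hypothetical, the choice of NaNs, infinities and denormals as the abnormal classes is an assumption, and no real message-passing library is called.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Tally of abnormal values found in a buffer. */
typedef struct {
    size_t nans;
    size_t infinities;
    size_t denormals;
} abnormal_count_t;

/* Classify every element of buf and count the abnormal ones. */
abnormal_count_t scan_for_abnormals(const double *buf, size_t n)
{
    abnormal_count_t c = {0, 0, 0};
    size_t i;
    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++) {
        switch (fpclassify(buf[i])) {
        case FP_NAN:       c.nans++;       break;
        case FP_INFINITE:  c.infinities++; break;
        case FP_SUBNORMAL: c.denormals++;  break;
        default:           break;          /* zero or normal */
        }
    }
    return c;
}

/* Returns 1 if the buffer holds no abnormal value, 0 otherwise. */
int safe_to_send(const double *buf, size_t n)
{
    abnormal_count_t c = scan_for_abnormals(buf, n);
    return c.nans == 0 && c.infinities == 0 && c.denormals == 0;
}
```

A wrapper around the send primitive could call safe_to_send() on the outgoing buffer and warn the user or refuse the send when it returns 0. A test in the spirit of the suite would do the opposite: plant an abnormal value in the buffer, send it, and report whether the receiver got it without any warning.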
Results

Coding
The suite was coded in portable C. As 64-bit integers have to be declared in different ways to compile correctly on the C compilers hosted by different high-performance computers, certain #defines need to be altered to port the suite. On message-passing machines, the suite uses and links with the message-passing library used for application programming. For example, on the IBM SP2, it uses IBM's MPI [14]. The suite has a large number of routines to set and test bit patterns for different floating point data types, both normal and abnormal. See, for example, figures 1, 2 and 3 for the representations of infinities, signaling NaNs and quiet NaNs, respectively. All optimization switches are turned off when compiling the validation suite code.
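A minimal sketch of such set/test routines, with the 64-bit integer type isolated behind a #define, might look as follows. It is not the code of the suite. The macro SUITE_U64 and all function names are hypothetical, and the sketch assumes a 64-bit IEEE 754 double in which a set most significant fraction bit marks a quiet NaN, which is the common convention but not a universal one.

```c
#include <assert.h>
#include <string.h>

/* Porting knob: alter this #define to the 64-bit unsigned integer type
 * of the host compiler. */
#ifndef SUITE_U64
#define SUITE_U64 unsigned long long
#endif
typedef SUITE_U64 u64;

#define EXP_MASK  ((u64)0x7FF << 52)      /* 11 exponent bits, all ones */
#define FRAC_MASK (((u64)1 << 52) - 1)    /* 52 fraction bits */
#define QUIET_BIT ((u64)1 << 51)          /* top fraction bit */

/* Move bit patterns in and out of a double without converting values. */
u64 to_bits(double x)
{
    u64 b;
    assert(sizeof b == sizeof x);
    memcpy(&b, &x, sizeof b);
    return b;
}

double from_bits(u64 b)
{
    double x;
    memcpy(&x, &b, sizeof x);
    return x;
}

/* Setters: bit patterns of the abnormal values. */
u64 pos_infinity_bits(void)  { return EXP_MASK; }
u64 signaling_nan_bits(void) { return EXP_MASK | 1; }
u64 quiet_nan_bits(void)     { return EXP_MASK | QUIET_BIT; }

/* Testers: classify a bit pattern by its exponent and fraction fields. */
int is_infinity_bits(u64 b)
{
    return (b & EXP_MASK) == EXP_MASK && (b & FRAC_MASK) == 0;
}

int is_qnan_bits(u64 b)
{
    return (b & EXP_MASK) == EXP_MASK && (b & QUIET_BIT) != 0;
}

int is_snan_bits(u64 b)
{
    return (b & EXP_MASK) == EXP_MASK && (b & FRAC_MASK) != 0
        && (b & QUIET_BIT) == 0;
}
```

The testers work on bit patterns rather than on double values, so that merely passing a signaling NaN to a routine cannot trap or quietly convert it. A caller obtains the pattern of a computed value with to_bits() and plants a pattern in a variable with from_bits().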
Architecture
We describe three high-performance architectures on which we tested the suite. Table 1 shows some of the important results of the validation runs. There exist implementation deficiencies in all the systems, despite the hardware being capable of supporting a proper implementation.
DEC
Prototyping
In keeping with the third item in the list of steps to tackle improper floating point arithmetic implementation that we mentioned in the introduction, as well as to develop our own ways of researching these issues, we developed a runtime-library/micro-kernel pair for our own distributed-memory architecture built from 16 IBM PC motherboards, each with an Intel 80386/387 and 2MB of main memory [15]. As our parallel computer is not a high-performance machine, I did not list it in Table 1. However, it handles all the abnormal patterns properly (i.e., the way I think they should be handled). It receives all the traps/signals from the 80387s and handles them by warning the user and/or terminating the tasks. Abnormal values cannot go from one VA space to another without causing a trap in the interprocess/interprocessor communication interface. The rounding modes can be set by the user. Denormals are reported and handled as per the user's directives. Optionally, the user can write and install her own trap handlers. The RTL initializes all uninitialized floating point variables to SNaNs, and the micro-kernel informs the user whenever such an operand enters the 80387 at any node. The QNaNs keep a record of the (nodeid, pid) pair where the NaN was first produced. See [20]
Conclusion
It is time we stop going the teraflop way just for the TFLOPS' sake, and pay a little bit of long-due attention to the correctness of floating point operations. It may mean a slight reduction in execution speed and having to undertake a significant re-design of system software. It will also mean having to design, develop and run validation suites appropriate to the many different levels of system software in modern high-performance computers. But the result is certainly worth the effort. In fact, any development in this field is badly awaited and most welcome. If I, coding alone, can do it, the big names in high-performance computing can certainly do it too. You may or may not agree with my way of solving the problem or even my way of looking at the problem, but surely you will agree that there is a problem. It is time, therefore, that we all get our act together and make sure floating point arithmetic is correctly implemented on high-performance computers.
