Floating point

Waldek Hebisch hebisch at math.uni.wroc.pl
Tue Sep 1 14:06:03 CEST 2020

On Mon, Aug 31, 2020 at 08:06:45PM -0700, scott andrew franco wrote:
> Waldek,
> Sure, 48 bits vs 64 bits. Why didn't it truncate the mantissa on conversion to float, ie, b := maxint?
> I would have expected something like zeros on the right side.

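On x86_64 gpc uses the SSE unit for floating point, so real has
the 64-bit IEEE format, with 53 significant bits.  maxint is
2^63 - 1, which needs 63 significant bits, so it cannot be
represented exactly.  IEEE says that operations should round the
result, not truncate it, and that is what happens: the closest
representable number is 2^63, that is maxint + 1.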
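
The rounding described above can be reproduced outside gpc.  Below is
a minimal sketch in Python, whose float is also 64-bit IEEE; it is
not gpc code.  The names MAXINT, as_real and truncated are mine, and
MAXINT = 2^63 - 1 is assumed from the "maxint + 1" remark above.

```python
import math

# Assumption: maxint of a 64-bit gpc integer, i.e. 2^63 - 1.
MAXINT = 2**63 - 1

# Integer -> binary64 conversion rounds to the nearest representable value.
as_real = float(MAXINT)

# What truncation would give instead: keep the top 53 bits and zero
# the low 10 bits ("zeros on the right side").
truncated = float((MAXINT >> 10) << 10)

print(as_real == 2.0**63)        # rounded result is exactly 2^63
print(int(as_real) - MAXINT)     # distance from maxint after rounding
print(MAXINT - int(truncated))   # distance from maxint after truncation
print(math.frexp(as_real))       # mantissa and exponent of the result
```

Rounding lands 1 away from maxint, while truncation would land 1023
away, so the rounded value is the more accurate of the two.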

Concerning testing floating point, the standard requires almost
nothing.  So it boils down to quality of implementation,
and I would argue that the 64-bit IEEE format + optimizations
in gpc give a high quality implementation.  But IEEE rules
(like other floating point rules) may produce results
which do not agree with naive intuition.  In some
cases optimizations lead to results which are slightly
different from literally performing the operations according
to IEEE rules -- I consider this normal (otherwise it
would be almost impossible to optimize floating point).
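
To illustrate why a transformed expression can give a slightly
different result, here is a small Python sketch (64-bit IEEE floats,
not gpc code).  Reassociating a sum is only one example of a
transformation an optimizer might apply; I do not claim this is what
gpc does.  The names big, left_to_right and reassociated are mine.

```python
# 2^53 is the point where binary64 can no longer represent every integer.
big = 2.0**53

# Literal left-to-right evaluation: the big terms cancel first.
left_to_right = (-big + big) + 1.0

# Reassociated evaluation: big + 1.0 is not representable and rounds
# back to big, so the 1.0 is lost before the cancellation.
reassociated = -big + (big + 1.0)

print(left_to_right, reassociated)
print(left_to_right == reassociated)
```

Both orders follow IEEE rounding at every step; they differ only in
which intermediate result gets rounded.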

Back to testing: in the Pascal spirit you should test
whether the relative error of operations exceeds an
assumed maximal error.  Similarly for range.  The standard
does not give you _any_ constraints on range or
accuracy.  IIUC an implementation with a single floating
point number (that is, 0) is legal (but useless).
I would say that 20 significant bits and 8 exponent
bits is probably a reasonable lowest limit of accuracy.
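
A test in that spirit could look like the Python sketch below.  It
is not a Pascal conformance test.  The bound of 2^-20 comes from the
20 significant bits suggested above; the range probe of 2^126 and
2^-126 is my assumption of what 8 exponent bits would give (as in
IEEE single); all names are hypothetical.

```python
import math
from fractions import Fraction

ASSUMED_BITS = 20                              # assumed minimal accuracy
MAX_REL_ERROR = Fraction(1, 2**ASSUMED_BITS)   # assumed maximal relative error

def relative_error(computed, exact):
    """Exact relative error of a float result against a rational reference."""
    return abs(Fraction(computed) - exact) / abs(exact)

def within_bound(computed, exact, bound=MAX_REL_ERROR):
    """True if the computed result is within the assumed error bound."""
    return relative_error(computed, exact) <= bound

# Accuracy: compare each operation with an exact rational reference.
quotient_ok = within_bound(1.0 / 3.0, Fraction(1, 3))
sum_ok = within_bound(0.1 + 0.2, Fraction(0.1) + Fraction(0.2))
bad_value_ok = within_bound(0.3334, Fraction(1, 3))   # deliberately too coarse

# Range: assumed smallest acceptable range, roughly 8 exponent bits.
range_ok = math.isfinite(2.0**126) and 2.0**-126 > 0.0

print(quotient_ok, sum_ok, bad_value_ok, range_ok)
```

The test checks against a tolerance, not against exact bit patterns,
so it does not depend on the implementation being IEEE.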

                              Waldek Hebisch
