Floating Point Addition

thalamus

(Posted 2008) [#1]

I've explored a few threads on here regarding floating point numbers and their accuracy, but I'm still nonplussed as to how or why these problems occur.

Basically, I'm writing a program to strip and collate information from an XML file. I need to be able to add up liquid volume amounts to 3 decimal places, and currency amounts to 2 decimal places. Accuracy is paramount.

Is there an accurate way of doing this? I did consider converting to strings and then adding digits manually (the old-fashioned way I used to add using 6502 assembly!) but it seems hideously draconian.

Any suggestions?

TomToad

(Posted 2008) [#2]

Use fixed point instead. Since the liquid is always to 3 decimal places, you would represent 1 liter as 1000 and 3.145 liters as 3145. When you go to print out the totals, just divide by 1000 first.
The only thing to keep in mind is that when multiplying fixed point numbers, you need to divide the result by 1000, because 1.5 x 1.5 = 2.25 but 1500 x 1500 = 2250000. Dividing by 1000 gives 2250, which is 2.250.
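
A minimal sketch of that scheme, with volumes held as whole thousandths of a liter in a Long. The names SCALE, FormatFixed and MulFixed are made up for illustration, and the sketch assumes totals (and products before rescaling) fit in a Long.

```blitzmax
SuperStrict

Const SCALE:Long = 1000	' 3 decimal places: 1 liter = 1000 units (illustrative name)

' Hypothetical helper: turn thousandths back into text such as "4.200"
Function FormatFixed:String(v:Long)
	Local sign:String = ""
	If v < 0
		sign = "-"
		v = -v
	EndIf
	Local frac:String = String.FromLong(v Mod SCALE)
	While frac.length < 3
		frac = "0" + frac	' keep leading zeros, e.g. 55 -> "055"
	Wend
	Return sign + String.FromLong(v / SCALE) + "." + frac
End Function

' Hypothetical helper: multiply two fixed-point values and rescale the product
' (the division truncates anything past the third decimal place)
Function MulFixed:Long(a:Long, b:Long)
	Return a * b / SCALE
End Function

Local total:Long = 0
total :+ 3145	' 3.145 liters
total :+ 1000	' 1.000 liter
total :+ 55	' 0.055 liters
Print FormatFixed(total)	' 4.200
Print FormatFixed(MulFixed(1500, 1500))	' 2.250
```

Addition and subtraction need no rescaling at all; only multiplication and division do.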

Canardian

(Posted 2008) [#3]

Most problems with Float and Double occur because people don't specify Float or Double after each operand.

If you want 100% accurate calculation, you must always specify Float or Double. I would use Double only, since it's faster than Float in BlitzMax (and much faster in C++).

A simple accurate calculation would look like this:
SuperStrict
Local a:Double = 0.01:Double
Local b:Double = 0.04:Double
Local c:Double = a+b
Print a
Print b
Print c

The output will look like this (which is accurate):
0.010000000000000000
0.040000000000000001
0.050000000000000003

Using Floats the result would look like this (not so accurate, and slower to compute):
0.00999999978
0.0399999991
0.0499999970

AlexO

(Posted 2008) [#4]

If you want 100% accurate calculation, you must always specify Float or Double. I would use Double only, since it's faster than Float in BlitzMax (and much faster in C++).

I'd like to see some evidence of this. I had always believed floats were faster than doubles. Doubles are nothing more than 64-bit floating point numbers. And doesn't BlitzMax default to a float when you type something like "3.5"? I've never heard of 64-bit being faster than 32-bit numbers on a 32-bit architecture.

Canardian

(Posted 2008) [#5]

I've never heard of 64bit being faster than 32bit numbers on a 32bit architecture.

That's correct. But how many systems are still 32-bit? Even if the OS is 32-bit, nearly all CPUs/FPUs are 64-bit, and they handle 64-bit floating point numbers internally. That's why there is no conversion or rounding when using 64-bit floating point numbers, aka Doubles, in programming languages.

On a non-Hyperthreading, non-Multicore CPU, which is also internally 32-bit, Floats should be faster than Doubles.

I will make further tests on a 32-bit CPU (486 CPU probably), as well as on a 64-bit OS.

Vilu

(Posted 2008) [#6]

I ran some benchmarks of my own on two 32-bit XP Pro rigs, one with a single-core 64-bit Athlon processor and the other with an older single-core 32-bit Pentium-M processor.

Here are the test programs. Each makes 50 million iterations with three different operations per pass. They are identical except that one uses floats and the other doubles.

Floats:

SuperStrict

Local iterations:Long = 50000000
Local array:Float[iterations]

' populate array (valid indices are 0..iterations-1; index 0 is never read)
For Local i:Long = 1 To iterations - 1
	array[i] = Float(iterations) / Float(i)
Next

' run the test
Local timer:Long = Millisecs()
For Local i:Long = 1 To iterations - 1
	array[i] = array[i] / array[iterations-i]	' division
	array[i] = array[i] * array[iterations-i]	' multiplication
	array[i] = array[i] + array[iterations-i]	' addition
Next
Local result:Long = Millisecs() - timer
Print "Float operations: " + result

Doubles:

SuperStrict

Local iterations:Long = 50000000
Local array:Double[iterations]

' populate array (valid indices are 0..iterations-1; index 0 is never read)
For Local i:Long = 1 To iterations - 1
	array[i] = Double(iterations) / Double(i)
Next

' run the test
Local timer:Long = Millisecs()
For Local i:Long = 1 To iterations - 1
	array[i] = array[i] / array[iterations-i]	' division
	array[i] = array[i] * array[iterations-i]	' multiplication
	array[i] = array[i] + array[iterations-i]	' addition
Next
Local result:Long = Millisecs() - timer
Print "Double operations: " + result

I ran 5 tests of both on each machine and noted the average times. The compilation was done in Release mode.

Float operations on the generally faster 64-bit processor took 2894 ms on average, and Double operations 2957 ms. So the floats were marginally faster (some 2% or so).

Float operations on the slower 32-bit processor averaged 4992 ms, while the double operations averaged 4852 ms. Surprisingly, while the difference is still negligible, the doubles were some 3% faster than floats.

So, according to this quick and dirty test, there's not much of a difference between them performance-wise when the OS is 32-bit.

Anyone running a 64-bit OS care to run these tests for comparison?

ImaginaryHuman

(Posted 2008) [#7]

I have not found Doubles to be faster than Floats at all, but I'm on a Mac and it may be a different scenario. Double and Long are MUCH slower than Int or Float, even on a very recent Intel Core2 Duo.

As to the float accuracy, I agree about using fixed point to represent accurately. I am using fixed point to represent High Dynamic Range images, because I know the distribution of accuracy will be equal across the full range of numbers.

Here's the thing. In order to represent any number with perfectly accurate precision, you need INFINITE decimal places.

Take for example the fraction 1/3. It should be .3333333 recurring infinitely. If you want perfect accuracy you have to have infinite decimal places, because as soon as you cut the number off you have lost some accuracy, albeit usually quite a small amount.

Imagine then what happens to some numbers which really need hundreds or millions or trillions of decimal places to be accurate, but only get accuracy to say 10 decimal places? That's quite a lot of truncation. 10.33334 is not the same as 10.33337

This is the problem, then: there are not enough digits in the computer's representation of the number to accurately represent all numbers. `All numbers` means infinite varieties of numbers, and representing infinite numbers needs infinite decimal places.

Floats only have 32 bits of data to represent them. It just so happens, due to the way it represents numbers, that SOME numbers are representable perfectly. After all if 10.5 is all you need to represent and the Float format is able to exactly specify 10.5 and not 10.4999993454 or something, then you are going to get a perfect representation of your number. But if the float format is not able to represent exactly the number you want, due to the number of decimal places it needs and also due to whether your number falls onto one of the `boundaries` of float's resolution, then you're going to get an inaccuracy.
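
A small sketch of that difference, assuming nothing beyond core BlitzMax. The values go through array elements so that each is genuinely stored as a 32-bit Float, and widening to Double then shows what was actually kept. The variable name stored is just for illustration.

```blitzmax
SuperStrict

' Array elements are used so each value is really stored as a 32-bit Float
Local stored:Float[2]
stored[0] = 10.5	' 10.5 = 21 / 2, which binary can hold exactly
stored[1] = 0.1	' 0.1 has no finite binary expansion

' Widening to Double shows what the Float actually holds
Print Double(stored[0])
Print Double(stored[1])

If Double(stored[0]) = 10.5:Double Print "10.5 is stored exactly"
If Double(stored[1]) <> 0.1:Double Print "0.1 is stored as a nearby value"
```

The second printed value should differ from 0.1 after the first several digits, while the first is 10.5 followed only by zeros.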

Just as your screen has a `resolution` so does the accuracy of number representation in Floats and Doubles. If you tried to draw a pixel at coordinate 0.5,0.5 you obviously aren't going to get an accurate plotting of the pixel because it should only really be fitting into half of the pixel. But if you draw at 1,1 you will get perfect representation because the resolution at which you are drawing is within the resolution of the screen. Similarly if you try to be more accurate than the resolution of a Float or a Double you are not going to be able to store your numbers with perfect precision. Some numbers if they happen to fall on a `pixel boundary`/number boundary will be perfectly accurate, but other numbers will fall between and won't be representable with full accuracy.

To be more accurate with your numbers you can use Doubles instead of Floats but even Doubles have the same issue. You could have a number with 1 million decimal places but it won't perfectly represent some numbers, like 1/3.

The resolution of number representation and how it spreads out over the range of representable numbers is different for floats than for integers. The bigger a floating point number gets, the more digits you need to represent the integer portion of the number, and the fewer digits you are going to have left to represent the decimal portion. So the number becomes less and less `precise`, in a way, the larger it gets. The smaller the number gets, the more bits you have left for representing the fractional part, so it can be more accurate. In other words the `floating point` makes the best use it can of the space available within 32 bits (or 64), depending on the size of the value it is trying to represent. It can shift the decimal point depending on number size. The tradeoff is that you lose accuracy on larger numbers and you still have the issue of there only being so many digits to use. You just aren't going to represent 1/3 accurately, whether it's 100 million and one third or 0.333333 recurring, no matter how many digits you use. But maybe it will represent 1.0 perfectly.
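
To make the loss of precision at large magnitudes concrete, here is a hedged sketch. It relies on a 32-bit Float having 24 significant bits, so 2^24 + 1 cannot be stored. Array elements are used to force real 32-bit and 64-bit storage rather than a wider FPU register.

```blitzmax
SuperStrict

' Array elements force 32-bit storage rather than a wider FPU register
Local f:Float[2]
f[0] = 16777216	' 2^24
f[1] = f[0] + 1	' 16777217 would need 25 significant bits

If f[1] = f[0]
	Print "Float: 16777216 + 1 is stored as 16777216"
EndIf

' A Double has 53 significant bits, so the same sum is held exactly
Local d:Double[2]
d[0] = 16777216
d[1] = d[0] + 1

If d[1] <> d[0]
	Print "Double: 16777216 + 1 is stored as 16777217"
EndIf
```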

Integers on the other hand, or fixed point, allow you to ensure that you will get exactly the number of decimal places that you want, all the time, no matter what size the number is. Obviously this wastes bits when the amount of space for the integer or decimal part is greater than the amount of space you need to store that number. But that's the tradeoff. You get the exact same accuracy with larger numbers as with smaller ones, but because you can't `float` the decimal point you can't adjust how many bits are used. So you can't represent numbers as large. Floats for example (if I recall) can represent numbers up to about 2^128 and Doubles up to about 2^1024. Obviously an Integer does not have 128 bits and a Long does not have 1024 bits. That is one of the disadvantages of using integers compared to floating point, but then you have the advantage of consistent representation.

For cases where you know that you will be representing numbers to a given number of decimal places, like to 3 decimal places, fixed point integer math gives you exactly the precision you want. Since you don't care about what the fourth decimal place is - except maybe for rounding purposes, and because you are okay with rounding past 3 digits, you will get `perfect numbers` for all numbers you want to represent. But it's when you are not okay with any rounding at all, or any imprecision at all, to any number of decimal places, that NEITHER integers nor floating point numbers will be able to represent everything perfectly. Some numbers will work great, some won't, it depends on the resolution and the available bits.
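
Since the original values arrive as text from an XML file, they can go straight into fixed point without ever passing through a Float. The following is only a sketch under assumptions: ParseFixed is a hypothetical helper name, the input is assumed to be a plain decimal such as "3.145" or "-0.05" (no exponent, no thousands separators), and digits beyond the requested places are truncated rather than rounded.

```blitzmax
SuperStrict

' Hypothetical helper: convert decimal text to a Long scaled by 10^places
Function ParseFixed:Long(text:String, places:Int)
	text = text.Trim()
	Local negative:Int = False
	If text.StartsWith("-")
		negative = True
		text = text[1..]
	EndIf

	' Split into the digits before and after the decimal point
	Local whole:String = text
	Local frac:String = ""
	Local dot:Int = text.Find(".")
	If dot >= 0
		whole = text[..dot]
		frac = text[dot + 1..]
	EndIf

	' Pad with zeros, then cut to exactly `places` digits (extra digits are dropped)
	While frac.length < places
		frac :+ "0"
	Wend
	frac = frac[..places]

	Local scale:Long = 1
	For Local k:Int = 1 To places
		scale :* 10
	Next

	Local value:Long = whole.ToLong() * scale + frac.ToLong()
	If negative Then value = -value
	Return value
End Function

Print ParseFixed("3.145", 3)	' 3145
Print ParseFixed("12.5", 2)	' 1250
Print ParseFixed("-0.05", 2)	' -5
```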

You will find that some numbers that are easily representable as integers are impossible to represent perfectly as Floats. Small whole numbers are not among them: 4.0 is stored exactly in both a Float and a Double. The trouble starts with large integers. A Float has 24 bits of precision, so above 2^24 (16777216) it can no longer hold every integer, and 16777217 ends up stored as 16777216. You can improve that by switching to a Double, which holds every integer exactly up to 2^53. But as said already, neither floats nor doubles nor ints nor longs will be able to give you infinite accuracy on all numbers. It's only when you have a subset of numbers or you're okay with omitting some numbers that you can get perfect representation.

AlexO

(Posted 2008) [#8]

On a non-Hyperthreading, non-Multicore CPU, which is also internally 32-bit, Floats should be faster than Doubles.

In regards to a casual market, I'd still venture to guess a good number of systems are still non-HT, non-multicore. So yes, most modern CPUs you buy today are multi-core, but I tend to think of these generalizations in terms of what the 'target' audience has, whichever audience that may be (casual games, hardcore games, business apps, etc).

But how many systems are still 32-bit?

Good question, it would be a nice statistic to have. Does Steam's user survey pick this up with all their other stats?

Czar Flavius

(Posted 2008) [#9]

But how many systems are still 32-bit?

A significant amount?

Russell

(Posted 2008) [#10]

Too bad 'Print' doesn't have formatting capabilities like C's printf() to limit the output to x-number of digits.

Russell

Azathoth

(Posted 2008) [#11]

Doesn't matter if you're using a 32-bit system or 32-bit floats/64-bit doubles, the Intel FPU always uses 80-bit numbers.

Vilu

(Posted 2008) [#12]

Doesn't matter if you're using a 32-bit system or 32-bit floats/64-bit doubles, the Intel FPU always uses 80-bit numbers.

Well, I guess that explains my benchmark results. :)

Czar Flavius

(Posted 2008) [#13]

3 decimal places, and currency amounts to 2 decimal places

Back to the original problem, have you tried some testing code to see how accurate/inaccurate using floats/doubles is in practice? The main problem with floats is their unsightly representation to the user. Unless you are doing some darn accurate experiments (more than 3 decimal places we're talking here ;) ) then floats should be negligibly inaccurate.
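
One possible shape for such a test, offered as a sketch rather than a definitive benchmark: add 0.001 one hundred thousand times in a Float, a Double and a Long counting thousandths, then compare the totals against the exact answer of 100. The variable names are illustrative, and how far the floating point totals drift may vary by platform and build.

```blitzmax
SuperStrict

Local fsum:Float = 0
Local dsum:Double = 0
Local isum:Long = 0	' fixed point: counts whole thousandths

For Local i:Int = 1 To 100000
	fsum :+ 0.001
	dsum :+ 0.001:Double
	isum :+ 1	' 0.001 is exactly one unit
Next

' The exact total is 100.000
Print "Float:  " + fsum
Print "Double: " + dsum
Print "Fixed:  " + isum + " thousandths"
```

The fixed point total is exactly 100000 thousandths by construction. The Float and Double lines show how much rounding error accumulated over the same additions.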

Canardian

(Posted 2008) [#14]

I wrote a Decimals() function which returns a double or float as string with the wanted number of decimals.
This function can be used with the Print statement to format the output, or even to round a double or float number to the wanted number of decimals (by typecasting Double(Decimals(a,i))):

SuperStrict



Local a:Double=12345.9129546789123456789:Double
For Local i:Int=0 To 6
	Print Decimals(a,i)
Next
End



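Function Decimals:String(a:Double,n:Int)
	If n<0 n=0
	' Scale by 10^n and round half away from zero, so that leading zeros
	' (1.05 -> "1.05") and carries (1.999 -> "2.00") come out right
	Local scale:Long=1
	For Local k:Int=1 To n
		scale:*10
	Next
	Local neg:Int=a<0
	Local r:Long=Long(Abs(a)*scale+0.5)
	Local s:String=String.FromLong(r/scale)
	If neg And r<>0 s="-"+s
	If n=0 Return s
	' Left-pad the fractional digits to exactly n characters
	Local f:String=String.FromLong(r Mod scale)
	While f.length<n
		f="0"+f
	Wend
	Return s+"."+f
EndFunction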