Faster than Inverse Square Root

BlitzMax Forums/BlitzMax Programming/Faster than Inverse Square Root

col(Posted 2009) [#1]
Hi all.
This is such a great community that I thought I'd return something.
I'm sure every programmer has come across the following equation

1.0 / Sqr ( x )

where x is a FLOAT value.

To cut a long story short, there's some mystery code in Quake3 that performs an inverse square root. It's a hack that takes advantage of the bit layout of 32-bit float values and runs around 4x faster than the standard 1.0/Sqr(x) code. The original is in C so it'll probably be faster than Blitz, but only by a fraction, I thought :)

So naturally I thought I'd give it a go in BlitzMax :D

Indeed it runs faster. I'm only using a laptop and I'm getting up to approximately a 370% speed increase using the 'inline' method.


Local x:Float = 0.9 'Value to take 1.0/Sqr() of

Local xhalf:Float = 0.5 * x
Local f2i:Int Ptr = Int Ptr(Varptr x) 'Alias the bits of x as an Int
Local i:Int = f2i[0]
f2i[0] = $5f3759df - (i Shr 1)

x = x * (1.5 - xhalf * x * x)
x = x * (1.5 - xhalf * x * x) 'Use for greater accuracy
'x = x * (1.5 - xhalf * x * x) 'Use for even greater accuracy


The value of x will indeed work out to (approximately) 1.0/Sqr(x).

Some speed testing code. I'm not sure what counts as fair grounds for speed testing due to the overhead of function calls etc. I'm sure most people would call the equation inside a function, so any test should also time the original 1.0/Sqr(x) inside a function for completeness.



There are some interesting discussions and links to the theory behind it here - http://www.codemaestro.com/reviews/9

Just thought I'd give back to the community.
Thanks all


Pete Rigz(Posted 2009) [#2]
Interesting, but these are the results on my work PC, AMD Athlon 64 3200+

Inline Method takes 10532 Millisecs
Function Call By Variable Method takes 16072 Millisecs
Function Call By Reference Method takes 18653 Millisecs
Original 1.0/Sqr( x ) takes 9529 Millisecs()

I get the feeling that the FPU makes a lot of these types of hacks obsolete these days, but it's always good to know for the slower PCs :)


Kistjes(Posted 2009) [#3]
There is a huge difference between the release and debug versions:

Release version:
Inline Method takes 917 Millisecs
Function Call By Variable Method takes 1322 Millisecs
Function Call By Reference Method takes 1216 Millisecs
Original 1.0/Sqr( x ) takes 4379 Millisecs()

Debug version:
Inline Method takes 6541 Millisecs
Function Call By Variable Method takes 9872 Millisecs
Function Call By Reference Method takes 11007 Millisecs
Original 1.0/Sqr( x ) takes 8558 Millisecs()

Dell XPS1710, WinXP SP3


Pete Rigz(Posted 2009) [#4]
Ahh yeah, forgot about that :) Here are the results in release mode:

Inline Method takes 1251 Millisecs
Function Call By Variable Method takes 3121 Millisecs
Function Call By Reference Method takes 2229 Millisecs
Original 1.0/Sqr( x ) takes 2720 Millisecs()


kog(Posted 2009) [#5]
Dual Core 3 GHz

Release:

Inline Method takes 695 Millisecs
Function Call By Variable Method takes 1017 Millisecs
Function Call By Reference Method takes 927 Millisecs
Original 1.0/Sqr( x ) takes 1623 Millisecs()


ImaginaryHuman(Posted 2009) [#6]
Sounds like you still need a lookup table or something with all those calculations.


Shagwana(Posted 2009) [#7]
I just love these little tidbits for doing funky stuff :)


col(Posted 2009) [#8]
Just to note - you don't really need the second or third multiply line at the end of the code. In some parts of Q3 they only use one multiply at the end. This results in even more speed, although there is some loss of accuracy. It depends on what you need it for.

The main reason I was trying this out was optimisation. The processors of today are so fast, which only means we make programs that are more demanding (anyone working on realtime ray tracing? :D ), but it just goes to prove that even in today's super fast CPU and GPU world there are still ways to squeeze even more speed out of your programs. It's also these kinds of things that make your work stand out from the rest. Obviously on its own this trick is limited to just itself, but imagine using other optimising tricks throughout the whole of your code.

I'm sure the good old days are here to stay for games programmers ;) hehe


col(Posted 2009) [#9]
Thank you all for trying the code :P
So it is a lot faster :)
That's cool, and it's also over 4x faster in some setups :P

Oh, and thanks Kistjes, I meant to mention that Debug and Release modes are vastly different. Thank you for pointing that out. And WOW, you have more than a 400% speed increase :P

Other tricks I'm looking into involve the CPU cache and pipeline. I wrote some maths code to test its efficiency, similar to the examples above. Then I added one single line and the speed of execution dropped by half - it was taking twice as long. I read up that when consecutive calculations all work on the same few variables, each one has to wait for the result of the previous one before it can start. In effect, the CPU was stalling, waiting for calculations to finish.

Further reading led me to try writing the code to work on different variables and spread the workload across several variables where possible, thereby letting the CPU finish its work on the variables in its pipeline and, in the meantime, start working on other variables so there is no waiting. You can gain extra speed simply by not working on the same variable line after line, but by staggering the variable involvement across several lines of code.

Unfortunately I don't have any working examples. I also think it may be a bit pointless posting code, as CPUs and their cache sizes can differ drastically, and whereas I've been working purely on my laptop, there will be inconsistencies across different machines.

Also, I know BMax is compiled and I imagine a lot of optimisation is done where possible when compiling in Release mode. So it's not like we have 'total' control of when and where we can apply these tricks, but it's interesting nonetheless.


Shagwana(Posted 2009) [#10]
If you're interested I have some other approximation code in my code archive posts here, try and get your head round those bad boys :)


col(Posted 2009) [#11]
Hi Shagwana.
Just took a look :P

Wow,
The '2d distance calculation and approximation' has some serious bit shifting going on :P hehe

Is that your own work? :P It's good either way.

I also read that bit-shifting on today's processors is so fast in terms of CPU cycles that it's almost free :D


col(Posted 2009) [#12]
Did you read the part about how the Inv Sqr solution above works in such a way that it bit shifts the floating point variable? :D Apparently the method is years old - so old that none of the maths gurus can remember its origin :P (read the link, where there are more links hehe)


Nate the Great(Posted 2009) [#13]
works in such a way that it bit shifts the float point variable


Oh, that makes more sense. I kinda see how it works now, but it's still not completely clear.


therevills(Posted 2009) [#14]
Pentium D 3.2Ghz:

Debug On
Inline Method takes 9796 Millisecs
Function Call By Variable Method takes 14474 Millisecs
Function Call By Reference Method takes 15476 Millisecs
Original 1.0/Sqr( x ) takes 8669 Millisecs()

Debug Off
Inline Method takes 1369 Millisecs
Function Call By Variable Method takes 2288 Millisecs
Function Call By Reference Method takes 2068 Millisecs
Original 1.0/Sqr( x ) takes 3533 Millisecs()


xlsior(Posted 2009) [#15]
Standard Blitzmax 1.34 on a Core 2 Duo E6600 @2.4GHz, debug off:


Inline Method takes 975 Millisecs
Function Call By Variable Method takes 1306 Millisecs
Function Call By Reference Method takes 1260 Millisecs
Original 1.0/Sqr( x ) takes 3737 Millisecs()



However, you can still do better:
Identical benchmark program, in a custom BlitzMax 1.34 on the same machine (compiled for speed with Brucey's BMK NG, instead of using the default BlitzMax settings, which compile for size):

Inline Method takes 881 Millisecs
Function Call By Variable Method takes 1306 Millisecs
Function Call By Reference Method takes 1197 Millisecs
Original 1.0/Sqr( x ) takes 3689 Millisecs()



That one gives an additional ~10% speed increase using the inline method...


col(Posted 2009) [#16]
@Nate the Great

How the code works

x is the Float value that you want the inverse square root of.
f2i is an Int Ptr aimed at the memory address of x, so the float's raw bits can be read and written as if they were an integer. I used 'f2i' meaning Float To Integer.
i is a copy of those raw bits - not the integer part of the value, but the bit pattern of x reinterpreted as an Int.

Then the bits of x are set to $5f3759df minus i shifted one bit to the right. A float's bit pattern, read as an integer, behaves roughly like a scaled log2 of the value, so halving it approximates a square root, and subtracting it from the magic constant negates the logarithm (giving the inverse) while fixing up the exponent bias.

Remember that it's manipulating the value of x through its memory address, so all the integer calculations affect the value of x.

Then the multiplies near the end are Newton-Raphson iterations that 'tune in' the approximation - the more of them you use, the closer to the correct result you get. After one iteration the value is damn close, after two even closer, and after three it's the exact number. Well, exact enough for a value that's stored in 32 bits.

The link, and the links that are there, explain it way better than I ever could :P


@xlsior
That's great, thanks for the heads up. I'm out of time this morning (tut, work again!!), but I'll take a look ASAP. It won't be this evening, I'm going to a concert :D


xlsior(Posted 2009) [#17]
Col: If you are looking at the BMK NG, to optimize for speed you'll need both the modified bmk.exe, and a custom.bmk in your /bin folder with the following contents:


addccopt optimization -O3



then recompile all modules.

Speed boost in non-graphical stuff can be anywhere from 0% to 25%, depending on the program in my benchmarks... Executable size increases slightly, but typically only ~10% or so.

I just duplicated my entire BlitzMax folder to a separate folder and recompiled with the tweak. That way I have a choice of which version to use depending on what's more important, speed or file size.