Fast InvSqrt in Max

BlitzMax Forums/BlitzMax Programming/Fast InvSqrt in Max

Knotz(Posted 2006) [#1]
After reading this article:

http://www.beyond3d.com/articles/fastinvsqrt/

And having some time to spare while waiting on the Sint i made a BM version of the original algorithm.

Original c version:
float math_InvSqrt (float x){
   float xhalf = 0.5f*x;
   int i = *(int*)&x;
   i = 0x5f3759df - (i >> 1);
   x = *(float*)&i;
   x = x*(1.5f - xhalf*x*x);
   return x;
}

float math_Sqrt (float x)
{
   float xhalf = 0.5f*x;
   int i = *(int*)&x;
   i = (0xbe6f0000 - i)>>1;
   x = *(float*)&i;
   x = x*(1.5f - xhalf*x*x);
   return x;
}


and my Blitzmax version:

Import "InvSqrt.c"

Extern 
	
	Function InvSqrt:Float(x:Float) = "math_InvSqrt"
	Function Sqrt:Float(x:Float) 	= "math_Sqrt"
	
End Extern

Function InvSqrtKnotz:Float(x:Float)

	Local xhalf:Float = 0.5*x
	Local i:Int = (Int Ptr(Varptr(x)))[0]
	i = $5f3759df - (i Shr 1)
	x = (Float Ptr(Varptr(i)))[0]
	x = x*(1.5 - xhalf*x*x)
	Return x

End Function

Function InvSqrtAntMan:Float(x:Float)

	Local i = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	Return  ox *(1.5 - ((0.5*x)*ox*ox))
	
End Function

Function InvSqrtFredborg:Float(x:Float)

	Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Return  (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0])))
	
End Function

Function InvSqrtSOS:Float(x:Float)

    Local xhalf:Float = 0.5*x
	Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	Return ox *(1.5 - (xhalf*ox*ox))
	
End Function

Function InvSqrtAza:Float(x:Float)

	Local xhalf:Float = 0.5*x
	Local ip:Int Ptr=Int Ptr(Varptr(x))
	ip[0]=$5f3759df-(ip[0] Shr 1)
	Return x *(1.5 - (xhalf*x*x))

End Function

Function InvSqrtPlain:Float(x:Float)

	Return 1.0/Sqr(x)
	
End Function

Function SqrtKnotz:Float(x:Float)

	Local xhalf:Float = 0.5*x
	Local i:Int = (Int Ptr(Varptr(x)))[0]
	i = ($be6f0000 - i) Shr 1
	x = (Float Ptr(Varptr(i)))[0]
	x = x*(1.5 - xhalf*x*x)
	Return x

End Function

Function SqrtAntMan:Float(x:Float)

	Local i = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	Return  ox *(1.5 - ((0.5*x)*ox*ox))
	
End Function

Function SqrtFredborg:Float(x:Float)

	Local i:Int = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1
	Return  (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0])))
	
End Function


Function SqrtSOS:Float(x:Float)

    Local xhalf:Float = 0.5*x
	Local i:Int = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	Return ox *(1.5 - (xhalf*ox*ox))
	
End Function

Function SqrtAza:Float(x:Float)

	Local xhalf:Float = 0.5*x
	Local ip:Int Ptr=Int Ptr(Varptr(x))
	ip[0]=($be6f0000 - ip[0]) Shr 1
	Return x *(1.5 - (xhalf*x*x))

End Function

Function SqrtPlain:Float(x:Float)

	Return Sqr(x)
	
End Function

Function Benchmark(label:String, func:Float(x:Float))

	Local startTime:Long = MilliSecs()
	For Local ii:Int=1 To 100000000
		func(Float(ii))
	Next
	Local endTime:Long = MilliSecs() - startTime
	Print label + ":~t~t" + endTime + "ms"

End Function

Print "~nInvSqrt:"
Benchmark("Raw C",InvSqrt)
Benchmark("Knotz",InvSqrtKnotz)
Benchmark("Artman",InvSqrtAntman)
Benchmark("Fredborg",InvSqrtFredborg)
Benchmark("SculptureOfSoul",InvSqrtSOS)
Benchmark("Azathoth",InvSqrtAza)
Benchmark("1/Sqr()",InvSqrtPlain)

Print "~nSqrt:"
Benchmark("Raw C",Sqrt)
Benchmark("Knotz",SqrtKnotz)
Benchmark("Artman",SqrtAntman)
Benchmark("Fredborg",SqrtFredborg)
Benchmark("SculptureOfSoul",SqrtSOS)
Benchmark("Azathoth",SqrtAza)
Benchmark("Sqr()",SqrtPlain)



Some more information can be found here:

http://www.math.purdue.edu/~clomont/Math/Papers/2003/InvSqrt.pdf

I've included the entries from the posts in the code. I've also added a fast implementation of the regular Sqr().

Have a nice weekend.


AntonyWells(Posted 2006) [#2]
Here's an optimized version,

Function InvSqrtFast:Float(x:Float)


	Local i = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	Return  ox *(1.5 - ((0.5*x)*ox*ox))
	
End Function



Original Bmax version:1586
Optimized Version:1316
Over 10000000 ilterations


Knotz(Posted 2006) [#3]
I get interesting results:

10000000 iterations:

Original c version:
1061ms

My BlitzMax version:
2191ms

Antsman's optimized version:
1204ms

Fredborg's optimized version:
1558ms

Normal 1/sqr:
2503ms

Seems that removing some locals and some assignments has a hefty influence on the performance.


fredborg(Posted 2006) [#4]
This seems slightly faster:
Function InvSqrtFast:Float(x:Float)

	Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Return  (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0])))
	
End Function



Knotz(Posted 2006) [#5]
Sorry fredborg, it isn't: 1558ms

Antman's is still the fastest.


fredborg(Posted 2006) [#6]
Hehe, on my machine it's a bit faster:
antman: 1243ms
mine: 1160ms


Knotz(Posted 2006) [#7]
Strange, I'm running it on a AMD processor.
I've changed the code in my first entry to show how i test the functions.


fredborg(Posted 2006) [#8]
This is what I get:
Raw C:1423
Knotz:2856
Artman:2139
Fredborg:1960
1/Sqr():3109


Perturbatio(Posted 2006) [#9]
using the code in the first post (excluding the C test):

Knotz:2752
Artman:1552
Fredborg:1880
1/Sqr():6507

superstrict:

Knotz:2673
Artman:1530
Fredborg:1849
1/Sqr():6170


Knotz(Posted 2006) [#10]
I get on my AMD Athlon X2 4200+:
Raw C:1362
Knotz:2820
Artman:1481
Fredborg:1713
1/Sqr():5305

On my wife's Pentium 4 2.4GHZ fredborg's is indeed (a tiny bit) faster than Artman. Or to say it differently: Artman's version performs better on an AMD than a Pentium.


AntonyWells(Posted 2006) [#11]
On a 1.8ghz Dou core centrino,

Executing:untitled1.exe
Knotz:2002
Artman:1413
Fredborg:1475
1/Sqr():7228 (lol)

Very close.


SculptureOfSoul(Posted 2006) [#12]
You guys are testing with debug build off, right?


bradford6(Posted 2006) [#13]
Please excuse the ignorance of this post.

How are you determining the accuracy of the function? Are the results of each function comparable to each other or are there varying levels of accuracy?


btw here are my results. (no c version)

Vanilla:
Knotz:2699
Artman:1608
Fredborg:1950
1/Sqr():6087

Superstrict:
Knotz:2657
Artman:1499
Fredborg:1807
1/Sqr():6129


SculptureOfSoul(Posted 2006) [#14]
Debug off:
Raw C:1758
Knotz:3189
Artman:2103
Fredborg:2527
1/Sqr():4006

debug on:
Raw C:7985
Knotz:19164
Artman:16634
Fredborg:15614
1/Sqr():18933


Knotz(Posted 2006) [#15]
The fast InvSqrt function that we're benchmarking is not as accurate as a normal 1/Sqr(x). But for 3d work it's accurate enough. In the article (link in my first post) you'll find more info on his topic.

The original algorithm itself is sheer black magic.

Fredborg's, Artman's and mine implementation are virtually the same and differ only in assignments and use of local variables. What's interesting is that Fredborg's and Antman's functions have different relative speeds on various processors.


SculptureOfSoul(Posted 2006) [#16]
Also, I'm not sure but might these yield the wrong results since an integer in Blitzmax is signed?


bradford6(Posted 2006) [#17]
Ok, thanks.

Again, please excuse my ignorance (I absolutely do not understand what is going on inside that function...but I'm working on it :) )

here is a little addition:
Function Compare_InvSqrt_results(label:String, func:Float(x:Float), value)
	Local res:Float	
	
	res=func(Float(value))
	
	Print label + "   ~t:"+ res
End Function

Local value:Float = 225000

'Compare_InvSqrt_results("Raw C",InvSqrt(value))
Compare_InvSqrt_results("Knotz",InvSqrtKnotz,value)
Compare_InvSqrt_results("Artman",InvSqrtAntman,value)
Compare_InvSqrt_results("Fredborg",InvSqrtFredborg,value)
Compare_InvSqrt_results("1/Sqr()",InvSqrtPlain,value)



my results:

Knotz   	:0.00210810988
Artman   	:0.00210810988
Fredborg   	:0.00210810988
1/Sqr()   	:0.00210818509



SculptureOfSoul(Posted 2006) [#18]
Here's an even faster version:
Function InvSqrtSOS:Float(x:Float)
     Local xhalf:Float = 0.5*x
	Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
	Local ox:Float = (Float Ptr(Varptr(i))[0])
	'xhalf = 1.5 -(xhalf*ox*ox)
	'Return ox * xhalf
	Return ox *(1.5 - (xhalf*ox*ox))
	
End Function


You can also try uncommenting the two commented lines and commenting the current return. That yields slightly different, but equally as fast (on my machine) results.

Here are my results (of the version posted, with the 2 lines commented out):

Raw C:1774
SoS:1838
Knotz:3062
Artman:2065
Fredborg:2483
1/Sqr():3973


Knotz(Posted 2006) [#19]
Indeed, you are. Kudos!
Updated the first post.


bradford6(Posted 2006) [#20]
SOS:1558
Knotz:2823
Artman:1612
Fredborg:1964
1/Sqr():4089


SculptureOfSoul(Posted 2006) [#21]
Here's another version if anyone feels like testing it. I'm pretty sure it'll always be slower than the first version I tested but...well, just thought I'd experiment.

Instead of returning the result, you pass the result in as a float pointer

This requires a modification to the benchmarking function as well, and so I only tested this solution "alone" (too lazy to change all the other functions :).

As it is, it's about 200 ms slower for me than my first version.

Here's the code to test it
Function InvSqrtSOS2(x:Float, ret:Float Ptr)
    Local xhalf:Float = 0.5*x
	Local i:Int = $5f3759df - Int Ptr(Varptr(x))[0] Shr 1
	Local ox:Float = Float Ptr(Varptr(i))[0]

	ret[0] = ox *(1.5 - (xhalf*ox*ox))
	
End Function

Function BenchmarkInvSqrt(label:String, func(x:Float, ret:Float Ptr))

	Local ii:Int
	Local res:Float Ptr = New Float[1]
	Local startTime:Long
	Local endTime:Long
	
	startTime = MilliSecs()
	For ii=1 To 100000000
	 func(Float(ii), res)
	'Print res
	Next
	
	endTime = MilliSecs() - startTime
	Print label + ":"+ endTime

End Function


BenchmarkInvSqrt("SoS2", InvSqrtSOS2)



Knotz(Posted 2006) [#22]
Even faster, note the Var argument:
10%(!!) faster than Raw C on my machine with some first tests.

Function InvSqrtSOS_2:Float(x:Float Var)

    Local xhalf:Float = 0.5*x
    Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1)
    Local ox:Float = (Float Ptr(Varptr(i))[0])
    Return ox *(1.5 - (xhalf*ox*ox))

End Function



Azathoth(Posted 2006) [#23]
Heres one I made that alters the contents of the pointer directly so theres no need for 'ox' and replaces 'i'.
It was the fastest on my computer just barely though.
Function InvSqrtAza:Float(x:Float)

	Local xhalf:Float = 0.5*x
	Local ip:Int Ptr=Int Ptr(Varptr(x))
	ip[0]=$5f3759df-(ip[0] Shr 1)
	Return x *(1.5 - (xhalf*x*x))

End Function



SculptureOfSoul(Posted 2006) [#24]

Even faster, note the Var argument:
10%(!!) faster than Raw C on my machine with some first tests.


Weird. I had tried a version with the Var argument (admittedly, I was passing in a second variable as a Var) and it slowed things down. In most of my other speed tests (I've been trying to build an ultra-fast hash table) passing by reference has lead to slower results as well. Of course, I'm on an old Pentium machine here (P4, 2.4 ghz - and my memory isn't all that fast, either)


Knotz(Posted 2006) [#25]
@Azathoth: Clever, but not faster on my machine. I've updated the source with your version.

@SculptureOfSoul: Yeah, one of the things that fascinates me is the different results on different pc architectures. Maybe Mark can enlighten me a bit.

I found some old thread where Dreamora & Co where discussing the same algorithm (Google cache) [a http://209.85.135.104/search?q=cache:lFc4bBDfnxQJ:wp1002880.wp004.webpack.hosteurope.de/viewtopic.php%3Ft%3D9157%26view%3Dnext%26sid%3D015aa8d02df49589939ef8bffc0237ef+InvSqrt+Blitzmax&hl=en&ct=clnk&cd=1&client=firefox-a]here[/a]

In that thread is also a FastSqrt function. I'll see if i can make a Blitzmax version of it also (done and updated the source in the first post).

My results:
InvSqrt:
Raw C: 1055ms
Knotz: 2071ms
Artman: 1227ms
Fredborg: 1514ms
SculptureOfSoul:1155ms
Azathoth: 1440ms
1/Sqr(): 4832ms

Sqrt:
Raw C: 1053ms
Knotz: 2089ms
Artman: 1237ms
Fredborg: 1501ms
SculptureOfSoul:1231ms
Azathoth: 1424ms
Sqr(): 2809ms


Azathoth(Posted 2006) [#26]
This is what i get on my computer minus the C version

InvSqrt:
Knotz: 2193ms
Artman: 1536ms
Fredborg: 1586ms
SculptureOfSoul: 1548ms
Azathoth: 1534ms
1/Sqr(): 3070ms

Sqrt:
Knotz: 2192ms
Artman: 1539ms
Fredborg: 1583ms
SculptureOfSoul: 1550ms
Azathoth: 1535ms
Sqr(): 1579ms


Knotz(Posted 2006) [#27]
Why is my regular Sqr() so much slower than yours? I'm running on a AMD Athlon X2 4200+, 2Gb Ram WinXP Home.


Azathoth(Posted 2006) [#28]
Maybe some Intel thing?


Knotz(Posted 2006) [#29]
Looks like it, my wife's P4 2.4Ghz blows my AMD 4200+ out of the water with Sqr(). I looked at Perturbatio (also AMD) results and they show the same behavior.


bradford6(Posted 2007) [#30]
The InvSqr numbers look good but the SQR numbers are quite a bit off (the results of the functions)

EDIT: this is the same as Knotz's first post but also shows a result for comparison of accuracy.




DJWoodgate(Posted 2007) [#31]
Very interesting. $5f3443ff seems to produce more accurate results for Invsqrt on my limited tests. No idea why, not sure what it is doing in the first place. Just thought I would play with a few digits. So I suppose the next challenge is to find an even more accurate approximation for what is a rather modest speed advantage.


bradford6(Posted 2007) [#32]
this is a bit more academic than useful. I suppose if you are doing a ton of these each frame this could make a difference. The Inverse Square results are close enough for 99% of bmaxers. The fastsqr functions are just wrong. (not even close)


JoshK(Posted 2008) [#33]
The fake Sqrt function is not even close to accurate, and inverting the results of 1.0/InvSqrt are slower than calling Sqr() in BMX. But InvSqrt() is fast.


Otus(Posted 2008) [#34]
Yeah, there's something wrong with that Sqrt in the first post. Here's an approximate Sqrt that takes roughly 1/5 the time BlitzMax Sqr does, but with ~1.5% average error.

float aSqr( float f ) {
	union { 
		float f ;
		int i ;
	} u ;
	u.f = f ;
	u.i = (1<<29) + (u.i>>1) - 0x0042DD02 ;
	return u.f ;
}



UByte(Posted 2008) [#35]
This is so freaky. To achieve a fish-eye effect in my current project I have to do the following transformation on each vertex

v' = v.z * v/|v|

I'm not ready to dabble with vertex shaders yet so have had to manually do the calculation on the CPU side. For high vertex counts the sqrt involved in calculating the magnitude of v is a massive bottleneck. Looking at your results I'll be able to potentially improve performance by up to 100%.

Many thangs guys.