Fast InvSqrt in Max
BlitzMax Forums/BlitzMax Programming/Fast InvSqrt in Max
| ||
After reading this article: http://www.beyond3d.com/articles/fastinvsqrt/ And having some time to spare while waiting on the Sint i made a BM version of the original algorithm. Original c version: float math_InvSqrt (float x){ float xhalf = 0.5f*x; int i = *(int*)&x; i = 0x5f3759df - (i >> 1); x = *(float*)&i; x = x*(1.5f - xhalf*x*x); return x; } float math_Sqrt (float x) { float xhalf = 0.5f*x; int i = *(int*)&x; i = (0xbe6f0000 - i)>>1; x = *(float*)&i; x = x*(1.5f - xhalf*x*x); return x; } and my Blitzmax version: Import "InvSqrt.c" Extern Function InvSqrt:Float(x:Float) = "math_InvSqrt" Function Sqrt:Float(x:Float) = "math_Sqrt" End Extern Function InvSqrtKnotz:Float(x:Float) Local xhalf:Float = 0.5*x Local i:Int = (Int Ptr(Varptr(x)))[0] i = $5f3759df - (i Shr 1) x = (Float Ptr(Varptr(i)))[0] x = x*(1.5 - xhalf*x*x) Return x End Function Function InvSqrtAntMan:Float(x:Float) Local i = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - ((0.5*x)*ox*ox)) End Function Function InvSqrtFredborg:Float(x:Float) Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Return (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0]))) End Function Function InvSqrtSOS:Float(x:Float) Local xhalf:Float = 0.5*x Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - (xhalf*ox*ox)) End Function Function InvSqrtAza:Float(x:Float) Local xhalf:Float = 0.5*x Local ip:Int Ptr=Int Ptr(Varptr(x)) ip[0]=$5f3759df-(ip[0] Shr 1) Return x *(1.5 - (xhalf*x*x)) End Function Function InvSqrtPlain:Float(x:Float) Return 1.0/Sqr(x) End Function Function SqrtKnotz:Float(x:Float) Local xhalf:Float = 0.5*x Local i:Int = (Int Ptr(Varptr(x)))[0] i = ($be6f0000 - i) Shr 1 x = (Float Ptr(Varptr(i)))[0] x = x*(1.5 - xhalf*x*x) Return x End Function Function SqrtAntMan:Float(x:Float) Local i = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1 Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - ((0.5*x)*ox*ox)) End Function Function SqrtFredborg:Float(x:Float) Local i:Int = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1 Return (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0]))) End Function Function SqrtSOS:Float(x:Float) Local xhalf:Float = 0.5*x Local i:Int = ($be6f0000 - (Int Ptr(Varptr(x))[0])) Shr 1 Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - (xhalf*ox*ox)) End Function Function SqrtAza:Float(x:Float) Local xhalf:Float = 0.5*x Local ip:Int Ptr=Int Ptr(Varptr(x)) ip[0]=($be6f0000 - ip[0]) Shr 1 Return x *(1.5 - (xhalf*x*x)) End Function Function SqrtPlain:Float(x:Float) Return Sqr(x) End Function Function Benchmark(label:String, func:Float(x:Float)) Local startTime:Long = MilliSecs() For Local ii:Int=1 To 100000000 func(Float(ii)) Next Local endTime:Long = MilliSecs() - startTime Print label + ":~t~t" + endTime + "ms" End Function Print "~nInvSqrt:" Benchmark("Raw C",InvSqrt) Benchmark("Knotz",InvSqrtKnotz) Benchmark("Artman",InvSqrtAntman) Benchmark("Fredborg",InvSqrtFredborg) Benchmark("SculptureOfSoul",InvSqrtSOS) Benchmark("Azathoth",InvSqrtAza) Benchmark("1/Sqr()",InvSqrtPlain) Print "~nSqrt:" Benchmark("Raw C",Sqrt) Benchmark("Knotz",SqrtKnotz) Benchmark("Artman",SqrtAntman) Benchmark("Fredborg",SqrtFredborg) Benchmark("SculptureOfSoul",SqrtSOS) Benchmark("Azathoth",SqrtAza) Benchmark("Sqr()",SqrtPlain) Some more information can be found here: http://www.math.purdue.edu/~clomont/Math/Papers/2003/InvSqrt.pdf I've included the entries from the posts in the code. I've also added a fast implementation of the regular Sqr(). Have a nice weekend. |
| ||
Here's an optimized version,Function InvSqrtFast:Float(x:Float) Local i = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - ((0.5*x)*ox*ox)) End Function Original Bmax version:1586 Optimized Version:1316 Over 10000000 ilterations |
| ||
I get interesting results: 10000000 iterations: Original c version: 1061ms My BlitzMax version: 2191ms Antsman's optimized version: 1204ms Fredborg's optimized version: 1558ms Normal 1/sqr: 2503ms Seems that removing some locals and some assignments has a hefty influence on the performance. |
| ||
This seems slightly faster:Function InvSqrtFast:Float(x:Float) Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Return (Float Ptr(Varptr(i))[0]) *(1.5 - ((0.5*x)*(Float Ptr(Varptr(i))[0])*(Float Ptr(Varptr(i))[0]))) End Function |
| ||
Sorry fredborg, it isn't: 1558ms Antman's is still the fastest. |
| ||
Hehe, on my machine it's a bit faster: antman: 1243ms mine: 1160ms |
| ||
Strange, I'm running it on a AMD processor. I've changed the code in my first entry to show how i test the functions. |
| ||
This is what I get: Raw C:1423 Knotz:2856 Artman:2139 Fredborg:1960 1/Sqr():3109 |
| ||
using the code in the first post (excluding the C test): Knotz:2752 Artman:1552 Fredborg:1880 1/Sqr():6507 superstrict: Knotz:2673 Artman:1530 Fredborg:1849 1/Sqr():6170 |
| ||
I get on my AMD Athlon X2 4200+: Raw C:1362 Knotz:2820 Artman:1481 Fredborg:1713 1/Sqr():5305 On my wife's Pentium 4 2.4GHZ fredborg's is indeed (a tiny bit) faster than Artman. Or to say it differently: Artman's version performs better on an AMD than a Pentium. |
| ||
On a 1.8ghz Dou core centrino, Executing:untitled1.exe Knotz:2002 Artman:1413 Fredborg:1475 1/Sqr():7228 (lol) Very close. |
| ||
You guys are testing with debug build off, right? |
| ||
Please excuse the ignorance of this post. How are you determining the accuracy of the function? Are the results of each function comparable to each other or are there varying levels of accuracy? btw here are my results. (no c version) Vanilla: Knotz:2699 Artman:1608 Fredborg:1950 1/Sqr():6087 Superstrict: Knotz:2657 Artman:1499 Fredborg:1807 1/Sqr():6129 |
| ||
Debug off: Raw C:1758 Knotz:3189 Artman:2103 Fredborg:2527 1/Sqr():4006 debug on: Raw C:7985 Knotz:19164 Artman:16634 Fredborg:15614 1/Sqr():18933 |
| ||
The fast InvSqrt function that we're benchmarking is not as accurate as a normal 1/Sqr(x). But for 3d work it's accurate enough. In the article (link in my first post) you'll find more info on his topic. The original algorithm itself is sheer black magic. Fredborg's, Artman's and mine implementation are virtually the same and differ only in assignments and use of local variables. What's interesting is that Fredborg's and Antman's functions have different relative speeds on various processors. |
| ||
Also, I'm not sure but might these yield the wrong results since an integer in Blitzmax is signed? |
| ||
Ok, thanks. Again, please excuse my ignorance (I absolutely do not understand what is going on inside that function...but I'm working on it :) ) here is a little addition: Function Compare_InvSqrt_results(label:String, func:Float(x:Float), value) Local res:Float res=func(Float(value)) Print label + " ~t:"+ res End Function Local value:Float = 225000 'Compare_InvSqrt_results("Raw C",InvSqrt(value)) Compare_InvSqrt_results("Knotz",InvSqrtKnotz,value) Compare_InvSqrt_results("Artman",InvSqrtAntman,value) Compare_InvSqrt_results("Fredborg",InvSqrtFredborg,value) Compare_InvSqrt_results("1/Sqr()",InvSqrtPlain,value) my results: Knotz :0.00210810988 Artman :0.00210810988 Fredborg :0.00210810988 1/Sqr() :0.00210818509 |
| ||
Here's an even faster version:Function InvSqrtSOS:Float(x:Float) Local xhalf:Float = 0.5*x Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Local ox:Float = (Float Ptr(Varptr(i))[0]) 'xhalf = 1.5 -(xhalf*ox*ox) 'Return ox * xhalf Return ox *(1.5 - (xhalf*ox*ox)) End Function You can also try uncommenting the two commented lines and commenting the current return. That yields slightly different, but equally as fast (on my machine) results. Here are my results (of the version posted, with the 2 lines commented out): Raw C:1774 SoS:1838 Knotz:3062 Artman:2065 Fredborg:2483 1/Sqr():3973 |
| ||
Indeed, you are. Kudos! Updated the first post. |
| ||
SOS:1558 Knotz:2823 Artman:1612 Fredborg:1964 1/Sqr():4089 |
| ||
Here's another version if anyone feels like testing it. I'm pretty sure it'll always be slower than the first version I tested but...well, just thought I'd experiment. Instead of returning the result, you pass the result in as a float pointer This requires a modification to the benchmarking function as well, and so I only tested this solution "alone" (too lazy to change all the other functions :). As it is, it's about 200 ms slower for me than my first version. Here's the code to test it Function InvSqrtSOS2(x:Float, ret:Float Ptr) Local xhalf:Float = 0.5*x Local i:Int = $5f3759df - Int Ptr(Varptr(x))[0] Shr 1 Local ox:Float = Float Ptr(Varptr(i))[0] ret[0] = ox *(1.5 - (xhalf*ox*ox)) End Function Function BenchmarkInvSqrt(label:String, func(x:Float, ret:Float Ptr)) Local ii:Int Local res:Float Ptr = New Float[1] Local startTime:Long Local endTime:Long startTime = MilliSecs() For ii=1 To 100000000 func(Float(ii), res) 'Print res Next endTime = MilliSecs() - startTime Print label + ":"+ endTime End Function BenchmarkInvSqrt("SoS2", InvSqrtSOS2) |
| ||
Even faster, note the Var argument: 10%(!!) faster than Raw C on my machine with some first tests. Function InvSqrtSOS_2:Float(x:Float Var) Local xhalf:Float = 0.5*x Local i:Int = $5f3759df - (Int Ptr(Varptr(x))[0] Shr 1) Local ox:Float = (Float Ptr(Varptr(i))[0]) Return ox *(1.5 - (xhalf*ox*ox)) End Function |
| ||
Heres one I made that alters the contents of the pointer directly so theres no need for 'ox' and replaces 'i'. It was the fastest on my computer just barely though. Function InvSqrtAza:Float(x:Float) Local xhalf:Float = 0.5*x Local ip:Int Ptr=Int Ptr(Varptr(x)) ip[0]=$5f3759df-(ip[0] Shr 1) Return x *(1.5 - (xhalf*x*x)) End Function |
| ||
Even faster, note the Var argument: 10%(!!) faster than Raw C on my machine with some first tests. Weird. I had tried a version with the Var argument (admittedly, I was passing in a second variable as a Var) and it slowed things down. In most of my other speed tests (I've been trying to build an ultra-fast hash table) passing by reference has lead to slower results as well. Of course, I'm on an old Pentium machine here (P4, 2.4 ghz - and my memory isn't all that fast, either) |
| ||
@Azathoth: Clever, but not faster on my machine. I've updated the source with your version. @SculptureOfSoul: Yeah, one of the things that fascinates me is the different results on different pc architectures. Maybe Mark can enlighten me a bit. I found some old thread where Dreamora & Co where discussing the same algorithm (Google cache) [a http://209.85.135.104/search?q=cache:lFc4bBDfnxQJ:wp1002880.wp004.webpack.hosteurope.de/viewtopic.php%3Ft%3D9157%26view%3Dnext%26sid%3D015aa8d02df49589939ef8bffc0237ef+InvSqrt+Blitzmax&hl=en&ct=clnk&cd=1&client=firefox-a]here[/a] In that thread is also a FastSqrt function. I'll see if i can make a Blitzmax version of it also (done and updated the source in the first post). My results: InvSqrt: Raw C: 1055ms Knotz: 2071ms Artman: 1227ms Fredborg: 1514ms SculptureOfSoul:1155ms Azathoth: 1440ms 1/Sqr(): 4832ms Sqrt: Raw C: 1053ms Knotz: 2089ms Artman: 1237ms Fredborg: 1501ms SculptureOfSoul:1231ms Azathoth: 1424ms Sqr(): 2809ms |
| ||
This is what i get on my computer minus the C version InvSqrt: Knotz: 2193ms Artman: 1536ms Fredborg: 1586ms SculptureOfSoul: 1548ms Azathoth: 1534ms 1/Sqr(): 3070ms Sqrt: Knotz: 2192ms Artman: 1539ms Fredborg: 1583ms SculptureOfSoul: 1550ms Azathoth: 1535ms Sqr(): 1579ms |
| ||
Why is my regular Sqr() so much slower than yours? I'm running on a AMD Athlon X2 4200+, 2Gb Ram WinXP Home. |
| ||
Maybe some Intel thing? |
| ||
Looks like it, my wife's P4 2.4Ghz blows my AMD 4200+ out of the water with Sqr(). I looked at Perturbatio (also AMD) results and they show the same behavior. |
| ||
The InvSqr numbers look good but the SQR numbers are quite a bit off (the results of the functions) EDIT: this is the same as Knotz's first post but also shows a result for comparison of accuracy. |
| ||
Very interesting. $5f3443ff seems to produce more accurate results for Invsqrt on my limited tests. No idea why, not sure what it is doing in the first place. Just thought I would play with a few digits. So I suppose the next challenge is to find an even more accurate approximation for what is a rather modest speed advantage. |
| ||
this is a bit more academic than useful. I suppose if you are doing a ton of these each frame this could make a difference. The Inverse Square results are close enough for 99% of bmaxers. The fastsqr functions are just wrong. (not even close) |
| ||
The fake Sqrt function is not even close to accurate, and inverting the results of 1.0/InvSqrt are slower than calling Sqr() in BMX. But InvSqrt() is fast. |
| ||
Yeah, there's something wrong with that Sqrt in the first post. Here's an approximate Sqrt that takes roughly 1/5 the time BlitzMax Sqr does, but with ~1.5% average error.float aSqr( float f ) { union { float f ; int i ; } u ; u.f = f ; u.i = (1<<29) + (u.i>>1) - 0x0042DD02 ; return u.f ; } |
| ||
This is so freaky. To achieve a fish-eye effect in my current project I have to do the following transformation on each vertex v' = v.z * v/|v| I'm not ready to dabble with vertex shaders yet so have had to manually do the calculation on the CPU side. For high vertex counts the sqrt involved in calculating the magnitude of v is a massive bottleneck. Looking at your results I'll be able to potentially improve performance by up to 100%. Many thangs guys. |