Branch Prediction and BlitzMax
BlitzMax Forums/BlitzMax Programming/Branch Prediction and BlitzMax
| ||
So I was looking at this quite popular question on Stack Overflow : http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array And wondered if it was also true of BlitzMax. Here's our version of the code in BlitzMax : SuperStrict Framework BRL.Standardio Import BRL.Random Const arraySize:Int = 32768 Local data:Int[arraySize] For Local i:Int = 0 Until arraySize data[i] = Rand(0, 256) Next ' With this, the next loop runs faster 'data.Sort() ' Test Local start:Int = MilliSecs() Local sum:Long For Local i:Int = 0 Until 100000 For Local c:Int = 0 Until arraySize If data[c] >= 128 Then sum :+ data[c] End If Next Next Local time:Int = MilliSecs() - start Print time Print "sum = " + sum Run this to get an unsorted time. Then run it again with the sort uncommented. Interestingly, BlitzMax does appear to run faster against a sorted list. On my Mac, I get 20 seconds for unsorted and 10 seconds for sorted! That's twice as fast!! Then I decided to try the same test in BlitzMax NG. Here are the results for 32-bit on Mac. 3.7 seconds for unsorted. 3.7 seconds for sorted !! For 64-bit, 1.9 seconds for unsorted. 1.9 seconds for sorted. As you can see, apart from being several orders of magnitude faster than classic BlitzMax, the times for sorted and unsorted are in fact the same. That shows that a modern compiler - like clang - isn't affected so much by these issues. |
| ||
@ NG speed the link you gave right on top of your post explains already, why clang/GCC might perform so similar: GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast. @ array sort A pity custom sort is only available via mytype.compare() - which also affects List.Remove and other thingies. bye Ron |
| ||
Gives me ~14.0 secs unsorted and ~2.2 seconds when sorted! NG x64 |
| ||
Gives me ~14.0 secs unsorted and ~2.2 seconds when sorted! That's interesting. On a i7 Windows 10 box (GCC 5.1) I had similar stats to the Mac (SkyLake i5) - i.e. both times for sorted/unsorted the same. |
| ||
Similar results to col here... Unsorted: Building untitled1 Executing:untitled1.exe 12996 sum = 316934400000 Process complete Sorted: Building untitled1 Executing:untitled1.exe 3180 sum = 316934400000 Process complete Couple of runs each, no practical difference between x86 and x64, either. EDIT: gcc version 5.1.0 (tdm64-1) |
| ||
Blitzmax NG does show a time difference with me.. Unless it's somehow ignoring the GCC 5.1.0 inside the blitzmax/minGW32 folder and using the old 4.7.1 that's in my path? Blitzmax NG, GCC 5.1.0, Win10, intel I7-3930K @ 3.2GHz: Win32 exe: 11.084 seconds unsorted 2.199 seconds sorted Win64 exe: 10.971 seconds unsorted 1.733 seconds sorted |
| ||
Very interesting.. [x64] 1055 Without sort 1055 with sort [x86] 1127 with sort 1128 without Win 7: On an Intel i7-2600k @3.4 and 16g ram. |
| ||
Hm... Tried to remove my MinGW 4.7.1 completely and force a rebuild and ensure it's using 5.1.0, and am getting an error on rebuilding modules now:[ 58%] Processing:filesystem.bmx [ 58%] Processing:openalaudio.bmx [ 58%] Processing:glutil.bmx [ 58%] Processing:math3d.bmx [ 58%] Processing:d3d9.bmx [ 58%] Processing:sdlmixeraudio.bmx Compile Error: Overriding method does not match any overridden method. (Detail: Return type is "Int", expected "Void". You may have Strict type overriding SuperStrict type.) [c:/code/BlitzmaxNG510/mod/sdl.mod/sdlmixeraudio.mod/sdlmixeraudio.bmx;107;0] Build Error: failed to compile (-1) c:/code/BlitzmaxNG510/mod/sdl.mod/sdlmixeraudio.mod/sdlmixeraudio.bmx Done. |
| ||
compile your mods with the -w switch |
| ||
Interesting -- after using the custom.bmk and enabling the following line:setccopt optimization -O3 and recompiling the modules, the results are significantly better: [x64] 1040 without sort 1047 with sort [x86] 1001 without sort 1003 with sort (Still have an error on the SDL module though, it won't compile without errors, even after syncing a new copy from github) |
| ||
compile your mods with the -w switch IIRC the -w just supresses the warning messages -- but it appears it's not just a warning, it actually errors out and stops compilation at that point. |
| ||
As quoted in my first reply... You need to have -O3 as param for your gcc. Brucey is surely using that. @sdl.mod Raise an issue for the errors. Bye Ron |
| ||
setccopt optimization -O3 That is the default option in the latest make.bmk. I will look into versioning those .bmk files (or something), to give a warning if you aren't using a recent one... The original usage of -O2 was because early in bcc development I had instability issues with -O3 (over optimisation). That is not a problem now though. -O3 should generally produce faster code. sdlmixeraudio I've been meaning to remove it from the repo, actually, because I find it clunky. But I'll fix the issue (due to the brl audio modules being made SuperStrict). Anyway, back on topic :-) Yes, it seems the -O3 optimiser is not so prone to branch predition issues. |
| ||
My first post was using an i7-3610QM in a laptop The results here are using an intel i5-3570 in a desktop unit which gives x86 unsorted ~11.2s sorted ~2.2s x64 unsorted ~11.2s sorted ~1.7s Even more bizarre is that putting the initial 'time' assignment as the first line of code also DECREASES the unsorted time by a over a second to ~10.0s while making no noticeable change in speed when sorted. Regarding -O3, I'm using the latest of everything in the repo and the verbose output shows -O2? |
| ||
@ make.bmk Maybe just define a variable wiith the min bcc/bmk versions and then check versus the current ones (bcc -v and bmk -v ... Or via function call params when bmk file is executed by the chain). Bye Ron |
| ||
Regarding -O3, I'm using the latest of everything in the repo and the verbose output shows -O2? This is the current version. You may be overriding it in your custom.bmk ? |
| ||
Cool, with -O3 I get ~840 with sorted/unsorted on x86 and x64! Compared with 20 seconds unsorted and 12 seconds sorted on vanilla, that's amazing. |
| ||
You may be overriding it in your custom.bmk ? err yeah... *ahem* sorry for that. |
| ||
@ array sort A pity custom sort is only available via mytype.compare() - which also affects List.Remove and other thingies. Not entirely true. It is true of arrays, but lists can use a custom function instead of a method. Here is a list sorted on only the lowest 3 bits, ignoring the upper 29 bits. The internal compare still works, allowing List.Remove and such to work normally |
| ||
To quote myself: @ array sort I was aware of custom list sort options - and am already heavily using them in my framework and game. bye Ron |
| ||
sdlmixeraudio I've been meaning to remove it from the repo, actually, because I find it clunky. But I'll fix the issue (due to the brl audio modules being made SuperStrict). FWIW, it seems that bah.soloud / bah.soloudaudio is also broken at the moment, as are magick.mod, cairo.mod and maxmod2 |
| ||
I assume for the very same reason ;-) BTW it might be time to push the soloud-issue (lib, not wrapper) raised some months ago. bye Ron |
| ||
I added an obvious algorithmic optimization. Can gcc be made to do this in any way? Not to degrade gcc or bmax/ng, Im just curious if there are compilers that can catch low hanging fruit like this. |
| ||
Is it really low hanging fruit though? For instance, what if the array contained values that could cause an overflow somewhere? All that's needed is changing the line data[i] = Rand(0, 256) near the top into something like data[i] = 1234567, and suddenly your optimized version is wrong, because it calculates a different sum. But how would the compiler know Rand(0, 256) doesn't actually produce such numbers? To know that, it would need to analyze the contents of the Rand function (including all the functions it calls etc etc.) in order to predict what exactly will happen once you run your program, all before actually compiling your program - an incredibly complex, if not outright impossible, task. |
| ||
Yeah sure, but Its not part of the example. Every compiler ever written would fail if code was altered after analysis. For your example I could still just move the i loop to the addition instead, which would dodge most of the branch prediction penalties and unless hardware implementation has changed, quite possibly run faster then gcc's cmov optimization. This would be a very local analysis. Taking a quick look in brl.random the original example could be reduced to sum = whatever directly, since everything is hardcoded and Brucey didnt add a initial unpredictable seed. Which led to my question, because I have never seen any of this happening in a compiler. |