Branch Prediction and BlitzMax

BlitzMax Forums/BlitzMax Programming/Branch Prediction and BlitzMax

Brucey

(Posted 2016) [#1]

So I was looking at this quite popular question on Stack Overflow : http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array
And wondered if it was also true of BlitzMax.

Here's our version of the code in BlitzMax :

SuperStrict

Framework BRL.Standardio
Import BRL.Random

Const arraySize:Int = 32768

Local data:Int[arraySize]

For Local i:Int = 0 Until arraySize
	data[i] = Rand(0, 256)
Next

' With this, the next loop runs faster
'data.Sort()

' Test
Local start:Int = MilliSecs()

Local sum:Long


For Local i:Int = 0 Until 100000

	For Local c:Int = 0 Until arraySize
		If data[c] >= 128 Then
			sum :+ data[c]
		End If
	Next

Next

Local time:Int = MilliSecs() - start

Print time
Print "sum = " + sum

Run this to get an unsorted time. Then run it again with the sort uncommented.

Interestingly, BlitzMax does appear to run faster against a sorted list.
On my Mac, I get 20 seconds for unsorted and 10 seconds for sorted! That's twice as fast!!

Then I decided to try the same test in BlitzMax NG.
Here are the results for 32-bit on Mac. 3.7 seconds for unsorted. 3.7 seconds for sorted !!
For 64-bit, 1.9 seconds for unsorted. 1.9 seconds for sorted.

As you can see, apart from being several orders of magnitude faster than classic BlitzMax, the times for sorted and unsorted are in fact the same. That shows that a modern compiler - like clang - isn't affected so much by these issues.

Derron

(Posted 2016) [#2]

@ NG speed
the link you gave right on top of your post explains already, why clang/GCC might perform so similar:

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

@ array sort
A pity custom sort is only available via mytype.compare() - which also affects List.Remove and other thingies.

bye
Ron

col	(Posted 2016) [#3]

Gives me ~14.0 secs unsorted and ~2.2 seconds when sorted!

NG x64

Brucey

(Posted 2016) [#4]

Gives me ~14.0 secs unsorted and ~2.2 seconds when sorted!

That's interesting. On a i7 Windows 10 box (GCC 5.1) I had similar stats to the Mac (SkyLake i5) - i.e. both times for sorted/unsorted the same.

BlitzSupport

(Posted 2016) [#5]

Similar results to col here...

Unsorted:

Building untitled1
Executing:untitled1.exe
12996
sum = 316934400000

Process complete

Sorted:

Building untitled1
Executing:untitled1.exe
3180
sum = 316934400000

Process complete

Couple of runs each, no practical difference between x86 and x64, either.

EDIT: gcc version 5.1.0 (tdm64-1)

xlsior

(Posted 2016) [#6]

Blitzmax NG does show a time difference with me.. Unless it's somehow ignoring the GCC 5.1.0 inside the blitzmax/minGW32 folder and using the old 4.7.1 that's in my path?

Blitzmax NG, GCC 5.1.0, Win10, intel I7-3930K @ 3.2GHz:

Win32 exe:
11.084 seconds unsorted
2.199 seconds sorted

Win64 exe:
10.971 seconds unsorted
1.733 seconds sorted

GW	(Posted 2016) [#7]

Very interesting..

[x64]
1055 Without sort
1055 with sort

[x86]
1127 with sort
1128 without

Win 7: On an Intel i7-2600k @3.4 and 16g ram.

xlsior

(Posted 2016) [#8]

Hm... Tried to remove my MinGW 4.7.1 completely and force a rebuild and ensure it's using 5.1.0, and am getting an error on rebuilding modules now:

[ 58%] Processing:filesystem.bmx
[ 58%] Processing:openalaudio.bmx
[ 58%] Processing:glutil.bmx
[ 58%] Processing:math3d.bmx
[ 58%] Processing:d3d9.bmx
[ 58%] Processing:sdlmixeraudio.bmx
Compile Error: Overriding method does not match any overridden method. (Detail: Return type is "Int", expected "Void". You may have Strict type overriding SuperStrict type.)
[c:/code/BlitzmaxNG510/mod/sdl.mod/sdlmixeraudio.mod/sdlmixeraudio.bmx;107;0]
Build Error: failed to compile (-1) c:/code/BlitzmaxNG510/mod/sdl.mod/sdlmixeraudio.mod/sdlmixeraudio.bmx
Done.

GW	(Posted 2016) [#9]

compile your mods with the -w switch

xlsior

(Posted 2016) [#10]

Interesting -- after using the custom.bmk and enabling the following line:

setccopt optimization -O3

and recompiling the modules, the results are significantly better:

[x64]
1040 without sort
1047 with sort

[x86]
1001 without sort
1003 with sort

(Still have an error on the SDL module though, it won't compile without errors, even after syncing a new copy from github)

xlsior

(Posted 2016) [#11]

compile your mods with the -w switch

IIRC the -w just supresses the warning messages -- but it appears it's not just a warning, it actually errors out and stops compilation at that point.

Derron

(Posted 2016) [#12]

As quoted in my first reply... You need to have -O3 as param for your gcc.

Brucey is surely using that.

@sdl.mod
Raise an issue for the errors.

Bye
Ron

Brucey

(Posted 2016) [#13]

setccopt optimization -O3

That is the default option in the latest make.bmk.
I will look into versioning those .bmk files (or something), to give a warning if you aren't using a recent one...

The original usage of -O2 was because early in bcc development I had instability issues with -O3 (over optimisation). That is not a problem now though.
-O3 should generally produce faster code.

sdlmixeraudio

I've been meaning to remove it from the repo, actually, because I find it clunky. But I'll fix the issue (due to the brl audio modules being made SuperStrict).

Anyway, back on topic :-)
Yes, it seems the -O3 optimiser is not so prone to branch predition issues.

col	(Posted 2016) [#14]

My first post was using an i7-3610QM in a laptop

The results here are using an intel i5-3570 in a desktop unit which gives

x86
unsorted ~11.2s
sorted ~2.2s

x64
unsorted ~11.2s
sorted ~1.7s

Even more bizarre is that putting the initial 'time' assignment as the first line of code also DECREASES the unsorted time by a over a second to ~10.0s while making no noticeable change in speed when sorted.

Regarding -O3, I'm using the latest of everything in the repo and the verbose output shows -O2?

Derron

(Posted 2016) [#15]

@ make.bmk
Maybe just define a variable wiith the min bcc/bmk versions and then check versus the current ones (bcc -v and bmk -v ... Or via function call params when bmk file is executed by the chain).

Bye
Ron

Brucey

(Posted 2016) [#16]

Regarding -O3, I'm using the latest of everything in the repo and the verbose output shows -O2?

This is the current version.

You may be overriding it in your custom.bmk ?

BlitzSupport

(Posted 2016) [#17]

Cool, with -O3 I get ~840 with sorted/unsorted on x86 and x64! Compared with 20 seconds unsorted and 12 seconds sorted on vanilla, that's amazing.

col	(Posted 2016) [#18]

You may be overriding it in your custom.bmk ?

err yeah... *ahem* sorry for that.

TomToad

(Posted 2016) [#19]

@ array sort
A pity custom sort is only available via mytype.compare() - which also affects List.Remove and other thingies.

Not entirely true. It is true of arrays, but lists can use a custom function instead of a method.
Here is a list sorted on only the lowest 3 bits, ignoring the upper 29 bits. The internal compare still works, allowing List.Remove and such to work normally

SuperStrict
Type MyType
	Field x:Int
	
	Function Create:MyType(i:Int)
		Local M:MyType = New MyType
		M.x = i
		Return M
	End Function
	
	Function MyCompare:Int(O1:Object,O2:Object)
		Return (MyType(O1).x & 7) - (MyType(O2).x & 7)
	End Function
End Type

Local List:TList = CreateList()

For Local i:Int = 0 To 99
	List.AddLast(MyType.Create(Rand(0,1000000)))
Next

List.Sort(True,MyType.MyCompare)

For Local M:MyType = EachIn List
	Print Bin(M.x)+" "+M.x
Next

Derron

(Posted 2016) [#20]

To quote myself:

@ array sort

I was aware of custom list sort options - and am already heavily using them in my framework and game.

bye
Ron

xlsior

(Posted 2016) [#21]

sdlmixeraudio

I've been meaning to remove it from the repo, actually, because I find it clunky. But I'll fix the issue (due to the brl audio modules being made SuperStrict).

FWIW, it seems that bah.soloud / bah.soloudaudio is also broken at the moment, as are magick.mod, cairo.mod and maxmod2

Derron

(Posted 2016) [#22]

I assume for the very same reason ;-)

BTW it might be time to push the soloud-issue (lib, not wrapper) raised some months ago.

bye
Ron

zzz	(Posted 2016) [#23]

SuperStrict

Framework BRL.Standardio
Import BRL.Random

Const arraySize:Int = 32768

Local data:Int[arraySize]

For Local i:Int = 0 Until arraySize
	data[i] = Rand(0, 256)
Next

' With this, the next loop runs faster
'data.Sort()

' Test
Local start:Int = MilliSecs()

Local sum:Long

For Local i:Int = 0 Until 100000

	For Local c:Int = 0 Until arraySize
		If data[c] >= 128 Then
			sum :+ data[c]
		End If
	Next

Next

Local time:Int = MilliSecs() - start

Print time
Print "sum = " + sum

' algorithmic optimization

start = MilliSecs()

sum = 0

For Local c:Int = 0 Until arraySize
	
	If data[c] >= 128 Then
		sum :+ data[c] * 100000
	EndIf
	
Next

time = MilliSecs() - start

Print time
Print "sum = " + sum

I added an obvious algorithmic optimization. Can gcc be made to do this in any way? Not to degrade gcc or bmax/ng, Im just curious if there are compilers that can catch low hanging fruit like this.

FireballStarfish

(Posted 2016) [#24]

Is it really low hanging fruit though?
For instance, what if the array contained values that could cause an overflow somewhere? All that's needed is changing the line data[i] = Rand(0, 256) near the top into something like data[i] = 1234567, and suddenly your optimized version is wrong, because it calculates a different sum.
But how would the compiler know Rand(0, 256) doesn't actually produce such numbers? To know that, it would need to analyze the contents of the Rand function (including all the functions it calls etc etc.) in order to predict what exactly will happen once you run your program, all before actually compiling your program - an incredibly complex, if not outright impossible, task.

zzz	(Posted 2016) [#25]

Yeah sure, but Its not part of the example. Every compiler ever written would fail if code was altered after analysis.

For your example I could still just move the i loop to the addition instead, which would dodge most of the branch prediction penalties and unless hardware implementation has changed, quite possibly run faster then gcc's cmov optimization. This would be a very local analysis.

Taking a quick look in brl.random the original example could be reduced to sum = whatever directly, since everything is hardcoded and Brucey didnt add a initial unpredictable seed.

Which led to my question, because I have never seen any of this happening in a compiler.