SSE Speed Test

TeraBit

(Posted 2007) [#1]

Hi All,

I've been playing with splicing some Assembly optimised SSE math routines into my BlitzMax programs. If you could download the speed test and post the results here it would give me an idea as to how worthwhile it is doing so.

Download the test (60Kb)

It uses the SSE SIMD processor instructions, so anything from a Pentium III/Athlon XP and up should be able to run it.

H&K	(Posted 2007) [#2]

691ms
31ms
x22.29

78ms
41ms
x1.90

_33	(Posted 2007) [#3]

Vertex Matrix Multiply:
blitz -> 282 ms
SSE Assembly -> 20ms

Vector normalisation:
blitz -> 45 ms
SSE Assembly -> 26 ms

AMD Opteron 165 @ 2.8 GHz

TeraBit

(Posted 2007) [#4]

The normalisation is pretty efficient in Blitz as it is. Nice to see a 2x(ish) speed for using the SSE instructions though. Between 10x - 20x improvement in Matrix Math O.o Sweet. These are all handled as individual calls. There is probably more scope for optimising for batches.

One bugbear is that I have needed to use the unaligned SSE instructions (slower) since I can't see an easy way to get variable addresses aligned to 16 byte boundaries. :/

I'm beginning to like Assembly patching. 8D

Edit: Hmmm, inline assembly would be even cooler. ;)

iprice

(Posted 2007) [#5]

Avira says there is a trojan horse in your program :-/

Dreamora

(Posted 2007) [#6]

If you want to speed things up generally, open the bmk source files and replace the processor flag for the MinGW call with P3 flags (MMX, SSE, SSE2).
Make sure you have a more current MinGW than BRL supports right now.

Then rebuild the math modules and all other core modules, but make sure not to touch maxgui or any image format library in pub. libpng especially does not like the MMX flag (you can set p3 flag -mmx if you want).

This sweet little trick will raise the general performance of your app by 50-150% and allow BM to use more than one core for its calculation (don't ask me why, but 70% CPU usage on a dual core is a clear signal that the SSE / SSE2 part actually has internal async calculation support).
And those values came from simple code, not math-heavy code, where the gain should be even larger.
A small bonus of doing it this way: the float inaccuracy problems of BM are gone.

Here are the values:

Matrix:
-Blitz: 221ms
-SSE: 16ms -- Speedup 13.8125

Vector:
-Blitz: 50ms
-SSE: 24ms -- Speedup 2.0833

Core2Duo E6600 overclocked to 2x 3 GHz

_33	(Posted 2007) [#7]

Dreamora, what does this do, generally speaking? Are you recompiling the BRL libraries with MMX / SSE / SSE2 enabled?

MGE	(Posted 2007) [#8]

"This sweet little trick will raise the general performance of your app by 50-150%.."

Dreamora, request......PLEASE..

Could you start a new thread in the forum about this and post detailed instructions how to do this, and also any detailed info about the good and bad about doing this. Please? Thanks.

Dreamora

(Posted 2007) [#9]

Simply put: the MinGW that BlitzMax uses by default is so old that the newest processor it supports is the Pentium Pro.
But BM only targets Pentium, i.e. no MMX support.

With the new MinGW you get far more options, such as P3, P4 or, in my case, P-M, which is optimized for a "short" pipeline depth, which is what every processor besides the P4 has.

To change the processor architecture, just open src/bmk/bmk_util.bmx and look in the compileC function for the -march= option.
That sets the targeted processor architecture. Check Google or the documentation of the newer GCC for all available architectures and the other options you could set.

My command line there :

opts:+" -march=pentium-m -ffast-math -fno-exceptions"

Now build bmk again and replace the one in the bin folder (I would keep the original).

TeraBit

(Posted 2007) [#10]

Thanks for the posts guys. I'm not planning to write anything very extensive in Assembly. It's mainly the routines that are in the tightest loops like Transformation, Normalisation, Geometry Intersection, Ray Casting etc.

@ f4ktor

Avira says there is a trojan horse in your program :-/

Thanks. I scanned the .EXE before uploading it. The .EXE is packed and has the .ASM DLL bundled into it. Some scanners interpret this kind of bundling as 'dodgy behaviour'. I probably wouldn't bother in a production environment, but for the purposes of the test, it's easier to distribute a single small .EXE file.

@ Dreamora

I'm hoping this sort of thing will become an official option soon. A free speed up without all the .ASM insanity is always welcome ;)

I'd be interested to know how fast the Matrix Multiplication is using the Native Tweaked Blitz!

The code:

SuperStrict
Framework brl.GLMax2D

SetGraphicsDriver GLMax2DDriver() 
AppTitle = "Assembly Language Optimisations Test"
Graphics 640, 480

Global mat:Matrix4 = New Matrix4

Local v:Vector3 = New Vector3
Local a:Vector3 = New Vector3
Local M:Int = 0
Local Counter:Int = 0
Local Timer1:Int = 0

mat.Identity
v.x = 1
v.y = 1
v.z = 1

' Time 999999 native (non-SSE) matrix * vector multiplies
m = MilliSecs() 
For counter = 1 To 999999
	mat.MultiplyVector (a, v) 
Next
Timer1 = MilliSecs() - m

While Not KeyDown(KEY_ESCAPE) 
	Cls
	
	DrawText "999999 x Vertex Matrix Multiply: Blitz Native Opt: " + Timer1 + " ms.", 10, 30
	DrawText "Press ESC to exit", 10, 460
	Flip
Wend

Type Matrix4
    Field m:Float[16] 
    Method Zero() 
      Local a:Int
      For a = 0 To 15
        Self.m[a]=0
      Next
    EndMethod
    Method Identity() 
      Zero()
      Self.m[0] = 1
      Self.m[5] = 1
      Self.m[10]= 1
      Self.m[15]= 1
    EndMethod
    Method MultiplyVector(ans:Vector3 Var, n:Vector3) 
      Local tmp:Double[3]
      Local tmp2:Double[3]
      Local itmp3:Double
      itmp3 = 1.0 /(n.x*m[3]+n.y*m[7]+n.z*m[11]+m[15])
      tmp[0] = itmp3*n.x
      tmp[1] = itmp3*n.y
      tmp[2] = itmp3*n.z
      ans.x=tmp[0]*m[0]+tmp[1]*m[4]+tmp[2]*m[ 8]+itmp3*m[12]
      ans.y=tmp[0]*m[1]+tmp[1]*m[5]+tmp[2]*m[ 9]+itmp3*m[13]
      ans.z=tmp[0]*m[2]+tmp[1]*m[6]+tmp[2]*m[10]+itmp3*m[14]
    EndMethod
End Type
Type Vector3
    Field x:Float, y:Float, z:Float
End Type

iprice

(Posted 2007) [#11]

Vertex Matrix Multiply - BMX: 350ms
Vertex Matrix Multiply - SSE: 29ms (ca 12 times faster)

Vector Normalisation - BMX: 62ms
Vector Normalisation - SSE: 29ms (ca. 2 times faster)

Intel Core2Duo E4300 @ 2,4GHz

TeraBit

(Posted 2007) [#12]

Update: Managed to shave another 30% off of the ASM version of the Matrix Multiply code. I'll post an updated test later this evening.
So in theory Dreamora's numbers should end up something like:

Matrix:
-Blitz: 221ms
-SSE: 12ms -- Speedup 18.4 x

Is any 'tweaked out' individual going to have a go at posting the numbers for the above code?

degac

(Posted 2007) [#13]

Vertex Matrix Multiply - BMX: 317ms
Vertex Matrix Multiply - SSE: 20ms

Vector Normalisation - BMX: 51ms
Vector Normalisation - SSE: 29ms

My computer at office:
Athlon64 X2 Dual Core 4600+ 2,41 Ghz
2 GB Ram
Windows XP PRO Sp2

MGE	(Posted 2007) [#14]

" To change the processor architecture, just open src/bmk/bmk_util.bmx and look in compileC function for the -march= command. "

"My command line there :"

opts:+" -march=pentium-m -ffast-math -fno-exceptions"

"now build bmk again and replace the one in the bin folder (I would keep the original)."

I was going to try this until I read:

"Then rebuild the math modules and all other core modules but make sure not to touch maxgui or any image format library in pub. especially libpng does not like the MMX flag (you can set p3 flag -mmx if you want)"

So now I'm confused. Like I said, a new thread in the forum detailing exactly what to do and what not to do, along with any side effects, would be appreciated when/if you have time. Thanks!

Derron

(Posted 2007) [#15]

1) open "blitzmaxinstalldir/src/bmk/bmk_util.bmx"
2) search "function compileC"
3) search ?Win32 ... "opts:+"
4) replace like you wish
5) build the file (F5)
6) copy the new "bmk.exe" into "blitzmaxinstalldir/bin"

7) open BlIDE
8) select "Modules - Advanced Modules Builder" (or so)
9) click on "inverse selection"
10) deselect all things with "jpg", "tga", "png" (brl.pngloader and so on)
11) deselect axe.maxgui (or where it was included)
12) press "build modules"
13) wait until finished

14) all modules are now compiled using your installed mingw/gcc-version and the settings made in the "bmk_util.bmx"-file

if someone thinks this should be a new post... split it apart from this thread, thanks.

bye
MB

Perturbatio

(Posted 2007) [#16]

First Test:
BMAX: 306ms
ASM: 18ms

(17 x faster)

Second Test:
BMax: 48ms
ASM: 27ms

(1.777779 x faster)

Grisu

(Posted 2007) [#17]

First Test
14xfaster

Second Test:
1.7xfaster

Question: Will this also speed up maxgui apps?

MGE	(Posted 2007) [#18]

Perhaps I'm thick from just waking up, but how would you do 7-13 from the normal IDE?

degac

(Posted 2007) [#19]

Question: Will this also speed up maxgui apps?

This is a test that uses some Assembly optimised SSE math routines.
The SSE instructions http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions were created to accelerate certain calculations on data.
So I can't really see how MaxGUI applications (I mean only the GUI of course...) could perform better... or maybe with 70 (and more) instructions there are some surprises.

ziggy

(Posted 2007) [#20]

As long as the MaxGUI module makes calculations (and it does), it will work faster when calling 'fastered' math functions and the like. But most of the work in the MaxGUI module is event based, so I wouldn't expect to see a big performance improvement.

SebHoll

(Posted 2007) [#21]

Will this speed up MaxGUI?

I hope so!

I know next to nothing about assembly, but would I be right in saying that these new CPU extensions are not backwards compatible, i.e. if you get MinGW to compile for the Pentium-M instruction set, then the result won't run on earlier processors such as the P3? If this is the case, I can see why Mark doesn't want to implement this in BlitzMax, as it would reduce the overall compatibility of BlitzMax applications...

TeraBit

(Posted 2007) [#22]

As degac says, these kinds of things are most suited to math intensive routines which require processing on entire vectors (3D mainly).

The kind of thing I have in mind will require a recent (current generation P4->Core2 or Athlon XP->Athlon 64) PC to run, so SSE is fine.

The word appears to be that the next version of Max, out next week, will include an updated MinGW build, so much of this may indeed be a moot point. I'm sure everything will speed up a little, but nowhere near the order of magnitude you get with focused mathematics and vectorisation.

Derron

(Posted 2007) [#23]

Perhaps I'm thick from just waking up, but how would you do 7-13 from the normal IDE?

Hmm, I never knew how to do this in the regular IDE, because you can only press "CTRL + D" to build all (although this item is grayed out in the menu). I think you are better off using BLIde, or doing it manually (command line).

But as has already been said: the changes are not really significant when it comes to graphics-intensive things. It's mostly a boost for special calculations. So as long as your main routines rely on graphics display and not on massive calculations (10.000 particles linked together in some way), there is no real need to change the way BM works.
(My game didn't benefit from the changes... too few calculations to stress even an old 800MHz Duron, and the graphics would lag the same, so no advantage for me.)

As I see you are doing a game framework, you may want to check whether it makes your app run faster, but I think you won't gain the speed boost you hope to see. Remember that the architecture you choose affects your application: older PCs won't get a chance to play even a simple Arkanoid with your framework (if I understood the subject of changing the target architecture correctly).

bye
MB

Grisu

(Posted 2007) [#24]

There will be a bmx update next week? Where did you read that?

TeraBit

(Posted 2007) [#25]

Hi Grisu,

There will be a bmx update next week? Where did you read that?

I heard about it here.

Grisu

(Posted 2007) [#26]

Thanks Terabit.

With a newer version of MinGW, can the user decide whether to enable this SSE support or not? Then one could distribute two different exe files (optimised or not) if needed.

My gui app makes a lot of calculations. So a speed up in this area would be awesome. Perhaps some less visible "smearing" of windows and elements as well.

TeraBit

(Posted 2007) [#27]

Not sure really, but an updated MinGW was part of the steps mentioned above. We'll soon see I suppose :)

MGE	(Posted 2007) [#28]

Thanks MichaelB, I was wondering if the changes would only work on certain PCs. I guess I'll hold off on any mods like this for a while. :(

xlsior

(Posted 2007) [#29]

274 ms
19 ms
14.42 times faster

58 ms
29 ms
2 times faster

degac

(Posted 2007) [#30]

I just installed Bmax 1.26 and tried TeraBit's matrix example.

Old 1.24 - 355ms
New 1.26 - 321ms

Good! A 10% gain is very good!
Cheers!