SSE Speed test

DStastny(Posted 2008) [#1]
I have been working on ways to improve performance of vector/matrix operations without having to resort to inline coding of the functions.

I have seen and written various other libraries in BMax, but I have been unhappy with the overall performance of pure Max-generated code.

So I started experimenting with implementing my own objects in pure C code. This has also opened up the ability to use the SSE intrinsics exposed by GCC.

I wrote a simple test BlitzMax module in C that exposes the TVector3 object. This is a Max object, but written in C. It uses some C compiler tricks to maximize performance, in particular no stack frames when dealing with straight x87 FPU code. I pretty much lost this gain when I implemented the next part, but the code is on par with, if not a bit better than, straight BMX code, though not by much.

The big gain comes from this..

There is also a global switch that, when enabled, makes the vector math run via SSE extensions (I haven't figured out how I want to implement SSE detection yet). On my dual core I am seeing a speedup of 30-33%. I am a bit worried about inconsistent timing, but it is stable on my dual-core machine.
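
For readers who want a picture of what such a switch could look like on the C side, here is a minimal sketch. It is not the module's actual source: the names `vec3`, `vec3_enable_sse` and `vec3_add`, and the padded 16-byte layout, are assumptions made for illustration. It uses the standard `<xmmintrin.h>` intrinsics when the compiler has SSE enabled and plain scalar math otherwise.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps */
#define VEC_HAVE_SSE 1
#else
#define VEC_HAVE_SSE 0
#endif

/* Hypothetical layout: x, y, z plus one pad float, so the struct is 16 bytes
   and fits a single SSE register. Keep pad at 0 so the fourth lane stays tame. */
typedef struct {
    float x, y, z, pad;
} __attribute__((aligned(16))) vec3;

static int g_use_sse = 0;   /* global switch, in the spirit of EnableSSE() */

void vec3_enable_sse(int on)
{
    /* Only honour the request if this build actually contains SSE code. */
    g_use_sse = on && VEC_HAVE_SSE;
}

/* out = a + b, written into an existing object (no allocation). */
vec3 *vec3_add(vec3 *out, const vec3 *a, const vec3 *b)
{
#if VEC_HAVE_SSE
    if (g_use_sse) {
        __m128 va = _mm_load_ps((const float *)a);   /* aligned 4-float load */
        __m128 vb = _mm_load_ps((const float *)b);
        _mm_store_ps((float *)out, _mm_add_ps(va, vb));
        return out;
    }
#endif
    /* Scalar fallback: the compiler decides between x87 and scalar SSE. */
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    return out;
}
```

Calling `vec3_enable_sse(1)` before `vec3_add(&r, &a, &b)` takes the intrinsic path on an SSE build. Both paths give the same x, y and z, so the switch only changes how the sum is computed.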


(Win32 test)
http://smokenmirrors.com/Downloads/test.zip

BMX code so you can see what's going on. It simply adds two vectors in a loop and times it.

TVector is coded in Max, with logic similar to TVector3, which is coded in C.

Unlike many of the other libraries, this library does not treat a vector as immutable, i.e. operations don't create new vector objects, they operate on an existing object. The reason for this approach is to avoid random invocation of garbage collection or memory allocations. It requires more thought and planning when using it, but it gives much more consistent frame times when dealing with large numbers of objects and extensive amounts of math, as long as you are careful with allocations. This also applies to non-GC languages when using heap-based objects.
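
The difference between the two styles can be sketched in plain C. This is an illustration only: `vec3f`, `vec3f_add_new`, `vec3f_add_into` and the allocation counter are hypothetical names, and `malloc` stands in for whatever allocator or GC sits behind a heap-based object.

```c
#include <stdlib.h>

typedef struct { float x, y, z; } vec3f;

static unsigned long g_allocs = 0;   /* counts heap allocations, for illustration */

/* "Immutable" style: every operation returns a freshly allocated result.
   The caller owns the result and must free() it. */
vec3f *vec3f_add_new(const vec3f *a, const vec3f *b)
{
    vec3f *r = malloc(sizeof *r);
    if (!r)
        return NULL;
    g_allocs++;
    r->x = a->x + b->x;
    r->y = a->y + b->y;
    r->z = a->z + b->z;
    return r;
}

/* Mutable style: the result is written into an object the caller already
   owns, so the hot loop never touches the allocator. */
vec3f *vec3f_add_into(vec3f *self, const vec3f *a, const vec3f *b)
{
    self->x = a->x + b->x;
    self->y = a->y + b->y;
    self->z = a->z + b->z;
    return self;
}

unsigned long vec3f_alloc_count(void)
{
    return g_allocs;
}
```

A loop that calls `vec3f_add_into` a million times leaves the allocation count untouched, while the same loop over `vec3f_add_new` performs a million allocations that something later has to free or collect.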

This sample is a test for me to gather some metrics on various machines before I spend much time coding the more important use of this, which is to get SSE-enabled matrix functions as well as vector normalization.

It is currently coded to only work on a Pentium IV (AMD?) or better, due to the switches and the way I am currently building the module, but it should eventually be possible to make it scale automatically across x86 processors as well as be cross-compilable to PowerPC (does anyone use these?).

Any test results, success or failure, with machine specs would be appreciated. The worst it should do is crash with an illegal opcode if SSE instructions are invoked on an invalid platform.
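
One possible way to avoid the illegal-opcode crash is to ask the CPU before flipping the switch. The sketch below is an assumption about how detection could be done, not what the module does. `cpu_has_sse` is a hypothetical name, and it relies on the `<cpuid.h>` helper that ships with reasonably recent GCC and Clang, which may not be available in the toolchain used for this test.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Returns 1 if the CPU reports SSE support, 0 otherwise.
   This checks the CPU feature flag only; it assumes the operating system
   saves and restores the SSE registers, which desktop systems do. */
int cpu_has_sse(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 returns the feature flags; SSE is bit 25 of EDX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 1 not available on this CPU */
    return (int)((edx >> 25) & 1u);
#else
    return 0;                   /* not an x86 CPU, so no SSE */
#endif
}
```

With something like this in place, the module could enable the SSE path only when `cpu_has_sse()` returns 1 and stay on x87 math otherwise.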

Thanks
Doug



SuperStrict
Framework brl.StandardIO
Import dbs.vector

Extern "win32"
	Function QueryPerformanceFrequency(LARGE_INTEGER:Long Var)
	Function QueryPerformanceCounter(LARGE_INTEGER:Long Var)
EndExtern


Global freq : Long
Global startcount : Long 
Global stopcount : Long 
QueryPerformanceFrequency(freq)

Function StartTimer()
	QueryPerformanceCounter(startcount)
End Function
' Returns the time elapsed since StartTimer() in milliseconds.
Function StopTimer:Double()
	QueryPerformanceCounter(stopcount)
	Return Double(stopcount-startcount)/(Double(freq)/1000)
End Function




Type TVector
	Field X:Float
	Field Y:Float
	Field Z:Float
	
	Method ToString:String()
		Return "x="+X+ " y="+Y+" z="+Z
	End Method	
	
	Method Add:TVector(v1:TVector, v2:TVector)
		X=V1.X+V2.X
		Y=V1.Y+V2.Y
		Z=V1.Z+V2.Z	
		Return Self			
	End Method
End Type

Const MAXLOOP:Int= 200000000

Function TestCObject(sse:Int)
	EnableSSE(sse)
	Print ""
	If sse 
		Print("C Vector SSE Enabled")
	Else
		Print("C Vector SSE Disabled")
	End If
		
	Local v1:TVector3 = New TVector3
	v1.X=1
	v1.Y=2
	v1.Z=3
	Local v2:TVector3 = New TVector3
	v2.X=4
	v2.Y=5
	v2.Z=6
	Local v3:TVector3 = New TVector3


	StartTimer()
	For Local i:Int=1 To MAXLOOP	' exactly MAXLOOP iterations
		v3.Add(v1,v2)
	Next
	Local ms:Double = StopTimer()	' elapsed time in milliseconds
	Print  "x="+v3.X+ " y="+v3.Y+" z="+v3.Z
	Print "C Code Time ="+ms +"ms"
	Print "Ops ="+MAXLOOP/ms +" per ms"
	Print ""
End Function

Function TestBMaxObject()
	Print ""
	Print("BMX test Object")
	Local vv1:TVector = New TVector
	vv1.X=1
	vv1.Y=2
	vv1.Z=3
	Local vv2:TVector = New TVector
	vv2.X=4
	vv2.Y=5
	vv2.Z=6
	Local vv3:TVector = New TVector
	StartTimer()
	For Local i:Int=1 To MAXLOOP	' exactly MAXLOOP iterations
		vv3.Add(vv1,vv2)
	Next
	Local ms:Double = StopTimer()	' elapsed time in milliseconds
	Print  "x="+vv3.X+ " y="+vv3.Y+" z="+vv3.Z
	Print "BMX Time="+ms +"ms"
	Print "Ops ="+MAXLOOP/ms +" per ms"
	Print ""
End Function

Print "Starting Tests"
TestCObject(False)
TestBMaxObject()
TestCObject(True)
Input("Press Enter to Exit")




JoshK(Posted 2008) [#2]
Interesting. I think SSE is that thing Intel was bugging me about supporting.

I would be very interested in this if you could develop a full library to replace my vector and maybe matrix functions in the maths code I posted, based on Mark's code. For something as specialized and frequently used as vector maths, it makes sense.

AMD 3800+ single core:
Starting Tests

C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1350.0266095271886ms
Ops =148145.22809298162 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1678.5524417209449ms
Ops =119150.28391662815 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1282.3249882317446ms
Ops =155966.70254066325 per ms

Press Enter to Exit



Dreamora(Posted 2008) [#3]
GCC with -O3 already does a fairly good job of vectorization. Modify BMK to use it, together with a higher -march than P1 (which is the reason it's so slow: it optimizes for totally unrealistic processor behavior), and you should easily be able to top those benchmarks. The parallelization from P3 upwards goes so far that it starts to use more than one core, for example when doing pure maths and the like.


ziggy(Posted 2008) [#4]
Is there any way to set up these parameters in the BMK compiler?


degac(Posted 2008) [#5]
Starting Tests

C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1475.9208223391520ms
Ops =135508.62415710400 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1836.8745189681929ms
Ops =108880.60013611805 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1408.1261724604663ms
Ops =142032.72683337267 per ms

Press Enter to Exit



Is there any way to set up these parameters in the BMK compiler?


Good question.


slenkar(Posted 2008) [#6]
Starting Tests

C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =3249.6968190091197ms
Ops =61544.202779194318 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=3108.4587566296832ms
Ops =64340.567354623068 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =2759.5233472410600ms
Ops =72476.284790254707 per ms

Press Enter to Exit


I did this test on a PC that is about 6 years old, so I dunno if it has SSE.

Strangely the BMX code was faster than plain C


Perturbatio(Posted 2008) [#7]
Starting Tests

C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1245.1944639952001ms
Ops =160617.48247603112 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1555.7853588557232ms
Ops =128552.43743076459 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1205.9277411761991ms
Ops =165847.41620167924 per ms

Press Enter to Exit



jsp(Posted 2008) [#8]
Starting Tests

C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1919.9758628540778ms
Ops =104167.97620710527 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1922.8306949626278ms
Ops =104013.31772160377 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =1590.0347669885418ms
Ops =125783.41313806080 per ms

Press Enter To Exit

Intel(R) Pentium(R) M processor 1.73GHz, 1528MB RAM
Mobile Intel(R) 915GM/GMS,910GML Express


DStastny(Posted 2008) [#9]
@Jeremy, what CPU is in that machine, and at what speed?

That is strange. I suspect the problem is QueryPerformanceCounter acting wonky. It has a history of problems on some BIOSes and CPUs.

If you run the test multiple times, are the results inconsistent?
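
A quick way to check for inconsistent results is to repeat the measurement and compare the fastest and slowest run. The sketch below is only an illustration: it swaps in the standard C `clock()` function instead of QueryPerformanceCounter, and `timing_spread`, `time_runs` and `sample_work` are hypothetical names.

```c
#include <time.h>

typedef struct {
    double min_ms;   /* fastest run */
    double max_ms;   /* slowest run */
} timing_spread;

/* Stand-in workload; replace with the code under test. */
void sample_work(void)
{
    volatile float acc = 0.0f;   /* volatile keeps the loop from being removed */
    for (int i = 0; i < 100000; i++)
        acc = acc + 1.0f;
}

/* Runs work() `runs` times and records the fastest and slowest run in ms.
   Returns {0, 0} when runs <= 0. */
timing_spread time_runs(void (*work)(void), int runs)
{
    timing_spread s = {0.0, 0.0};

    for (int i = 0; i < runs; i++) {
        clock_t t0 = clock();
        work();
        clock_t t1 = clock();
        double ms = 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;

        if (i == 0 || ms < s.min_ms) s.min_ms = ms;
        if (i == 0 || ms > s.max_ms) s.max_ms = ms;
    }
    return s;
}
```

If `max_ms` is far above `min_ms` across, say, ten runs of the same workload, the timer or the machine is adding noise, and the fastest run is usually the more trustworthy figure.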


DStastny(Posted 2008) [#10]
@Dreamora and others, concerning -O3 vs. -Os (the Blitz default):

That flag is the GCC optimization level; it has nothing to do with generating SSE vs. x87 floating-point code. -O3 does produce much faster code than -Os, which optimizes for size, at the expense of code size.

I am controlling all optimizations of the C code outside of BMK, although I have modified my version of BMK to allow modules to set the -Ox level as a flag. Currently, with Brucey's changes, -Os is set as the last option, so it overrides any changes made via Brucey's settings.

Just setting -O3 in BMK should produce better compiled C code. It has no impact on BCC-generated code, as that is straight ASM output. When you turn up optimizations, though, the compiler will sometimes toast your code. I assume Mark is erring on the side of caution to avoid debugging C problems caused by compiler optimizations screwing things up. But if you have a specific module and test the heck out of your functions, go for -O3.

There is also the BCC_OPTS environment variable, which controls all C code generation outside of the MODULEINFO changes. It is global, which is not so good; it would be better if there were something settable via a config file, but it works. There are global link options you can set too.


@Dreamora, to turn on SSE you need to set -march=pentium3 or better, as well as -mfpmath=sse.

This does not mean GCC will use SSE, only that it will try. GCC 3.4 is not so good at it, and all my research indicates that you should use intrinsics.

That works better for what I am attempting.

Now, in my case I am trying to generate code that falls back to straight Pentium x87 math in case SSE is not enabled.

For that you need to use intrinsics.

You really need to test things out and read the generated ASM output to understand what the flags are actually doing to your code, rather than what you think they are doing.

I have spent two weeks flipping the flags and examining the generated ASM. Use gcc -S instead of -c to get an ASM listing, or objdump if you want to see what is actually in the compiled output and don't want an ASM listing.
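
As a small, hedged example of that workflow, the file below (the name `add3.c` and the function `add3` are made up for illustration) can be compiled with different flags and the listings compared. The commands in the comment assume a 32-bit x86 GCC; on such a target you would expect x87 instructions such as `fadds` in the first listing and SSE instructions such as `addss` in the second, but the only way to be sure is to read the output.

```c
/* add3.c: a tiny function for comparing generated code.
 *
 * Example commands, assuming a 32-bit x86 GCC:
 *   gcc -O3 -S add3.c -o add3_x87.s
 *   gcc -O3 -march=pentium3 -mfpmath=sse -S add3.c -o add3_sse.s
 *   gcc -O3 -c add3.c -o add3.o
 *   objdump -d add3.o
 */

/* out[i] = a[i] + b[i] for the three components of a vector. */
void add3(float *out, const float *a, const float *b)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
}
```

Keeping the function this small makes the listing short enough to read in full, so any change caused by a flag stands out.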

Doug


DStastny(Posted 2008) [#11]
@Josh - I have looked at Mark's/your implementation. I am not sure the benefits would be readily visible, since it is implemented as an immutable object, which means it would spend unpredictable amounts of time in the GC memory manager. My goal is to ensure all memory allocations are controlled by the programmer. I will see about extending the example to show you the difference, and to see for myself how bad the overhead of the GC is. This is no different than calling malloc. Creation of an object is going to be significantly slower in all cases. But the math would be as fast as possible :)

The other thing I have been thinking about: I am pretty sure that code is the basis for BlitzMax. Are the matrix operations coded to DX9, or to real math as in OpenGL? For my part I want DirectX matrices, but I thought a good library could do the transformations. I know Ogre does all its math as real math and transforms at the driver level by doing the transposition.


Doug


slenkar(Posted 2008) [#12]
It's an Intel Celeron, 2GHz.


DStastny(Posted 2008) [#13]
@Jeremy Interesting.

I didn't even know Celerons supported SSE. I will have to look up its specs. I think I have an idea about the speed difference, though.

The ASM for the BMax vector and my plain C vector is the same, with one substantial difference. To support both the SSE code and the x87 code, my underlying data structures had to make some memory alignment adjustments so that the vector data is aligned on a 16-byte boundary. I suspect that this, combined with the size and limitations of the caching mechanisms in the Celeron architecture, leads to cache misses, which cause the additional overhead with the C-based structures.
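
To make the alignment cost concrete, here is a small sketch comparing a packed three-float vector with a 16-byte aligned one. The struct names and helpers are hypothetical and say nothing about the real object layout in the module, which also carries BlitzMax object data. The point is only that alignment padding makes each vector larger, so fewer of them fit in a small cache.

```c
#include <stddef.h>
#include <stdint.h>

/* Three packed floats: 12 bytes, 4-byte alignment. */
typedef struct {
    float x, y, z;
} plain_vec3;

/* Padded and aligned for SSE loads: 16 bytes, 16-byte alignment. */
typedef struct {
    float x, y, z, pad;
} __attribute__((aligned(16))) aligned_vec3;

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Memory needed for n vectors in each layout. */
size_t bytes_for_plain(size_t n)   { return n * sizeof(plain_vec3); }
size_t bytes_for_aligned(size_t n) { return n * sizeof(aligned_vec3); }
```

In this sketch the aligned layout needs a third more memory per vector (16 bytes instead of 12), which is the kind of overhead that could show up as extra cache misses on a CPU with a small cache.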

Hmmm interesting...
Thanks for feedback
Doug


DStastny(Posted 2008) [#14]
FYI thanks to all for feedback. It is very interesting to see how this piece of code runs on various CPUs.

It seems that having no stack frames gives a huge boost on AMD, although the SSE implementations there are not up to par with Intel's, while still being faster than x87.

Doug


REDi(Posted 2008) [#15]
C Vector SSE Disabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =2200.1195682691514ms
Ops =90904.150340038716 per ms


BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=2438.5532239432664ms
Ops =82015.843671679089 per ms


C Vector SSE Enabled
x=5.00000000 y=7.00000000 z=9.00000000
C Code Time =2142.3499355364997ms
Ops =93355.430260236564 per ms