Leadworks Josh


DStastny(Posted 2008) [#1]
@Josh I did a simple test with just a BMX version using a the concept of Immutable Vectors as Mark/your lib vs Mutable Vector Objects my approach.

I was expecting poor performance based on my architectural knowledge of garbage-collected languages and heap-based objects, but I think you will be surprised at how bad it really is. I know I was, as I waited and waited and waited for it to complete...

This is nothing more than a simple add of two vectors.

Building newtest
Compiling:newtest.bmx
flat assembler  version 1.66
3 passes, 5367 bytes.
Linking:newtest.exe
Executing:newtest.exe
Starting Tests

BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1682.7540232068600ms
Ops =118852.78373535293 per ms


BMX test Immutable Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Immutable Time=25544.810862833125ms
Ops =7829.3787757494629 per ms

Press Enter to Exit

BMX Code
SuperStrict
Framework brl.StandardIO


Extern "win32"
	Function QueryPerformanceFrequency(LARGE_INTEGER:Long Var)
	Function QueryPerformanceCounter(LARGE_INTEGER:Long Var)
EndExtern

Global freq : Long
Global startcount : Long 
Global stopcount : Long 
QueryPerformanceFrequency(freq)

Function StartTimer()
	QueryPerformanceCounter(startcount)
End Function
Function StopTimer:Double()
    QueryPerformanceCounter(stopcount)
    Return Double(stopcount-startcount)/(Double(freq)/1000)
End Function


Function Vector : TVector(x:Float,y:Float, z:Float)
	Local this : TVector = New TVector
	this.x = x
	this.y = y
	this.z = z
	Return this
End Function

Type TVector
	Field X:Float
	Field Y:Float
	Field Z:Float
	
	Method ToString:String()
		Return "x="+X+ " y="+Y+" z="+Z
	End Method	
	
	Method Add:TVector(v1:TVector, v2:TVector)
		X=V1.X+V2.X
		Y=V1.Y+V2.Y
		Z=V1.Z+V2.Z	
		Return Self			
	End Method
	
	Method Plus:TVector( v:TVector)
		Return Vector ( x+v.x,y+v.y,z+v.z )
	End Method

End Type

Const MAXLOOP:Int= 200000000

Function TestBMaxObject()
    Print ""
    Print("BMX test Object")
	Local vv1:TVector = New TVector
	vv1.X=1
	vv1.Y=2
	vv1.Z=3
	Local vv2:TVector = New TVector
	vv2.X=4
	vv2.Y=5
	vv2.Z=6
	Local vv3:TVector = New TVector
    StartTimer()
	For Local i:Int=0 To MAXLOOP
		vv3.Add(vv1,vv2)
	Next
	Local secs:Double = StopTimer()
	Print  "x="+vv3.X+ " y="+vv3.Y+" z="+vv3.Z
	Print "BMX Time="+secs +"ms"
	Print "Ops ="+(MaxLoop)/secs +" per ms"
	Print ""
End Function

Function TestBMaxObjectImmutable()
    Print ""
    Print("BMX test Immutable Object")
	Local vv1:TVector = New TVector
	vv1.X=1
	vv1.Y=2
	vv1.Z=3
	Local vv2:TVector = New TVector
	vv2.X=4
	vv2.Y=5
	vv2.Z=6
	Local vv3:TVector 
    StartTimer()
	For Local i:Int=0 To MAXLOOP
		vv3=vv1.Plus(vv2)
	Next
	Local secs:Double = StopTimer()
	Print  "x="+vv3.X+ " y="+vv3.Y+" z="+vv3.Z
	Print "BMX Immutable Time="+secs +"ms"
	Print "Ops ="+(MaxLoop)/secs +" per ms"
	Print ""
End Function

Print "Starting Tests"
TestBMaxObject()
TestBMaxObjectImmutable
Input("Press Enter to Exit")


Now that I know how much you are doing with these ops in your engine, I suspect you might want to reconsider immutable vector/matrix operations. The key is to preallocate all your vector/matrix/quat objects and minimize object creation at run time.

As good as Mark's GC is (and it really is very fast), it still sucks up a lot of CPU cycles. This is no different from object creation in any environment.

Now, one way that might speed this up would be for objects to have their own heaps instead of all sharing the same global GC heap. In C++/Object Pascal (Delphi)/D you can create custom allocators/heaps per class. That way you can introduce allocation pools and increase the speed tremendously. Still slower than preallocating, but significantly faster nonetheless.
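
As a rough sketch of the idea (purely illustrative; TVectorPool is a made-up name, not part of any BlitzMax module), a pool for the TVector type from the listing above could look something like this:

Type TVectorPool
	Field spare:TVector[]	' preallocated vectors waiting to be handed out again
	Field count:Int			' how many entries of spare are currently valid

	Function Create:TVectorPool(capacity:Int)
		Local pool:TVectorPool = New TVectorPool
		pool.spare = New TVector[capacity]
		For Local i:Int = 0 Until capacity
			pool.spare[i] = New TVector	' pay the allocation cost once, up front
		Next
		pool.count = capacity
		Return pool
	End Function

	Method Acquire:TVector()
		If count = 0 Then Return New TVector	' pool exhausted, fall back to the GC heap
		count :- 1
		Return spare[count]
	End Method

	Method Release(v:TVector)
		If count < spare.length
			spare[count] = v	' hand the vector back so the next Acquire reuses it
			count :+ 1
		EndIf
	End Method
End Type

In a hot loop you would call pool.Acquire() instead of New TVector and Release() it when done. Not as cheap as never allocating at all, but it keeps the GC out of the inner loop.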

Coming from the 8-bit days, I am a speed freak when it comes to heavily used routines such as math. Ideally I would like inlining; in that case even inline x87 code will beat the overhead of a function call to SSE-optimized code.

My routines are pretty much the best you can do with function calls.

Thought you would find this interesting.

Doug


Kurator(Posted 2008) [#2]
ouch :)


Beaker(Posted 2008) [#3]
Interesting.


JoshK(Posted 2008) [#4]
That's interesting, but I think you are focusing on one minor part of the program loop that is not significant compared to the rest of the processing that occurs.


DStastny(Posted 2008) [#5]
Um Josh, it's not the math that is the primary problem, it's the allocations.

The routines you are using are wasting valuable CPU time doing trivial math.

The numbers are quite clear. This is a common knowledge-and-technique issue, not specific to BMax; it applies to C++/C or whatever high-level language you like. Allocating/deallocating is slow.

My contention about the math routines as Mark wrote them is that they waste a boatload of CPU cycles.

Doug


sswift(Posted 2008) [#6]
I'm confused.


Arowx(Posted 2008) [#7]
Me too, so are there two separate arguments here?

1. The maths routines in BMax are not as optimised as they could be, e.g. in their use of CPU cycles.

2. Don't create new objects where you don't need to, as creation and garbage collection are slow processes, especially in heavily used code areas, e.g. vectors.

?


Koriolis(Posted 2008) [#8]
The real issue is number 2. Budman is totally right that allocation is relatively slow compared to simple math, so if you allocate/deallocate objects every time you do simple computations, the mere allocation/deallocation will be several orders of magnitude slower than the raw computation. Or, said another way, removing the need for continuous allocation/deallocation would make it several orders of magnitude *faster*.
And it's worse in a garbage-collected environment.
However, what Josh says is that these computations are such a small fraction of the overall processing that the overhead there can be neglected (in favor of increased ease of use).
Well, that's my understanding.


Dreamora(Posted 2008) [#9]
I don't think it can be neglected, given that Josh is using it for a 3D engine with higher-end effects that rely heavily on shaders, which themselves rely on the vertex data passed in (normals, matrices, etc.) ... something that will be generated by this "neglected" code ...

Using a pooling system for something that only ever needs a given number of objects of the same type, but needs them over and over again, is a great way to improve its performance.

Matrices/vectors are no different from particles in this respect: a given number of objects which are created and destroyed at quite a high frequency.


Kurator(Posted 2008) [#10]
	Method Add:TVector(v1:TVector, v2:TVector)
		X=V1.X+V2.X
		Y=V1.Y+V2.Y
		Z=V1.Z+V2.Z	
		Return Self			
	End Method


This returns a reference to an existing object; therefore it is faster.

	Method Plus:TVector( v:TVector)
		Return Vector ( x+v.x,y+v.y,z+v.z )
	End Method


This adds two vectors and returns a NEW object, without touching the values of the object on which the method was called.

IMHO the two methods are not comparable in their function, because they are doing different things.
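
A quick illustration using the TVector type and Vector() helper from the first post:

Local a:TVector = Vector(1, 2, 3)
Local b:TVector = Vector(4, 5, 6)

Local sum1:TVector = New TVector
sum1.Add(a, b)			' overwrites sum1 in place; no new object per call
Print sum1.ToString()	' x=5.00000000 y=7.00000000 z=9.00000000

Local sum2:TVector = a.Plus(b)	' allocates and returns a NEW TVector every call
Print a.ToString()		' a is untouched: x=1.00000000 y=2.00000000 z=3.00000000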

For frequently used objects (like particles in a particle system) you also want to reuse existing objects, not throw them away and instantiate new ones.

I think that's exactly the same topic here; programmers have to be aware of it.


JoshK(Posted 2008) [#11]
There are a few spots in the engine where math functions get called many times, and for those I have implemented a pure floating-point approach, after working out the routine with the math objects.

For things that don't get called that many times each frame, it doesn't really matter if you use objects like this.


JoshK(Posted 2008) [#12]
We have externed a few maths functions as C or CPP code using more raw math, but the difference is not as stark as you might think.

There is no discernible difference between the speed of imported CPP code and BlitzMax when performing raw maths.

Many operations cannot write their output to one of their parameters. For example, you cannot multiply two matrices together and write the output into one of them, because later lines of the function still read input elements that would already have been overwritten:
Function MultMat4( mat1:Float Ptr, mat2:Float Ptr, result:Float Ptr )
	result[0] = mat1[0]*mat2[0] + mat1[4]*mat2[1] + mat1[8]*mat2[2] + mat1[12]*mat2[3]
	result[1] = mat1[1]*mat2[0] + mat1[5]*mat2[1] + mat1[9]*mat2[2] + mat1[13]*mat2[3]
	result[2] = mat1[2]*mat2[0] + mat1[6]*mat2[1] + mat1[10]*mat2[2] + mat1[14]*mat2[3]
	result[3] = mat1[3]*mat2[0] + mat1[7]*mat2[1] + mat1[11]*mat2[2] + mat1[15]*mat2[3]
	result[4] = mat1[0]*mat2[4] + mat1[4]*mat2[5] + mat1[8]*mat2[6] + mat1[12]*mat2[7]
	result[5] = mat1[1]*mat2[4] + mat1[5]*mat2[5] + mat1[9]*mat2[6] + mat1[13]*mat2[7]
	result[6] = mat1[2]*mat2[4] + mat1[6]*mat2[5] + mat1[10]*mat2[6] + mat1[14]*mat2[7]
	result[7] = mat1[3]*mat2[4] + mat1[7]*mat2[5] + mat1[11]*mat2[6] + mat1[15]*mat2[7]
	result[8] = mat1[0]*mat2[8] + mat1[4]*mat2[9] + mat1[8]*mat2[10] + mat1[12]*mat2[11]
	result[9] = mat1[1]*mat2[8] + mat1[5]*mat2[9] + mat1[9]*mat2[10] + mat1[13]*mat2[11]
	result[10] = mat1[2]*mat2[8] + mat1[6]*mat2[9] + mat1[10]*mat2[10] + mat1[14]*mat2[11]
	result[11] = mat1[3]*mat2[8] + mat1[7]*mat2[9] + mat1[11]*mat2[10] + mat1[15]*mat2[11]
	result[12] = mat1[0]*mat2[12] + mat1[4]*mat2[13] + mat1[8]*mat2[14] + mat1[12]*mat2[15]
	result[13] = mat1[1]*mat2[12] + mat1[5]*mat2[13] + mat1[9]*mat2[14] + mat1[13]*mat2[15]
	result[14] = mat1[2]*mat2[12] + mat1[6]*mat2[13] + mat1[10]*mat2[14] + mat1[14]*mat2[15]
	result[15] = mat1[3]*mat2[12] + mat1[7]*mat2[13] + mat1[11]*mat2[14] + mat1[15]*mat2[15]
EndFunction


If mat2 and result are the same pointer, the results will be wrong. I tried doing a memcopy, but I actually found that the original BMX method of creating a new mat4 object and returning it was about 20% faster, I guess because I am allocating new memory anyway. You are right that it is better to write to an existing bit of memory, but that is not always possible, especially when it comes to matrices.

I will keep experimenting with it, but so far the only reason I have found to extern code to C is because it is a good learning experience, and it allows me to say the engine is written in C++ and BlitzMax.


JoshK(Posted 2008) [#13]
Hmmmm, I found that in a couple of functions I can use a global array declared inside the function as a temporary memory holder, then use MemCopy to copy the results into the object. This does give a significant speed increase.
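
Roughly, the idea looks like this (sketch only; MultMat4InPlace is just a placeholder name, and the 64 is 16 floats at 4 bytes each):

Function MultMat4InPlace( mat1:Float Ptr, mat2:Float Ptr, result:Float Ptr )
	Global scratch:Float[16]	' declared inside the function, allocated once and reused every call
	MultMat4( mat1, mat2, Varptr scratch[0] )	' compute into the scratch buffer first
	MemCopy( Byte Ptr(result), Byte Ptr(Varptr scratch[0]), 64 )	' then copy the finished matrix over result
End Function

Because the multiply goes into scratch first, result can safely be the same pointer as mat1 or mat2.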

Thanks for the tip.


DStastny(Posted 2008) [#14]
If you want to pump up speed with C/C++ code, it's all about inlining.

That won't work here, but what you can do is tweak the GCC compiler settings when compiling the C module containing your functions and drop the stack frame. That is a huge pick-up in function-call speed. It's not so much C vs BMax, since BMax's floating-point code is actually pretty good; the lack of inlining and the GC are what make it pretty much suck for math-intensive code. That, and a lot of your recent comments.

It's a language that is so, so close... yet so far.