Leadwerks Josh
BlitzMax Forums/BlitzMax Programming/Leadwerks Josh
| ||
@Josh I did a simple test with a plain BMX version, comparing the concept of immutable vectors (as in Mark's/your lib) against mutable vector objects (my approach). I was expecting poor performance based on my architectural knowledge of garbage-collected languages and heap-based objects, but I think you will be surprised at how bad it really is. I know I was, as I waited and waited and waited for it to complete. This is nothing more than a simple add of two vectors.

Building newtest
Compiling:newtest.bmx
flat assembler version 1.66
3 passes, 5367 bytes.
Linking:newtest.exe
Executing:newtest.exe
Starting Tests

BMX test Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Time=1682.7540232068600ms
Ops =118852.78373535293 per ms

BMX test Immutable Object
x=5.00000000 y=7.00000000 z=9.00000000
BMX Immutable Time=25544.810862833125ms
Ops =7829.3787757494629 per ms
Press Enter to Exit

BMX Code:

SuperStrict
Framework brl.StandardIO

Extern "win32"
	Function QueryPerformanceFrequency(LARGE_INTEGER:Long Var)
	Function QueryPerformanceCounter(LARGE_INTEGER:Long Var)
EndExtern

Global freq:Long
Global startcount:Long
Global stopcount:Long

QueryPerformanceFrequency(freq)

Function StartTimer()
	QueryPerformanceCounter(startcount)
End Function

Function StopTimer:Double()
	QueryPerformanceCounter(stopcount)
	Return Double(stopcount - startcount) / (Double(freq) / 1000)
End Function

Function Vector:TVector(x:Float, y:Float, z:Float)
	Local this:TVector = New TVector
	this.X = x
	this.Y = y
	this.Z = z
	Return this
End Function

Type TVector
	Field X:Float
	Field Y:Float
	Field Z:Float

	Method ToString:String()
		Return "x=" + X + " y=" + Y + " z=" + Z
	End Method

	' Mutable: writes the result into Self, no allocation.
	Method Add:TVector(v1:TVector, v2:TVector)
		X = v1.X + v2.X
		Y = v1.Y + v2.Y
		Z = v1.Z + v2.Z
		Return Self
	End Method

	' Immutable: allocates and returns a new TVector on every call.
	Method Plus:TVector(v:TVector)
		Return Vector(X + v.X, Y + v.Y, Z + v.Z)
	End Method
End Type

Const MAXLOOP:Int = 200000000

Function TestBMaxObject()
	Print ""
	Print "BMX test Object"
	Local vv1:TVector = New TVector
	vv1.X = 1 ; vv1.Y = 2 ; vv1.Z = 3
	Local vv2:TVector = New TVector
	vv2.X = 4 ; vv2.Y = 5 ; vv2.Z = 6
	Local vv3:TVector = New TVector
	StartTimer()
	For Local i:Int = 0 To MAXLOOP
		vv3.Add(vv1, vv2)
	Next
	Local secs:Double = StopTimer()
	Print "x=" + vv3.X + " y=" + vv3.Y + " z=" + vv3.Z
	Print "BMX Time=" + secs + "ms"
	Print "Ops =" + (MAXLOOP / secs) + " per ms"
	Print ""
End Function

Function TestBMaxObjectImmutable()
	Print ""
	Print "BMX test Immutable Object"
	Local vv1:TVector = New TVector
	vv1.X = 1 ; vv1.Y = 2 ; vv1.Z = 3
	Local vv2:TVector = New TVector
	vv2.X = 4 ; vv2.Y = 5 ; vv2.Z = 6
	Local vv3:TVector
	StartTimer()
	For Local i:Int = 0 To MAXLOOP
		vv3 = vv1.Plus(vv2)
	Next
	Local secs:Double = StopTimer()
	Print "x=" + vv3.X + " y=" + vv3.Y + " z=" + vv3.Z
	Print "BMX Immutable Time=" + secs + "ms"
	Print "Ops =" + (MAXLOOP / secs) + " per ms"
	Print ""
End Function

Print "Starting Tests"
TestBMaxObject()
TestBMaxObjectImmutable()
Input("Press Enter to Exit")

Now, given how many of these ops you are doing in your engine, I suspect you might want to reconsider immutable vector/matrix operations. The key here is to preallocate all your vector/matrix/quat objects and minimise real-time object creation. As good as Mark's GC is (and it really is very fast), it still eats a lot of CPU cycles. This is no different from object creation in any other environment. One way you might be able to speed it up is to give objects their own heaps instead of all sharing the same global GC heap. In C++/Object Pascal (Delphi)/D you can create custom allocator heaps per class. This way you can introduce allocation pools and increase the speed tremendously. Still slower than preallocating, but significantly faster nonetheless. Coming from the 8-bit days, I am a speed freak when it comes to heavily used routines such as math. Ideally I would like inlining; in that case even inline x87 code will beat the overhead of a function call to SSE-optimised code. My routines are pretty much the best you can do with function calls. Thought you would find this interesting. Doug |
| ||
ouch :) |
| ||
Interesting. |
| ||
That's interesting, but I think you are focusing on one minor part of the program loop that is not significant compared to the rest of the processing that occurs. |
| ||
Um Josh, it's not the math that's the primary problem, it's the allocations. The routines as written waste valuable CPU time doing trivial math. The numbers are quite clear; this is a common-knowledge technique issue, not specific to BMax but just as true of C++/C or any high-level language you care to name: allocating/deallocating is slow. My contention with the math routines as Mark wrote them is that they waste a boatload of CPU cycles. Doug |
| ||
I'm confused. |
| ||
Me too. So are there two separate arguments here? 1. The maths routines in BMax are not as optimised as they could be, e.g. in their use of CPU cycles. 2. Don't create new objects where you don't need to, as creation and garbage collection are slow processes, especially in heavily used code areas, e.g. vectors. ? |
| ||
The real issue is number 2. Budman is totally right that allocation is relatively slow compared to simple math, so if you allocate/deallocate objects every time you do a simple computation, the mere allocation/deallocation will be several orders of magnitude slower than the raw computation. Or said another way, removing the need for continuous allocation/deallocation would make it several orders of magnitude *faster*. And it's worse in a garbage-collected environment. However, what Josh says is that these computations are such a small fraction of the overall processing that an overhead there can be neglected (in favor of increased ease of use). Well, that's my understanding. |
| ||
I don't think it can be neglected, given that Josh is using it for a 3D engine with higher-end effects that rely heavily on shaders, which in turn rely on the vertex data fed in (normals, matrices etc.)... something that will be generated by this "neglected" code. Using a pooling system for something that needs only a given number of objects of the same type, but over and over again, is a great way to improve performance. Matrices/vectors are no different from particles in this respect: a given number of objects which are created and destroyed at quite a high frequency. |
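The pooling idea above can be sketched as a free-list pool in C++ (the names `VecPool`, `Acquire` and `Release` are made up for illustration, not any engine's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pooled objects carry an intrusive free-list link so Acquire/Release
// are a pointer swap rather than a heap allocation.
struct Vec3 { float x, y, z; Vec3 *next; };

class VecPool {
    std::vector<Vec3> storage; // one up-front allocation for the whole pool
    Vec3 *free_list = nullptr;
public:
    explicit VecPool(std::size_t n) : storage(n) {
        for (auto &v : storage) { v.next = free_list; free_list = &v; }
    }
    // O(1), no heap traffic; returns nullptr when the pool is exhausted.
    Vec3 *Acquire() {
        Vec3 *v = free_list;
        if (v) free_list = v->next;
        return v;
    }
    // Return the object for reuse instead of freeing it.
    void Release(Vec3 *v) {
        v->next = free_list;
        free_list = v;
    }
};
```

An acquired-then-released object comes straight back on the next `Acquire`, which is exactly the reuse pattern particle systems use.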
| ||
Method Add:TVector(v1:TVector, v2:TVector)
	X = v1.X + v2.X
	Y = v1.Y + v2.Y
	Z = v1.Z + v2.Z
	Return Self
End Method

This returns a reference to an existing object, and is therefore faster.

Method Plus:TVector(v:TVector)
	Return Vector(X + v.X, Y + v.Y, Z + v.Z)
End Method

This adds two vectors, returning a NEW object, without touching the values of the object the method was called on. IMHO the two methods are not comparable in their function, because they are doing different things. For often-used objects (like particles in a particle system) you also want to reuse existing objects, not throw them away and instance them from new. I think that's the exact same topic here: programmers have to be aware of it. |
| ||
There are a few spots in the engine where math functions get called many times, and for those I have implemented a pure floating-point approach, after working out the routine with the math objects. For things that don't get called that many times each frame, it doesn't really matter if you use objects like this. |
| ||
We have externed a few maths functions as C or CPP code using more raw math, but the difference is not as stark as you might think. There is no discernible difference between the speed of imported CPP code and BlitzMax when performing raw maths.

Many operations cannot be written to one of the parameters. For example, you cannot multiply two matrices together and write the output to one of them, because the function doesn't progress in order:

Function MultMat4( mat1:Float Ptr, mat2:Float Ptr, result:Float Ptr )
	result[0] = mat1[0]*mat2[0] + mat1[4]*mat2[1] + mat1[8]*mat2[2] + mat1[12]*mat2[3]
	result[1] = mat1[1]*mat2[0] + mat1[5]*mat2[1] + mat1[9]*mat2[2] + mat1[13]*mat2[3]
	result[2] = mat1[2]*mat2[0] + mat1[6]*mat2[1] + mat1[10]*mat2[2] + mat1[14]*mat2[3]
	result[3] = mat1[3]*mat2[0] + mat1[7]*mat2[1] + mat1[11]*mat2[2] + mat1[15]*mat2[3]
	result[4] = mat1[0]*mat2[4] + mat1[4]*mat2[5] + mat1[8]*mat2[6] + mat1[12]*mat2[7]
	result[5] = mat1[1]*mat2[4] + mat1[5]*mat2[5] + mat1[9]*mat2[6] + mat1[13]*mat2[7]
	result[6] = mat1[2]*mat2[4] + mat1[6]*mat2[5] + mat1[10]*mat2[6] + mat1[14]*mat2[7]
	result[7] = mat1[3]*mat2[4] + mat1[7]*mat2[5] + mat1[11]*mat2[6] + mat1[15]*mat2[7]
	result[8] = mat1[0]*mat2[8] + mat1[4]*mat2[9] + mat1[8]*mat2[10] + mat1[12]*mat2[11]
	result[9] = mat1[1]*mat2[8] + mat1[5]*mat2[9] + mat1[9]*mat2[10] + mat1[13]*mat2[11]
	result[10] = mat1[2]*mat2[8] + mat1[6]*mat2[9] + mat1[10]*mat2[10] + mat1[14]*mat2[11]
	result[11] = mat1[3]*mat2[8] + mat1[7]*mat2[9] + mat1[11]*mat2[10] + mat1[15]*mat2[11]
	result[12] = mat1[0]*mat2[12] + mat1[4]*mat2[13] + mat1[8]*mat2[14] + mat1[12]*mat2[15]
	result[13] = mat1[1]*mat2[12] + mat1[5]*mat2[13] + mat1[9]*mat2[14] + mat1[13]*mat2[15]
	result[14] = mat1[2]*mat2[12] + mat1[6]*mat2[13] + mat1[10]*mat2[14] + mat1[14]*mat2[15]
	result[15] = mat1[3]*mat2[12] + mat1[7]*mat2[13] + mat1[11]*mat2[14] + mat1[15]*mat2[15]
EndFunction

If mat2 and result are the same pointer, the results will be wrong. I tried doing a memcpy, but I actually found that the original BMX method of creating a new mat4 object and returning it was about 20% faster, I guess because I am allocating new memory anyway. You are right that it is better to write to an existing bit of memory, but that is not always possible, especially when it comes to matrices. I will keep experimenting with it, but so far the only reason I have found to extern code to C is because it is a good learning experience, and it allows me to say the engine is written in C++ and BlitzMax. |
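For what it's worth, the aliasing problem can be worked around by computing into a stack temporary and copying out at the end. A C++ sketch, assuming the same column-major layout as the function above (`MultMat4Safe` is a made-up name, not from the engine):

```cpp
#include <cassert>
#include <cstring>

// Aliasing-safe 4x4 column-major multiply: accumulate into a local
// temp, then memcpy into result. Because nothing is written to result
// until the end, result may alias mat1 or mat2.
void MultMat4Safe(const float *mat1, const float *mat2, float *result) {
    float tmp[16];
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += mat1[k * 4 + row] * mat2[col * 4 + k];
            tmp[col * 4 + row] = s;
        }
    std::memcpy(result, tmp, sizeof(tmp)); // single bulk copy out
}
```

The temporary lives on the stack, so unlike the new-object version there is no heap allocation at all; whether it beats the allocate-and-return version is worth measuring rather than assuming.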
| ||
Hmmmm, I found that in a couple of functions I can use a Global array declared inside the function as a temporary memory holder, then use memcpy to copy the results to the object. This does give a significant speed increase. Thanks for the tip. |
| ||
If you want to pump up speed: with C/C++ code it's all about inlining. That won't work here, but what you can do is tweak the GCC compiler settings when compiling the C module containing your functions and drop the stack frame. That is a huge pick-up in function-call speed. It's not so much C vs BMax, as BMax floating-point code is actually pretty good. The lack of inlining, plus the GC, is what makes it pretty much suck for math-intensive code. That, and as a lot of your recent comments suggest, it's a language that is so, so close... but yet so far. |