Which is most optimised?


Grey Alien(Posted 2008) [#1]
Hi, say I have a method with about 10 params; some are passed in as copies, and some are passed in by reference (Var), e.g.

MyMethod(x, y, z, a Var, b Var, c Var)

and I call that method maybe 100+ times per logic iteration, so clearly there's a lot of parameter passing going on. Would it be any quicker if I made those params fields of the containing type, so that no passing needs to go on?
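
To illustrate, here's a stripped-down sketch of the two alternatives (names made up; the real method has ~10 params):

Strict

Type TThing
	' Alternative B: the values live as fields of the type
	Field x:Float
	Field y:Float
	Field z:Float

	' Alternative A: pass everything in on each call
	Method DoWorkParams:Float(px:Float, py:Float, pz:Float)
		Return px + py + pz
	EndMethod

	' Alternative B: read the fields directly, no parameter passing
	Method DoWorkFields:Float()
		Return x + y + z
	EndMethod
EndType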

I ask this because I wonder if the compiler is optimised to use params quicker than fields of a potentially big type, or vice versa? Or maybe it makes no difference.

Has anyone tested this? Or got any wild theories? Thx

Oh yes, I know I could inline the function but that would be a pain as there are no Macros in BMax and you can't have a sub-function inside a method that uses the method's local variables (I tried this a while back and it just crashes + we had a whole discussion about it then)

Thanks for any feedback.


Gabriel(Posted 2008) [#2]
You already predicted my suggestion, because I think anything you do 100 times per logic loop that requires 10 parameters is crying out to be inlined, but you offer solid reasons against that, so we'll push that aside.

It's very hard to predict compiler behaviour. Kind of off-topic, but someone recently showed me an example of a loop (in another language, not BMax) where doing math in the loop was faster than doing nothing in the loop, because the compiler identifies loops with heavy math in them and optimizes them. So I'm pretty wary of predicting what a compiler would or wouldn't do. Still, in theory, I would expect it to be faster to have the variables belong to the type. If it's a particularly big type with a lot of members, perhaps not.


Grey Alien(Posted 2008) [#3]
Interesting about the loop.

Yeah, if the type is pretty big (which it is) I guess its fields aren't likely to be in any "quick access" memory - that's what I was thinking.

Course I could try to make a test app...


sswift(Posted 2008) [#4]
Have you actually profiled this to see if enough time is being spent in your logic loop to even warrant optimizing it?

In my experience, optimizing your graphics routines is worthwhile. Optimizing logic loops, not so much.


CS_TBL(Posted 2008) [#5]
Why not simply run such a call a million times and check how many millisecs that costs? Then you have your answer.


Grey Alien(Posted 2008) [#6]
Have you actually profiled this to see if enough time is being spent in your logic loop to even warrant optimizing it?
Yes, it's an expensive function - that's why I'm posting about it. I'll probably have to write a test app to compare the two approaches, but I thought I'd post here in case anyone had anything concrete.


sswift(Posted 2008) [#7]
What are you doing where a simple logic loop is affecting your framerate? And have you looked into whether you could improve the algorithm to reduce the number of these function calls, rather than reducing the time each function call takes?


Grey Alien(Posted 2008) [#8]
I'm testing a large level stored in an array, so it could be 100-200 times (and I do this 200 times a second). The algo is pretty neat and optimised, and includes early loop exiting and stuff like that.


ImaginaryHuman(Posted 2008) [#9]
I did extensive testing of how much overhead is required for calling functions and getting variables to be accessible within the function in different ways. I found that Var was not any faster than not using it. It seems to provide a different kind of functionality but it really did not seem the slightest bit faster for me (or slower). What I actually found is that if you do not `pass` the variables at all, but instead make them Globals outside of the function, then that IS faster than any form of parameter passing. The speedup is noticeable.

What it comes down to is the issue of the function being essentially a separation or a barrier between two parts of code. In order to overcome that barrier you have to copy variables from one side of the barrier to the other, and it is this copying which we call parameter passing. However, removing the barrier entirely will always be more efficient than any of the variations in parameter passing simply because you don't have to do an extra read and write for every variable you access.

I created a multitasking script engine which calls lots of functions, and in my tests I found that defining as much as possible to be Global was by far the most efficient way to share variables with functions. I don't have any statistics in terms of percentages off-hand, but I do recall Globals were the clear winner. So if you want the extra speed, use Globals (which you might want to copy into temporary Local variables within the function for even faster use), but otherwise using Var or not using it probably won't have much impact, if any. Another option is to store variables in arrays and pass only the handle of the array to the function, or make it a global array.
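
Roughly the kind of comparison I mean - a minimal sketch with made-up names (not my actual test code), and the exact timings will vary per machine:

Strict

Global gx:Int = 1, gy:Int = 2, gz:Int = 3

Function AddParams:Int(x:Int, y:Int, z:Int)
	Return x + y + z
EndFunction

Function AddGlobals:Int()
	' No parameter passing at all - the function reads the globals directly
	Return gx + gy + gz
EndFunction

Local x:Int = 1, y:Int = 2, z:Int = 3
Local total:Int, m:Int

Const loops:Int = 100000000

m = MilliSecs()
For Local i:Int = 0 Until loops
	total :+ AddParams(x, y, z)
Next
Print "Params:  " + (MilliSecs() - m)

m = MilliSecs()
For Local i:Int = 0 Until loops
	total :+ AddGlobals()
Next
Print "Globals: " + (MilliSecs() - m)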

It is always more efficient to SHARE than to set up separations and boundaries and to have to then create added complexity and degradation of performance in trying to overcome those boundaries. If you are battling with boundaries, get rid of them where possible.

As to the mention of some code being faster with code in a loop than in an empty loop, I have also found that there IS some benefit to arranging the instructions you call to parallelize the CPU's pathways. This is something people did a lot in assembly language, where you have full knowledge of which instructions use which parts of the CPU, which do memory accesses, etc, and you understand what your system structure is, how fast the cache is and how it behaves, what happens when you read/write to memory, and so on... and then you can streamline your code.

For example, if you are reading a value from memory, it takes time for the memory bus to read the data, and if you try to use the value in the very next instruction it will have to sit and wait for the memory-fetch part of the CPU to finish. Therefore, if there is some other instruction you can call right after the memory read which does not depend on the result of that previous instruction, it can sometimes be processed for free, or at least with some increase in throughput.

It's not quite so easy to predict or to know how things are behaving at a low level when you're working in a higher level language, because a single instruction could be calling thousands of assembly instructions. But in terms of memory accesses (and I know this may make your code a little `out of order` and possibly harder to follow), I have definitely noticed there IS a real improvement when you consider the memory accesses in your BlitzMax code and try to do unrelated processing in-between. Something to consider.


AlexO(Posted 2008) [#10]
Anyone care to explain why passing a large type vs a small type to a function is slower? A type is just a reference (pointer) to the memory location where the actual data for the type is.

Also, why would reading fields from a type be any slower based on the size of the type? As far as I recall, offsets to fields aren't computed iteratively [unless you count what's done at compile time].


I ask this because I wonder if the compiler is optimised to use params quicker than fields of a potentially big type, or vice versa?



The only scenario I can see where it might matter is if you have a large param list of primitives (Int, Float, etc) which aren't objects. The method would then copy them by value each time it was called, instead of just directly accessing them inside the method through the type's fields. I guess if you had a large list of objects in the param list it could cause the same issue, due to copying references (not the actual objects) to the variable names.


Azathoth(Posted 2008) [#11]
anyone care to explain why passing a large type vs a small type to a function is slower?
Why would it be? I made a test and there is no difference (The only difference in time was which loop came first, not the size of the type)

Strict

Type Small
	Field x:Int
EndType

Type Big
	Field a:Int
	Field b:Float
	Field c:Long
	Field z:String
EndType

Function FuncParam(a,b,c,d)
EndFunction


Function FuncSmall(a:Small)
EndFunction


Function FuncBig(a:Big)
EndFunction

Local aSmall:Small=New Small
Local aBig:Big=New Big

Local a_i=10, b_i=2, c_i=34, d_i=0
Local m:Int

Const loop=100000000
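
' Each run below times 100 million calls: four Int params passed by value,
' then a Small object reference, then a Big object reference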

m=MilliSecs()
For Local i=0 Until loop
	FuncParam(a_i,b_i,c_i,d_i)
Next
Print MilliSecs()-m

m=MilliSecs()
For Local i=0 Until loop
	FuncSmall(aSmall)
Next
Print MilliSecs()-m

m=MilliSecs()
For Local i=0 Until loop
	FuncBig(aBig)
Next
Print MilliSecs()-m

End

(On my computer the last loop is the slower of the Type tests; it doesn't matter if it's the Small or the Big Type)


Gabriel(Posted 2008) [#12]
anyone care to explain why passing a large type vs a small type to a function is slower? A type is just a reference (pointer) to some memory location of where the actual data for the type is.

No one said it was, did they? I know I didn't.

Also why would reading fields from a type be any slower based on the size of the type? Offsets to fields I don't recall are computed iteratively [unless done at compile time].

I didn't say that they were, I said that they might be. My thinking is that there are a limited number of registers available to the CPU, and the more fields you have, the less chance of the fields you want being in those registers. Of course, this presumes a level of optimization which might not be present in the first instance, so it may well be that there is no speed difference because they're both slow (relatively speaking).

Your example isn't exactly what I would call a big type. When I said a big type, I had in mind something at least 10-20 times the size of that.


Grey Alien(Posted 2008) [#13]
I did extensive testing of how much overhead is required for calling functions and getting variables to be accessible within the function in different ways. I found that Var was not any faster than not using it. It seems to provide a different kind of functionality but it really did not seem the slightest bit faster for me (or slower). What I actually found is that if you do not `pass` the variables at all, but instead make them Globals outside of the function, then that IS faster than any form of parameter passing. The speedup is noticeable.
Excellent research! It sounds like my question is answered!

Just to offer a bit more info about my scenario: I've got a method of a big type which creates a whole bunch of local variables, and I then pass these into a 2nd method which I'm calling in a loop. So what I'm comparing speed on is a) local variables created in one method and passed into another method, versus b) global variables used in the first method (instead of locals) and then not passed into the second method, which simply accesses the same globals. I was wondering if the local variables created in the first method would somehow be faster to access than globals even though they get passed into a function (due to them being in the CPU registers or immediate cache). Having a SUB-FUNCTION (or Macro) capability in BMax would certainly resolve this issue, as I bet local variables that didn't need to be passed into the second method *would* be the fastest technique.

(which you might want to turn into temporary Local variables within the function for even faster utilization)
This raises another interesting point I was thinking about (must be my old assembly days making me think like this, but only for heavily used functions where it's worth it on a modern PC). Anyway, if I was just using globals once or twice in a function/method I wouldn't bother making a local copy (assuming it would be held in registers or on the STACK), but if I looped and used them a lot, sure, a local copy should be faster. A related question: if I have too many local variables (say more than 16 - how many registers does a modern PC have anyway?) would those local variables stop being as fast? Probably, but maybe if a stack was being used, it would still be faster than globals in "far away" memory land...
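
E.g. the kind of thing I mean (made-up names; a minimal sketch):

Strict

Global gGravity:Float = 0.5
Global particleY:Float[] = New Float[1000]

Function UpdateParticles()
	' Copy the heavily-used global into a local once...
	Local gravity:Float = gGravity
	' ...then only touch the local inside the hot loop
	For Local i:Int = 0 Until particleY.length
		particleY[i] :+ gravity
	Next
EndFunction

UpdateParticles()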

One thing I also do is make an "alias", so if I use a type a lot in a function that is perhaps a field of a type which is a field of a type and so on (e.g. MyBigType.SmallerType.ContainerType.TinyType), then I make an alias to it. Partly to avoid typing all that out, and also because I wondered if accessing TinyType through all those pointers was slower than having a single alias type pointer.
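
E.g. (a stripped-down sketch using made-up type definitions around the same names):

Strict

Type TTiny
	Field value:Int
EndType

Type TContainer
	Field TinyType:TTiny = New TTiny
EndType

Type TSmaller
	Field ContainerType:TContainer = New TContainer
EndType

Type TBig
	Field SmallerType:TSmaller = New TSmaller
EndType

Local MyBigType:TBig = New TBig

' Alias: follow the pointer chain once, then reuse the direct reference
Local tiny:TTiny = MyBigType.SmallerType.ContainerType.TinyType
For Local i:Int = 0 Until 100
	tiny.value :+ 1
Next
Print tiny.value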

As a matter of interest, years ago in DOS I wrote two C programs (mini games actually); one used Structs and one used Arrays, and I compared the speed. The arrays won by a large margin. I understand that compilers may be more optimised for classes now (and maybe CPUs are?) than Borland Turbo C++ was then, but Arrays are probably still faster. It's just that arrays are pretty much impractical for large games (although I still use a few Global arrays for some parts of my games - old habits I guess).

Interesting end post about cpu cycles and the like, thanks!


I also agree with what Gabriel is saying in parts 1 and 2 of his last post, although I would add that I'm pretty sure different areas of memory have different speeds too (and stuff about RAM page boundaries etc), so if a type is big enough, some fields may be in the registers, some may be on the stack, some may be in faster (or cached) memory and some may just be sitting in slower memory. That's my (possibly false) understanding of it. Oh yeah, and my big type is like 1000+ lines with tons of fields and methods... but because I'm not passing it into anything (and because I'd only be passing a memory address, not an elephant through the eye of a needle) it doesn't matter anyway.


ImaginaryHuman(Posted 2008) [#14]
Regarding aliases, yes, I believe that'll be faster, because each of the variables in between the dots (e.g. mytype.mytype2.mytype3) can change, so the code has to look at each of those pointers and jump to the next one - at least 3 pointers being accessed before getting to the end type - whereas after a simple mytype=mytype2.mytype3.mytype4 the alias takes you straight there.

I'm not sure if using locals and then passing them as parameters is faster than not passing anything. I guess it depends how BlitzMax might see that scenario and whether it optimizes it. If not, it probably copies them into a temporary variable, but maybe using Var will help? Never really tried that because I was more interested in not passing anything.

I make local copies of pretty much anything that gets accessed more than once.

Modern CPUs - I'm not sure of the register count, but it's a lot more than the 16 you'd get on the Amiga, maybe 32 or even 64. I do wonder also what happens when they run out, so where possible I still try to keep algorithms from being too big and complex.

Also, arrays - or to put it another way, memory access with pointers - are always going to be the fastest way, faster than any linked list of types. The only drawbacks are needing to preallocate/premeditate how much space you're going to need ahead of time as a `chunk` of memory, what to do when you fill up that space, and how to add or remove/insert elements of data efficiently.

One method I came up with is to keep a counter of how many items are stored in the array. When you add an item you increase the counter. When you want to remove an item, you decrement the counter and take whatever item is at the counter position (the end) and overwrite the item you're removing - it's sort of an automatic defragmentation. That works OK if your data doesn't need to stay sorted. Inserting can be done similarly, by copying the item at the insertion position to the end and then overwriting it with the new item.
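
Something like this, as a minimal sketch:

Strict

Global items:Int[] = New Int[100]
Global count:Int = 0

Function AddItem(value:Int)
	items[count] = value
	count :+ 1
EndFunction

Function RemoveItem(index:Int)
	' Overwrite the removed slot with the last item and shrink the count.
	' Order isn't preserved, but nothing else has to be shuffled around.
	count :- 1
	items[index] = items[count]
EndFunction

AddItem(10) ; AddItem(20) ; AddItem(30)
RemoveItem(0)	' the 30 now sits where the 10 was
Print "count=" + count + " items[0]=" + items[0]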

The main problem comes with resizing or running out of space. What I do actually is I made a linked list of arrays of objects, and then each item I want to refer to I do with a `link` pointer and an array index. It then ends up being two jumps to get to an item instead of one, but then you can transcend the boundaries of how much space is allocated. You simply add another chunk of space by adding on another link in the list containing a new array. When you get to the end of an array you just jump to the next link and the start of its array. This way you get some of the benefits of the expandability of linked lists plus the processing efficiency of being able to go through most items as an array. After all if your data is in the same sequence most of the time or the sequence doesn't matter there really is no point having the overhead of previous/next link pointers for every single item, and to have to go through them for every item you access.
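
A rough sketch of that arrangement (made-up names, plain Ints standing in for real objects):

Strict

Const CHUNK_SIZE:Int = 4

Type TChunk
	Field items:Int[] = New Int[CHUNK_SIZE]
	Field count:Int
	Field nextChunk:TChunk
EndType

Global firstChunk:TChunk = New TChunk
Global lastChunk:TChunk = firstChunk

Function AddItem(value:Int)
	' When the current chunk is full, link a new one onto the end
	If lastChunk.count = CHUNK_SIZE
		lastChunk.nextChunk = New TChunk
		lastChunk = lastChunk.nextChunk
	EndIf
	lastChunk.items[lastChunk.count] = value
	lastChunk.count :+ 1
EndFunction

For Local i:Int = 1 To 10
	AddItem(i * 10)
Next

' Walk the list: one pointer jump per chunk, plain array access within each chunk
Local chunk:TChunk = firstChunk
While chunk
	For Local i:Int = 0 Until chunk.count
		Print chunk.items[i]
	Next
	chunk = chunk.nextChunk
Wend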


Grey Alien(Posted 2008) [#15]
Thanks for the feedback.

The only drawbacks are needing to preallocate/premeditate how much space you're going to need ahead of time as a `chunk` of memory, what to do when you fill up that space, and how to add or remove/insert elements of data efficiently.
I either make the array big enough to cover the maximum-case scenario for the game (and maybe code a limit in-game if required), or have it so that it's a certain size, say 100 big, and then when it goes over that, I add another 100 (not 1), so you only need to grow it in chunks instead of each time you add an element (much faster!)
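
E.g. growing by a whole chunk of 100 with an array slice (a minimal sketch):

Strict

Const GROW_BY:Int = 100

Global items:Int[] = New Int[GROW_BY]
Global count:Int = 0

Function AddItem(value:Int)
	' Only resize when the array is actually full, and grow by a whole chunk
	If count = items.length
		items = items[..items.length + GROW_BY]
	EndIf
	items[count] = value
	count :+ 1
EndFunction

For Local i:Int = 0 Until 250
	AddItem(i)
Next
Print "capacity=" + items.length + " used=" + count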

Regarding your last paragraph, yes, I'm wondering if my particle system, which uses a TList to store the particles, should write a pointer to each particle into an array so that the array can be looped through quickly instead of the list. In fact, it should probably be an array of particles anyway, and not even use a list. I was thinking of making a TFastParticle type which uses arrays instead of lists and maybe has a reduced set of "fancy effects" so there are fewer "IF" statements checking for effect flags. I could even have TFastParticle1, TFastParticle2, TFastParticle3 etc which all do slightly different things. Bit crappy from an OOP point of view (unless I could use polymorphism to do it, hmm) but nice and fast for each special case.

At the end of the day, that's what it boils down to: do you write special, harder-to-maintain code for situations where it needs to be well optimised? This would be a non-issue if CPUs were uber-fast, but they're not fast enough yet...


ImaginaryHuman(Posted 2008) [#16]
Yah well, OOP is really nice but it is not necessarily the most efficient approach. Sometimes a good bit of procedural code will be much faster. Neither of the two paradigms has all the bases covered; they're just different sides of the same coin.

Compared to the old days of a 68040 CPU at 25MHz, which was only about 19 MIPS, I would think that today's 2GHz+ CPUs would be considered fast! Maybe it's how people use them, or how sloppy the code has become, or how abstract.

I would definitely opt for arrays for particles if you want efficiency.


Grey Alien(Posted 2008) [#17]
Well I'm using a modern CPU to punt around hundreds of particles and in the old days you'd have to optimise if you wanted to use >8 sprites :-) Basically special effects have got bigger and framerates have improved so lots more logic processing (and graphics blitting) is required.


AlexO(Posted 2008) [#18]

No one said it was, did they? I know I didn't.



Just the OP's question seemed to imply it.


I ask this because I wonder if the compiler is optimised to use params quicker than fields of a potentially big type, or vice versa?



But it all seems well and good now, nothing to see here. :)