Optimising Particles - Pooling etc.

BlitzMax Forums/BlitzMax Programming/Optimising Particles - Pooling etc.

Grey Alien(Posted 2008) [#1]
Using Arrays instead of Lists
=============================

Currently my particles are added to a list which is iterated through to draw them, and when they fade away or go off-screen they are destroyed (removed from the list so the GC can clean them up).

Sounds fine and is easy to manage, but people have said before that I should be using a particle pool i.e. a pre-made array. I presume this is for two reasons:

1) Iterating through an array is quicker than a list because you simply access the next slot instead of having to read list links, which is slower due to the extra steps. I assume this would make more difference if there were a large number of particles.

Problems I see with this: If I create 1000 particles which disappear at different times, when I get to say 100 particles left, my list will iterate nice and quickly but with an array I'd still have to loop 1000 times and check each slot to see if the particle is still "active" - that could be slower than the list iteration! I could shuffle up the particles when one was deleted but that would be really slow and impractical I'm sure.

2) The TParticle objects can be pre-made and associated with each array slot so that instead of having to create a new TParticle object to add to a list (slow due to mem allocation), you simply set (overwrite) the parameters of the existing TParticle at the current array slot and make it "active".

Problems I see with this: If my Particle type has many different fields that do different things I couldn't just set the few fields that I'm using this time, I'd need to clear all the fields to default values and then set the values that I wanted to actually use. This could be slow.

So am I missing something more obvious about why particle pooling in an array is better than using a TList? Perhaps even with the problems I've listed above, using an array is still quicker?
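To make point 2 concrete, here's roughly how I understand the pooled approach (sketched in Python as compact pseudocode rather than BlitzMax; all the names are made up):

```python
class Particle:
    """One pool slot. Every field must be reset on reuse."""
    def __init__(self):
        self.reset()

    def reset(self):
        # This is the cost I'm worried about: clearing ALL fields
        # to defaults, even the ones this particular effect won't use.
        self.active = False
        self.x = self.y = 0.0
        self.vx = self.vy = 0.0
        self.scale = 1.0
        self.rotation = 0.0

POOL_SIZE = 1000
pool = [Particle() for _ in range(POOL_SIZE)]  # pre-made once, up front

def spawn(x, y, vx=0.0, vy=0.0):
    """Overwrite a free slot instead of allocating a new object."""
    for p in pool:
        if not p.active:
            p.reset()                 # pay the full-reset cost...
            p.active = True
            p.x, p.y, p.vx, p.vy = x, y, vx, vy  # ...then set what's needed
            return p
    return None  # pool exhausted
```

No allocation happens after startup, which is the whole point; the open question is whether reset() plus the free-slot search outweighs the list approach.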

"Normal" Flag?
==============

On a related topic, my Logic() and Draw() methods for TParticle are quite complex due to all the different things the particles can do (lots of If statements). Perhaps I could optimise these by having a "normal" flag which, when set to 1, lets me draw a particle nice and simply and bypass all the If statements? At the moment the normal particle is drawn last, after a ton of If statements, which slows down the majority of particles.
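Something like this is what I have in mind (Python as pseudocode; the field names are invented):

```python
class P:
    """Hypothetical particle with a 'normal' fast-path flag."""
    def __init__(self, normal=True):
        self.normal = normal
        self.x = self.y = 0.0
        self.vx = self.vy = 1.0
        self.scaling = self.rotating = not normal
        self.scale, self.scale_rate = 1.0, 1.1
        self.rotation, self.spin = 0.0, 5.0

def update(p):
    # Fast path: one flag test, then straight to the simple movement.
    if p.normal:
        p.x += p.vx
        p.y += p.vy
        return
    # Slow path: the pile of Ifs that only the fancy particles need.
    if p.scaling:
        p.scale *= p.scale_rate
    if p.rotating:
        p.rotation += p.spin
    p.x += p.vx
    p.y += p.vy
```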

Polymorphism
============

Of course it would probably be better OOP to have different particle classes, starting with a really basic one and going up to more advanced ones, and then I can use polymorphism so that when iterating through the list (or array) of particles, Logic() and Draw() are called for the correct particle class. This would keep the simple particles fast.

However, what if I have, say, particle type B which handles scaling and particle type C which handles rotation. How can I make a particle type D which handles both without some blatant copy/pasting from B and C (because the scaling and rotation code is not in the base class A)? Is this what multiple inheritance is for?

If anyone has any insights, I'd be very glad to discuss them. Thanks!


degac(Posted 2008) [#2]
I'm using a particle engine based on a pre-initialised array. Creating and destroying objects is not good practice.
I'm using a flag (status=1 means the particle is live, =0 means it's dead and can be 'reused').

Problems I see with this: If I create 1000 particles which disappear at different times, when I get to say 100 particles left, my list will iterate nice and quickly but with an array I'd still have to loop 1000 times and check each slot to see if the particle is still "active" - that could be slower than the list iteration! I could shuffle up the particles when one was deleted but that would be really slow and impractical I'm sure.


I resolved the problem using a While..Wend with variable limits.
Every time I need to use a new particle, part_counter is increased, and last_part can start from 0 or from a higher value.
Maybe it's not the best solution, but it seems to work...

Here's a little extract of the 'CreateParticle' function:


When the particle is 'dead' I set its status to '0' and last_part=0.
Technically I could check if last_part is > 0, but I'm lazy...


JoshK(Posted 2008) [#3]
Use an array for position, an array for velocity, etc., and keep the particles in a constant loop.

I actually ended up updating particles at a fixed step interval, and then interpolating between the previous and next positions every time they get rendered. It was much easier to keep them in a constant loop this way, and it worked well with the physics system.
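The interpolation part is just a lerp between the positions from the last two fixed steps (Python sketch; the numbers are hypothetical):

```python
def lerp(prev, curr, alpha):
    """Blend the previous and current fixed-step positions for rendering.
    alpha = time since the last logic tick / fixed step length, in [0, 1]."""
    return prev + (curr - prev) * alpha

# Positions of one particle at the last two logic ticks:
prev_x, curr_x = 10.0, 14.0

# Halfway between ticks, render at 12.0:
render_x = lerp(prev_x, curr_x, 0.5)
```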


dmaz(Posted 2008) [#4]
here is a pooling TList
http://www.blitzbasic.com/Community/posts.php?topic=62667#700197

I didn't get -any- interest in this when I posted it, but it works great for me. The one thing is you can't mix types. The example at the end should give you enough explanation.

All this does, though, is address the memory-allocation performance; it does not do anything for the rest of your concerns... but between lists and arrays... I think the only advantage of arrays is speed. The ease of lists trumps the slight gain you get, as most often you'll find other areas that will give you better results.


Grey Alien(Posted 2008) [#5]
degac: OK thanks, so you aren't iterating the whole array, just part of it. Problem is, if the particles are being killed off randomly you might still need to iterate the whole array, I guess.

Leadwerks: So go totally array-based for all the particle properties? I guess you are suggesting that because you've been using non-OOP Blitz3D for a while. It would undoubtedly be faster though. For an area of a game that needs to be so highly optimised, perhaps it's OK to go non-OOP. That's how I did my particles for my BlitzPlus games (and BBasic2, and Assembly etc).

dmaz: Thanks for the link. That's an interesting approach, it has the advantages of list management but using pre-created types in a pool, so addresses one of my original issues.

So in conclusion so far, it seems that there isn't any neat obvious solution I've been missing right?


JoshK(Posted 2008) [#6]
Problem is if the particles are being killed off randomly

This should never happen. They should have a constant lifetime and should always be in a regular cycle. You should have an array for each particle's "lifetime". You can add a "hidden" array to hide particles (like turning the emitter on and off), but they should keep cycling, even if they are not visible. If a particle is hidden you can skip everything except the life/time updating.


Dreamora(Posted 2008) [#7]
Constant lifetime is nice.
But:

1. Particles can go offscreen -> useless to update them
2. Particles can have alpha <= 0 -> useless to update them
So in both cases they'd better get killed early, unless you want to ensure that low-end systems (casual systems) can't play your game.


AlexO(Posted 2008) [#8]

"Normal" Flag?
==============
....

Polymorphism
============
....




Honestly, if speed is your issue, I think you can only go so far with arrays vs lists. Getting a fast update speed is only half the battle, I would imagine. When it comes time to actually draw the beast, you'll hit a wall on performance sooner than you think if you stick to BlitzMax's standard draw commands (TImages with DrawImage and the like). Someone on these forums is making some sort of particle editor for BlitzMax (made in BlitzMax also?) that may have addressed the rendering issue of tons of sprites. I think in the end you may have to make your own calls to DX or OpenGL (I believe it's referred to as 'batching'). I think the name of the app was 'Splash' or something similar? Contacting him might provide some insight into speeding up particle simulations.

*EDIT*
was called splash:
http://www.blitzmax.com/Community/posts.php?topic=75201


JoshK(Posted 2008) [#9]
1. Particles can go offscreen -> useless to update them

They can come back onscreen as well. So you'd better update them.

2. Particles can have alpha <= 0 -> useless to update them

Particles should not reach an alpha of zero until their lifetime is complete. I make mine fade out over the last 25% of their life (and fade in over the first 10%).
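That fade envelope is piecewise linear in the particle's life fraction; with the 10%/25% figures above it looks like this (Python sketch):

```python
def alpha_for(life_frac, fade_in=0.10, fade_out=0.25):
    """Alpha from life fraction t in [0, 1]: ramp up over the first 10%
    of life, hold at full alpha, ramp down over the last 25%. Alpha only
    reaches zero exactly when the lifetime is complete."""
    t = min(max(life_frac, 0.0), 1.0)
    if t < fade_in:
        return t / fade_in            # fading in
    if t > 1.0 - fade_out:
        return (1.0 - t) / fade_out   # fading out
    return 1.0                        # fully visible
```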

The cost of updating a single particle should be negligible. The fillrate of drawing all those blended layers will become a bottleneck long before the CPU processing does.

You should think of a particle emitter as a single dynamic mesh, not as a collection of parts.


Dreamora(Posted 2008) [#10]
GA is talking about 2D, not 3D, and about casual systems, not the unrealistic target systems you aim at with your engines, Josh... Not everyone has dual to quad cores and high-end cards. Most here actually target the more realistic indie/casual market to sell an end product, not middleware engines, where you can afford to "ignore" that little fact because it won't be your problem in the end.

So fillrate becomes a problem far earlier than you expect it to, as onboard cards have a very low maximum fillrate.
And if it were negligible, I don't think GA would "waste" time on improving it, do you?

I agree on the alpha point, but if you have alpha variance at creation, that means ignoring the preset lifetime and doing additional per-particle calculation to adjust the lifetime - quite CPU-intense if we assume target systems of a 1000MHz single-core P3 and 1000-2000 particles.


Bremer(Posted 2008) [#11]
I am using lists in my system, and for anything I have used it for, it's plenty fast, and it's so much nicer to work with than arrays. I have not really done over 2000 or so particles, but in a casual game do you really use much more than that at any given time? I am going to stick with lists because at the moment I don't feel that I am gaining that much by using arrays, and it's just more of a hassle coding-wise in my view.


Dreamora(Posted 2008) [#12]
I'm using lists as well. Two, actually. One for the active particles and one for the pool, which has logic to reduce the pool size over time.

The performance gains from this approach versus create/destroy were quite stunning, actually more than I would have thought back then. (Can't offer numbers; I did this test long ago when I wrote the initial version of Kamaya.)


Jake L.(Posted 2008) [#13]
It makes absolutely no difference time-wise using arrays or lists for your particles. Sure, sometimes arrays are faster, sometimes lists; it depends on what you're doing and how. But all the speed tests run over thousands or millions of iterations to see a noticeable difference, so they're "academic" and not real life for a system with <1000 particles on screen (if you need more particles you're really doing something wrong).

I use lists (one for active, one for idle particles) for my particle engine. But there are many other things to optimize that have a bigger impact: The way you're drawing them and how you organize your textures, how you calculate them.

Just by optimizing your calculations and adding/removing if-then-else structures where useful you get a good boost.


JoshK(Posted 2008) [#14]
You're right, I don't know anything about hardware and performance.


ImaginaryHuman(Posted 2008) [#15]
For me I am using arrays, they are faster, and while it might be true that lists give you certain advantages it's really a matter of whether you think those advantages are more important than speed.

One thing you can try is what I call an `auto-defragmenting array`. You keep a counter of how many items are stored in the array - i.e. defined particles. This starts at 0. When you add an item you add it at the counter position and then add 1 to the counter. When you remove/kill an item, instead of trying to wipe it out or delete it, you copy the item at position `Count-1` over the item you are deleting, and then subtract 1 from Count. In other words you're taking the item from the top end of the defined items and moving it to fill the gap created by the item you want to remove. The benefit is that you can then loop from 0 to Count-1 knowing there are no gaps, so you don't have to detect dead objects. You really want to avoid having to skip over stuff if you can.

One downside is that because you are defragmenting the space and keeping it compact, any kind of `sort order` goes out the window and your particles end up in somewhat random order. Presumably, since particles move all over the place, you probably don't usually care. The only issue might be that the user sees one particle suddenly jump in front of another.

I would definitely avoid ANY kind of memory allocation or deallocation at any time during the normal running of your game loop with regards to particles. You don't want the garbage collector to get involved at all, especially not on a per-particle basis. Memory allocation done many times is quite slow. Using a linked list means doing memory allocs/deallocs for every single particle, which is not efficient for speed. It is convenient for flexibly adding or removing storage space, or for re-ordering a sorted list, but those are really the only benefits.

You can overcome the issue of a limited array size by either having a big array which you know will be big enough to store all possible co-existing particles, OR by making yourself a linked list of arrays. You can then say: okay, each array represents a chunk of memory space, and the chunks are joined together by a linked list. This way you get some of the benefits of linked lists - you can easily add extra chunks and remove them when not needed - whilst avoiding the overhead of an indirect pointer for every particle you visit. You want to be able to process *most* or almost all particles in as compact an array format as possible. You don't want to be allocating memory every time you add a particle, or deallocating it every time you remove one - just do it once in a while, when the number of particles exceeds the size of the current array. You can still do the auto-defragmenting; it's pretty simple and quick to cater for working across array boundaries - you just check for when the counter is 0 and, if so, go to the previous link in the linked list, look at the array there, and set Counter = Array Length.
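The chunked idea in sketch form (Python as pseudocode; CHUNK is tiny here just to show the mechanics):

```python
CHUNK = 4  # chunk capacity; in practice this would be much larger

class Chunk:
    """A fixed-size array of slots, linked to the next chunk."""
    def __init__(self):
        self.items = [None] * CHUNK
        self.count = 0        # live items occupy [0, count)
        self.next = None

head = Chunk()

def add(item):
    """Append into the first chunk with space; allocate a new chunk
    only when every existing one is full - one allocation per CHUNK
    particles instead of one per particle."""
    c = head
    while c.count == CHUNK:
        if c.next is None:
            c.next = Chunk()
        c = c.next
    c.items[c.count] = item
    c.count += 1

def total():
    n, c = 0, head
    while c is not None:
        n += c.count
        c = c.next
    return n
```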

I recommend also throwing out any kind of object orientation. OOP is great for a lot of things but not so great for efficient processing. Like linked lists, having lots of types is going to add overhead and fragment things, and fragmentation leads to inefficiency - if you're going for speed, that is. Imagine how you'd do it if you didn't have any types or objects at all. You'd have several arrays, in parallel, storing one field per array for each object. If you want a type number or some way to identify different kinds of particles, fine. You can wrap several arrays into a linked list by making your own very simple linked list type - all you really need is a Previous, a Next, and then your arrays as fields. When you want to expand, add a link.

I also recommend you try using function pointers. You can totally remove all of your If statements by just having an array of function pointers, i.e. Field Func()[1000]. You could have separate arrays for your Draw() functions and separate ones for whatever other kinds of function you need. Then all you need to do is set the function for each particle and loop through the array(s) calling the functions. They will be automatically mapped to whatever functionality you want and whatever ways you want to draw your particles differently. There is a slight overhead, in that you now have to read through an array in order to call the functions, but that's probably faster than the worst-case scenarios where you have to go through several If tests to get to the function you want in realtime. Also note that you really don't have to `kill` or remove a particle - you could just hide it.
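In BlitzMax that would be a Field Func()[...] array; the same dispatch idea sketched in Python (the draw functions are invented placeholders):

```python
def draw_simple(p):
    # the cheap path: no scaling, no rotation
    return "simple@%s" % p['x']

def draw_rotated(p):
    # a fancier path, selected once at spawn time
    return "rotated@%s,%s" % (p['x'], p['angle'])

# One draw function stored per particle: the update loop then has
# zero If-tests, just an indirect call per particle.
particles = [
    {'x': 1, 'draw': draw_simple},
    {'x': 2, 'angle': 90, 'draw': draw_rotated},
]

frames = [p['draw'](p) for p in particles]
```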

Also, you may want to consider an `acceleration structure` if you're really looking at a larger number of particles, and especially if you are looking to do collision detection. For example, even a simple axis-aligned grid will dramatically speed up your collision tests. You have a two-dimensional array of cells, and each cell has a linked list of the objects which are within that cell; then it's a matter of moving objects between lists as they cross cell boundaries. You then have to test the contents of each cell against its neighbouring cells, presuming you refer to the top-left corner of each object as its general position. You really have to make your grid cells at least as large as the largest particle, but you can also do nested grids of grids if you need to. There are also more advanced structures like bounding volume hierarchies and scenegraphs and all that.
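A minimal version of such a grid (Python as pseudocode; CELL is an assumed size, at least as large as the largest particle):

```python
CELL = 32  # cell size in pixels

def cell_of(x, y):
    return (int(x) // CELL, int(y) // CELL)

grid = {}  # (cx, cy) -> list of objects whose position falls in that cell

def insert(obj, x, y):
    grid.setdefault(cell_of(x, y), []).append(obj)

def nearby(x, y):
    """Objects in this cell and its eight neighbours: the only
    candidates a collision test ever has to look at."""
    cx, cy = cell_of(x, y)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    return out
```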

Also ideally you'd want to get your particle data into some format uploaded like vertex arrays so the graphics card can render faster. But that's a more complex issue.

I hope this helps.


impixi(Posted 2008) [#16]
I’ve whipped up several rough tests:

EDIT 21-MAR-08: Tweaked and improved the tests a little. Added a test incorporating dmaz's list logic.

Array and index-related rendering:


Array and status-related rendering:


List and real-time particle creation / deletion:


Particle pool list and live particle list:



'Particle pool list and live particle list
'(Incorporating dmaz's non-BRL list logic):



JoshK(Posted 2008) [#17]
If you are doing particles on the GPU you will want these properties to be in arrays so they can be uploaded.


Grey Alien(Posted 2008) [#18]
Glad to see this topic has livened up.

This should never happen. They should have a constant lifetime, and should always be in a regular cycle
At the moment I kill them off from my list when they are invisible, but with the array I wouldn't kill them off, just flag them as inactive. Therefore I was saying you'd still have to loop through all 1000, but for a large number of them you'd just be checking the active flag and then ignoring them (not updating them). I wondered if that was less efficient than having a list which was smaller because dead particles are actually removed from it.

I would imagine. When it comes time to actually draw the beast
Agreed but every little helps especially on slower machines. Yes that particle editor is by Jake L. and he has already kindly given me the optimised drawing code.

I have not really done over 2000 or so particles, but in a casual game do you really use much more than that at any given time
I'm hoping to use quite a few particles. Some effects like fire or magical lines of fire, or magical dust can use a ton of particles and so I want the engine efficient for that.

I'm using lists as well. 2 actually. One for the active particles and one for the pool which has a logic within to reduce the pool size over the time
Interesting idea, so this helps to remove the slow memory allocation for creating and destroying types over and over.

Just by optimizing your calculations and adding/removing if-then-else structures where useful you get a good boost
Agreed. Hence my question about polymorphic particle types which no one seems to have commented on.

@ImaginaryHuman:

One thing you can try is what I call a `auto-defragmenting array`
This is a good idea; it addresses my issue of having holes in the array. Of course there's a small overhead in filling the holes, but then time is saved by not looping the whole array checking for active flags. Yes, normally the draw order is not important, but I wonder if it would look odd if one particle moved in front of another? It might be OK, as what tends to happen is you set off a load which all die at the same time, and whilst they are active you probably set off some more somewhere else. Also, they are often drawn with light blend, so overlap won't matter too much.

I would definitely avoid ANY kind of memory allocation or deallocation at any time during the normal running of your game loop with regards to particles
Yeah, this is what I think I should avoid too.

You'd have several arrays, in parallel, storing one field per array for each object
Yes, I realise this would be faster, having coded like this in the past; it just seems so old-school and inflexible now... It still doesn't solve my issue of different particle types unless I have several different arrays: one for scaling particles, one for rotating ones, one that does both, etc. Just seems messy in a non-OOP way. Ah, I see you've suggested "I also recommend you try using function pointers." That's a good idea. I use function pointers for other things (user-defined functions added to a process) so I could use them for this too.

There is a slight overhead to that, where you have to now read through an array in order to call functions
If the array of functions has the same index as the particle arrays then a separate loop is not needed because you already know the index of the current particle that's being processed.

I'm not doing collision detection.

Also ideally you'd want to get your particle data into some format uploaded like vertex arrays so the graphics card can render faster
Yes I don't know much about this at all, just the idea in principle. It's something I'd like to find out more about as speeding up drawing would be great. Perhaps some of Jake L.'s code does this, I'll have to check again...

Thanks for the ideas.

impixi: Wow, those are some comprehensive tests, nice one. I'll check them out. Interesting that there isn't much difference. How much % between the fastest and slowest? I assume the list-based one with real-time creation/deletion is slowest (my current method).

If you are doing particles on the GPU you will want these properties to be in arrays so they can be uploaded
Is there some kind of batch memory-move command for GPUs, so I could upload an array all in one go? Or do you just suggest arrays for quick manual uploading (via a loop) to the GPU?

Great discussion all!


TaskMaster(Posted 2008) [#19]
The auto-defragmenting idea is neat. I think if you were going to use this, you would want to go through your particle array from back to front when updating. That way, if the last 10 die, you won't be swapping them needlessly.

For instance, let's say there are 250 particles in your array and you are updating front to back. When particle 240 dies, you grab 250 and put it at 240; then if it also dies, you grab 249 and put it at 240; then if that dies, you grab 248 and put it at 240, etc...

If you are going through back to front, 250 would just die, no swap, 249 would die, no swap, 248 would die, no swap, etc. It would really just be a minor time saver, but it would still help.
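Combined with the swap-with-last removal, back-to-front updating looks roughly like this (Python as pseudocode; is_dead stands in for your particle logic):

```python
def update_all(parts, count, is_dead):
    """Iterate back to front, so a swap-remove never pulls an
    unprocessed particle down into an already-visited slot - and a
    dead run at the top end just shrinks count with no swaps at all."""
    i = count - 1
    while i >= 0:
        if is_dead(parts[i]):
            count -= 1
            parts[i] = parts[count]   # swap-with-last removal
        i -= 1
    return count
```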

Also, the draw order thing could get weird, so I am not so sure about that...

Isn't a TList just a linked list? Why not use a TList with a set maximum number of items, and when one dies, move it to the back of the list? Moving items in a linked list is fast. Then you would not have to go through the entire list, just up to the first dead one, then stop. And the order of your particles would never change, so they'd never look funny on the screen.


tonyg(Posted 2008) [#20]
Does removing a link and adding it to the end of the same list recreate the link? If it does, then it will add to the number of objects before GC is called.
If you *do* use the AddLast method, could a pointer be kept to the first 'non-active' particle? When one is re-used, the pointer moves to the next in the list.


JoshK(Posted 2008) [#21]
You can use a vertex attribute, or just use the extra texcoord sets to store your extra data. Then you upload it the same way you send a vertex array or VBO.

If you use a VBO you can render multiple instances of the same particle emitter pretty quickly, with only the particle processing overhead of one emitter.


Grey Alien(Posted 2008) [#22]
TaskMaster: Good idea going backwards. Also a good idea about moving dead ones in the list - if done cleverly, as tonyg suggests, to preserve the link and avoid the GC kicking in.

Leadwerks: Thanks. Sounds pretty cool, but at the moment is a bit beyond my tech comfort zone...


TaskMaster(Posted 2008) [#23]
Maybe mess with the source code for TList and add your own MoveToLast method that just changes the pointers around to move one from its spot in the TList to the end. Of course, you would have to make sure a lot of other things didn't get hosed up, like enumerators and whatnot.


dmaz(Posted 2008) [#24]
I'm using lists as well. 2 actually. One for the active particles and one for the pool which has a logic within to reduce the pool size over the time
Interesting idea, so this helps to remove the slow memory allocation for creating and destroying types over and over.

Hmmm, this is what the link I posted above does for you automatically. In your code, you just do
list.AddLast()
and it takes one from the pool and runs your own clean routine, *or*, if there are none in the pool, it creates a new instance.
(You have to add an Init and a Create function to your type, as per the example.)

But if you want the fastest and easiest to use... your BEST bet is to embed a simple single (or double) linked list in your particle type. It's not a TList; all you do is have a global which points to the first particle, and each particle points to the next one. That's it. This will be as fast as arrays. Add a pool pointer and you remove the memory allocation.

Now, if you implement a double linked list you can remove any particle at any time. If you just do a single list, you can only remove while you are normally processing the list (since you need to know the previous particle), which is all you need for particles.

here's impixi's example modified



tonyg(Posted 2008) [#25]
Hmmm. I posted this in the code archives a while back.
If I use it on any of the examples, it suggests that GC is really not an issue, as everything posts either 0 or 1 ms.
I also tried it with GA's AOTMG game and most GC runs were 0/1 ms, with a single 12 ms spike.
I created some code which didn't remove the TLink of a particle but just moved it within the list.
This also suggested GC calls taking 0/1 ms.
In my test case it might be because I have simply replaced an extra TLink recreated for the particle with a holder for the last link. I added a Print statement and the GC was running every 200 cycles (with the Print statement probably having a large effect on that), so it really doesn't seem to matter that much from the GC point of view.
Code is included in case anybody can spot something wrong or finds it of use:

So, from what I have seen, it's going to be optimisation in rendering rather than logic that helps the most (as I think a few people have stated already).


dmaz(Posted 2008) [#26]
OK, I originally said in my message above that looping through a simple embedded list will be "faster" than an array. Since I hadn't tested it, I decided to replace that claim with "as fast as". I thought it would be faster since the loop would not have to do any pointer arithmetic. The following example seems to show that looping through the list is faster.


anybody see any problems with the above code?

When running, click Run from the IDE a bunch of times to get a feel for the timing.

[edit] changed code to put prints at bottom... didn't see a change though on my machine.


ImaginaryHuman(Posted 2008) [#27]
"but if you want the fastest and easiest to use.... your BEST bet is to embed a simple single (or double)linked list in your particle type. it's not a TList, all you do is have a global which points to the first particle. each particle points to the next one. that's it. this will be as fast as arrays. add a pool pointer and you remove the memory allocation."

That won't be as fast as arrays; you're reading a per-particle pointer in order to get to the next particle, which is not necessary with arrays and is no faster than using a linked list iterated through in one direction. You also don't remove memory allocation, because every time you add another particle to the end of the chain you have to allocate memory for a new instance of a type. All you've really said is: use your own linked list structure instead of BRL's.


dmaz(Posted 2008) [#28]
I think the above test begs to differ about the speed

You also don't remove memory allocation because every time you add another particle on the end of the chain you have to allocate memory for a new instance of a type


No... that's what the pool pointer is for... you only allocate at the beginning of your program, as I did in the modified impixi example.

All you've really said is use your own linked list structure instead of BRL's.

Yep, because BRL's is very general and has more options, which slow it down for this type of application.


dmaz(Posted 2008) [#29]
you're reading a per-particle-pointer in order to get to the next particle

Right, and that's all you're doing. With a For loop you have to read the pointer and then either add or multiply the position/size to get to the next particle.


impixi(Posted 2008) [#30]
I've updated my earlier post to include slightly improved code and a test incorporating dmaz's list logic.

In non-debug mode the 'array and status-related rendering' test is still the fastest for me, though the twin list with dmaz's list logic is a very close second.


Grey Alien(Posted 2008) [#31]
Hmm, this is getting interesting. BTW, Print really screws up timing tests, so only use it at the end of a test, not during.

It would seem that logic is not a big problem compared to render times, but logic becomes more important if you are running it at, say, 200Hz (which I am) and on slower PCs, which also have worse video cards, so it could still be worth using the fastest logic approach (plus it's cool).


ImaginaryHuman(Posted 2008) [#32]
I would still say arrays in general should always be faster, because if you need a pointer for each particle then you have to read an extra 4 bytes per particle, unavoidably.
Arrays are fundamentally lower-level than a linked list of any kind. A linked list is an abstraction layer, so it will incur a penalty.

But anyway, that's only one area of overhead - when it comes to your rendering code and how efficiently you do that, that's likely where most extra time will be taken up.

The size of your particles impacts performance too - larger means more texels to render. But fewer large particles may be faster than the equivalent area covered by more smaller particles - extra per-particle overhead.

One area where you can really speed things up is in how you choose to draw only what's on-screen.


Grey Alien(Posted 2008) [#33]
Agreed that fewer larger is better than lots of smaller.


JoshK(Posted 2008) [#34]
Unless your drawing routine is incredibly inefficient, culling offscreen particles will make absolutely no difference, and it is more likely to slow down the program.


Grey Alien(Posted 2008) [#35]
Because all my particles pretty much explode outwards, if they go off-screen they are gone for good, so it makes sense to kill them off.


JoshK(Posted 2008) [#36]
The overhead of processing those extra vertices will be virtually 0.


ImaginaryHuman(Posted 2008) [#37]
I disagree, Leadwerks. In a test with 10,000 particles where about 80% were off-screen, it was significantly beneficial to only draw the particles that were on-screen and to not even visit the ones that were off-screen (except to move them). It probably depends on what you're using the particles for, how many you have, how many go offscreen, and under what conditions. As usual there's probably no single approach that works best for all situations, because a specific hard-coded single-solution approach will work best in a specific situation.


BadJim(Posted 2008) [#38]
One dumb trick is to draw half a dozen or so particles on one texture, giving the illusion that you are handling half a dozen times the number of particles. It can look quite convincing, although in most cases large particles would also look good.


Dreamora(Posted 2008) [#39]
Yes, but it puts even higher pressure on low-end machines: a larger texture means more bandwidth and higher overdraw if you still draw them on top of each other.
But you're right, used correctly this helps lower the cost (both graphically and CPU-wise) significantly. It's just hard to create "good" clustered particle textures...


Foolish(Posted 2008) [#40]
That's actually pretty clever. I'll have to try that.


JoshK(Posted 2008) [#41]
I disagree, Leadwerks.

Then you are wrong. Are you drawing each particle in a separate call?


Grey Alien(Posted 2008) [#42]
Yes I used that trick for a fireball effect in my Holiday Bonus game.

Are you drawing each particle in a separate call?
Yes I am but that's because I'm using BMax commands and had no real idea how to setup a render pipeline or mesh or whatever it's called in DX.


JoshK(Posted 2008) [#43]
That is incredibly inefficient.


ImaginaryHuman(Posted 2008) [#44]
Well, in my tests, Mr Leadwerks, it is faster not to try drawing the off-screen objects, provided I don't have to visit every object to test whether it's on-screen or not. I'm doing my own OpenGL code, but the one thing I'm not doing yet is a vertex array, which may prove differently.


Grey Alien(Posted 2008) [#45]
That is incredibly inefficient
I realise that, but it's the best I've got as I just use the standard commands. Maybe as you've been using B3D for so long you've figured out all the fancy DX stuff, but I've never really looked into it. Should do, one day...


MGE(Posted 2008) [#46]
You can optimize the cpu logic all you want, but the majority of the time you're going to be limited more by what the gpu can do compared to the cpu.


Jake L.(Posted 2008) [#47]
Comparing "nextgen" engine stuff with BMax default rendering is fun, but it doesn't tell you anything. While a "nextgen" engine won't run on most onboard GPUs and older cards (and ATI :D), most of us want to see our stuff run on as many machines as possible.

If I make a particle system for my GPU I don't care about rendering 1000 unseen particles. Hell, I could even plot them pixel by pixel and get 80fps. And guess what: updating my particles with a single call raises my fps from 900 to 1200. Unfortunately most onboard GPUs I've tested this with don't like buffered output and run slower with this method.

So, I optimize my engine to run as fast as possible on default office PCs with onboard Intel graphics. That's the minimum spec most of us want to serve, so that's the spec to optimize everything against. Anything wrong with this?


JoshK(Posted 2008) [#48]
Vertex arrays aren't exactly "next-gen".


Jake L.(Posted 2008) [#49]
No, they're not, but they're also useless, as they're slower with most onboard chips than rendering with single calls. On better GPUs they bring a major speed boost, but that doesn't matter as those machines are already fast enough without it.

All I wanted to say is: if you target casual gamers playing on office PCs or worse, you'd better check every "boosting method" against Intel GMA and other crappy chipsets, as that's the "GPU" your game/app will face. Optimizing graphics code on an 8800GT is pointless unless you're working on AAA titles, isn't it?


JoshK(Posted 2008) [#50]
as they're slower with most onboard chips than rendering with single calls

I can tell you with certainty that is not true.


tonyg(Posted 2008) [#51]
Ooooo this could be good. Leadwerks is going to supply some super Particle Optimisation trick... I think... maybe... soon...possibly?
I hope it's good as it has been one helluva build-up.


JoshK(Posted 2008) [#52]
There's not much to it: just create a bank, poke the particle vertex positions into it, and render it as quads. You don't need an index array, just a vertex array and a texcoord array.
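
[Editor's note: the "poke positions into a bank" idea can be sketched outside BlitzMax too. Build one flat vertex array plus one texcoord array for all the particle quads, then hand the whole batch to the API in a single call (e.g. GL_QUADS via glVertexPointer/glTexCoordPointer). Python's `array` module stands in for a BMax bank here; this is an illustrative sketch, not Leadwerks' code.]

```python
from array import array

def fill_quad_arrays(particles, size):
    # Pack (x, y) corner positions and UVs for every particle quad
    # into two flat float arrays, ready for one batched draw call.
    verts = array("f")
    uvs = array("f")
    half = size / 2.0
    for (x, y) in particles:
        # four corners per particle, counter-clockwise
        verts.extend([x - half, y - half,
                      x + half, y - half,
                      x + half, y + half,
                      x - half, y + half])
        # whole texture on every quad
        uvs.extend([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
    return verts, uvs

verts, uvs = fill_quad_arrays([(10.0, 10.0), (50.0, 20.0)], size=8.0)
print(len(verts))  # 16 floats: 2 quads * 4 corners * (x, y)
```

The arrays are rebuilt each frame as the particles move, which is fine for a vertex array since the data is resent every frame anyway.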


impixi(Posted 2008) [#53]
I'd also like to see code examples for the mentioned techniques. Theory is well and good but implementation confirms it...

Like Grey Alien, my current knowledge-level confines me to using standard BlitzMax commands, but any DirectX- or OpenGL- specific optimisations would be interesting to see...


Grey Alien(Posted 2008) [#54]
Jake L.: Didn't we test that sort of stuff in the app you sent me, and get some gains on some machines in certain circumstances, while other methods weren't as noticeable on slow machines, or something?


ImaginaryHuman(Posted 2008) [#55]
I think for most BlitzMax coders, vertex arrays are an `advanced topic`, since you're talking about dealing directly with your own OpenGL or DirectX code to set up the arrays and render with them. I agree they are a speed boost and it's a good idea to use them but many people won't know how to.

I think if you're not going to use them, then you need to be careful to optimise elsewhere where possible: put as many particles or frames of animation on a single image as you can to avoid texture swaps, and use other techniques like putting multiple particles on a single image as a cluster.

I don't know that this is strictly a particle-systems-for-casual-games discussion, but it's good to keep a range of hardware in mind.


Jake L.(Posted 2008) [#56]
@Grey

Exactly!

@Leadwerks
I said this as a result of intense tests. At least the crap-chips I tested this on (Intel 965 and some older laptop gpus) run slower when using vertex buffers. If you don't change the buffer each frame (e.g. UI,text) vertex buffers are a little bit faster, but with something like particles (=quads changing coordinates each frame) they're not.

From my tests I can tell that optimizing textures (using the single surface approach we did since Blitz3D) is the most noticeable speed improvement for onboard graphics.


MGE(Posted 2008) [#57]
"I said this as a result of intense tests. At least the crap-chips I tested this on (Intel 965 and some older laptop gpus) run slower when using vertex buffers. If you don't change the buffer each frame (e.g. UI,text) vertex buffers are a little bit faster, but with something like particles (=quads changing coordinates each frame) they're not."

BINGO!


Grey Alien(Posted 2008) [#58]
Yes, now I remember: single-surface textures gave a reasonable speed boost, but vertex buffers didn't make much difference, especially when you had to change the coordinates of the particles as they moved.

you should try to put as many particles or frames of animation on a single image if possible to avoid texture swaps
Yes, I'm going to make sure all my particles are on one texture. Although for animated particles, BMax's LoadAnimImage will load the frames in as separate textures, so I really need to load them "manually" from a single texture by making a special TAnimatedParticle type that stores the coordinates of each frame.
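
[Editor's note: the frame-coordinate bookkeeping such a hypothetical TAnimatedParticle would need is just a grid lookup into the atlas texture. A sketch, assuming frames laid out left-to-right, top-to-bottom; the function name and layout are illustrative:]

```python
# Compute the UV rectangle of one animation frame inside a single texture,
# so each frame can be drawn by setting UVs instead of swapping textures.

def frame_uvs(tex_w, tex_h, frame_w, frame_h, frame):
    cols = tex_w // frame_w            # frames per row in the atlas
    col, row = frame % cols, frame // cols
    u0 = col * frame_w / tex_w
    v0 = row * frame_h / tex_h
    return (u0, v0, u0 + frame_w / tex_w, v0 + frame_h / tex_h)

# 256x256 texture holding 16x16 frames: frame 17 sits at column 1, row 1
print(frame_uvs(256, 256, 16, 16, 17))  # (0.0625, 0.0625, 0.125, 0.125)
```

This is essentially what the SetUV-based TAnim code mentioned later in the thread does: one surface, many clip rectangles.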


JoshK(Posted 2008) [#59]
Vertex buffers will not make any difference, and may actually be slower. Vertex arrays will be much faster.


Grey Alien(Posted 2008) [#60]
So what's the difference between a buffer and an array? A buffer is on-card and you just change values in it? And an array is uploaded each time (presumably with new values)?


Dreamora(Posted 2008) [#61]
Vertex Buffer Objects are GPU-stored, yes. They're especially beneficial if you:
1. Have static geometry, as it doesn't need to be resent on every use
2. Have objects transformed by vertex shaders

But as with any stream, you can't just alter a single value within it; you must replace the whole stream, i.e. replace the VBO on the GPU. If you do that every frame there's no reason to use VBOs; it will most likely perform worse than a plain vertex array (unless the GPU driver realises what you're doing and converts it to a vertex array on its own).

Vertex arrays and VBOs are the two most performant solutions. The other possibilities are nice but not as fast.

As you target low-end machines, vertex arrays are clearly the way to go.


tonyg(Posted 2008) [#62]
So, if I have this right, a Vertex Array is an array of vertex information passed to OpenGL (is there a DX equivalent?) so that each 'segment' of vertex information does not have to be sent individually.
Is that right?
How do you do that with core/native Bmax commands?


Grey Alien(Posted 2008) [#63]
Dreamora: Thanks for the info, I can understand that. VBOs are no good for particles but fine for, say, a tiled background that doesn't move much in a casual game. And VAs are better (faster) than using standard BMax commands, yes?

TonyG: Good question.


Dreamora(Posted 2008) [#64]
Yes and no... on Intel onboard, VBOs most likely won't be faster, simply because the GMA900, for example, doesn't have hardware vertex processing units, so vertices must be processed by the CPU anyway.
But yes, generally they would be good for static stuff, or for stuff driven by a vertex shader (a shader-driven particle system, for example).


Jake L.(Posted 2008) [#65]
With Vertex Buffers I meant Vertex ARRAYS (e.g. using glVertexPointer and similar functions), not VBOs.

Vertex arrays will be much faster.

"I want to believe!"


Grey Alien(Posted 2008) [#66]
Jake L.: Ah, so we were testing vertex arrays after all. Now I'm confused! They were the ones that surprisingly didn't make much difference when the particle data in them changed every frame.


JoshK(Posted 2008) [#67]
Vertex buffers are MUCH faster for static geometry, but they don't have any advantage when the vertex data gets changed every frame.


ImaginaryHuman(Posted 2008) [#68]
Yah and if you're doing particles which are moving all over the place and possibly with vertex animation you aren't going to benefit much if at all from a vertex array - but I would think it might help to use one anyway?


Jake L.(Posted 2008) [#69]
@Leadwerks: That's what I said when I talked about particle engine optimisation (as the thread's topic suggests), not about graphics theory in general.


tonyg(Posted 2008) [#70]
... right. I really want to understand what has been discussed and decided in this thread as Particle Optimisation is an interesting area.
There are Vertex Arrays : an array of vertex information sent to OGL to handle in one call rather than multiple calls.
There are Vertex Buffers : an area of Video RAM which holds vertex information and only really useful when the vertex information doesn't change too much.
Is that right?
.
Is there a native Bmax way of using Vertex Arrays?
Are Vertex Arrays available in both DX and OGL APIs?
(Is this the DX version of Vertex Buffers?)
.
On to the bits I don't understand :
If you use a VBO you can render multiple instances of the same particle emitter pretty quickly, with only the particle processing overhead of one emitter.

Is this some other use of VBOs that WILL help Particles as it seems to contradict the later suggestion that VBOs are better used for static data? e.g.
Vertex buffers are MUCH faster for static geometry, but they don't have any advantage when the vertex data gets changed every frame.

.
Further, many of the people who have responded with test results have stated that Vertex Arrays haven't made any real speed difference.
Is that right?
.
Finally, the suggestion is that single-surface will help. There is the TANIM code on these forums which loads a single texture and then uses SetUV to 'clip' that texture.
Is that what helped optimise the most?
Thanks for any responses.


Grey Alien(Posted 2008) [#71]
Good questions, as I'm still not 100% clear. Regarding single surface, that can definitely help I believe, and a special TAnim type that made use of a single surface would be a good optimisation over the standard TAnim (and would lower VRAM use too).


JoshK(Posted 2008) [#72]
Vertex buffers - fastest to render, especially when instanced, but slow to alter.

Vertex arrays - Slower to render than buffers, but updating makes no difference because the data gets sent from system memory to the GPU each frame anyways.

Vertex buffers can be faster for instanced particle emitters because the data only gets sent once from system memory to the GPU. So there is a small hit doing that, but once it is uploaded, you can render the buffer as many times as you want.
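
[Editor's note: JoshK's buffer-vs-array summary boils down to counting uploads across the bus. This toy cost model just counts them; it is purely illustrative, with made-up function names and no real API calls.]

```python
# Toy model: how many times vertex data crosses from system memory
# to the GPU per second for 20 instances of one emitter at 60 fps.

def vertex_array_uploads(frames, instances):
    # vertex array: the data is resent for every draw, every frame
    return frames * instances

def vbo_uploads(frames, instances):
    # instanced VBO: one emitter updated and uploaded once per frame,
    # then drawn `instances` times from GPU memory
    return frames * 1

frames, torches = 60, 20
print(vertex_array_uploads(frames, torches))  # 1200 uploads
print(vbo_uploads(frames, torches))           # 60 uploads
```

With a single instance the upload counts are equal, which matches the thread's observation that a per-frame-updated VBO has no advantage (and can lose to a vertex array once the VBO's own replace-the-stream overhead is counted).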


tonyg(Posted 2008) [#73]
So if you have, for example, a rotating star as a particle then Vertex Buffers won't help?
However, if you have a non-rotating particle where the vertices (?) do not change but its draw position does, then they CAN speed things up.
I think colour is passed in the vertex information (is that right?) so a colour changing particle would negate the effect. Would that also include alpha?
<edit> If that is all true then Vertex Buffers seem to be of limited use for Particles however fast. So...
Is there a DX version of Vertex Arrays?
Can Vertex Arrays be done with native Bmax commands?


Dreamora(Posted 2008) [#74]
Alpha to DX / GL is nothing else than the 4th color component.


tonyg(Posted 2008) [#75]
So the suggestion is a particle system where the particles change rotation, size, colour or alpha will not benefit from Vertex Buffers.

@Leadwerks, I must be missing something because your statement suggested Vertex Buffers *will* help.
If the vertex information gets sent to the GPU each time any of the above changes then isn't it of VERY limited use for any particle system?
.
If that is true, it leaves Vertex Arrays as the main method mentioned.
I have seen examples of Vertex Arrays in use with OGL but not DX.
Is there a way of using Vertex Arrays with DX?
Can it be done with native Bmax commands (OGL and/or DX) : I guess the answer to this is 'No'.
Some people stated that using Vertex Arrays doesn't really help that much.
There was some interesting discussion from Tachyon.
.
There does seem to be a big difference of opinion between the people who have run tests and not seen much improvement, and those who provide the technical explanation of why things should be quicker.
If they ever got together for testing it would be interesting.


JoshK(Posted 2008) [#76]
For a 3D emitter, where the corners of each particle have to be positioned relative to the camera, you will need to use a vertex shader to draw instanced emitters, or else they will be oriented wrong.

Two ways to render:

Single emitters
1. Poke values to vertex array.
2. Draw.

Instanced emitters
Advantages: Only have to update one particle emitter, regardless of how many instances are present.
Disadvantages: No interaction with scene or anything else except simple behavior, since they are all the same, requires vertex shader.
1. Poke values to vertex array.
2. Upload vertex array into vertex buffer.
3. Draw as many times as you like.

Instanced particles are ideal if you had something like 20 torches. You only have to update one torch emitter, but all 20 instances can be drawn quickly.
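
[Editor's note: the torch example can be sketched as update-once, draw-many. One emitter is simulated; every instance is the same particle data drawn at a different offset. The structure below is illustrative, not Leadwerks code.]

```python
import random

class Emitter:
    def __init__(self, n):
        self.particles = [[0.0, 0.0] for _ in range(n)]
        self.updates = 0

    def update(self):
        self.updates += 1
        for p in self.particles:
            p[0] += random.uniform(-1, 1)   # flicker sideways
            p[1] -= 1.0                      # rise like a flame

def draw_instances(emitter, instance_positions):
    # one shared particle set, emitted once per instance offset
    quads = []
    for (ox, oy) in instance_positions:
        for (x, y) in emitter.particles:
            quads.append((x + ox, y + oy))
    return quads

torch = Emitter(50)
torch.update()                               # simulate ONE emitter
positions = [(i * 100.0, 0.0) for i in range(20)]
quads = draw_instances(torch, positions)
print(torch.updates, len(quads))  # 1 1000: 20 torches for one update's work
```

In a real renderer the inner loop would be replaced by re-drawing the uploaded vertex buffer at each offset, which is where the instancing saving comes from; the CPU-side simulation cost stays that of a single emitter.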


ImaginaryHuman(Posted 2008) [#77]
I don't think anyone here is thinking about making a 3D particle engine, we're all talking 2D using flat quads and no perspective rotations.


Dreamora(Posted 2008) [#78]
Even with perspective rotations you don't need vertex shaders for correct rotation.

The rotation applied to the emitter node itself can be set so that the node faces the camera the same way as before (if we give the node a direction). It will then look the same as the original, because the "centre plane" (project the whole emitter onto a plane perpendicular to the camera-emitter line) is oriented towards the camera again.


HrdNutz(Posted 2008) [#79]
check out http://www.blitzbasic.com/Community/posts.php?topic=76924

I just made a simple dynamic vertex buffer prototype which improves throughput performance. It needs work, but it gives an idea of how to batch up geometry and send it down the pipeline using a few calls. Hope it helps get more particles on screen :P


Grey Alien(Posted 2008) [#80]
Hmm this is getting interesting now and also we've got a code example, woo!

Leadwerks: OK, thanks for the detail. As ImaginaryHuman says, I'm talking about 2D particles in 2D space, mind you. Every particle needs a unique x,y coord, scale, alpha and rotation. That's pretty much it. Some may animate and some may have a different colour (but I understand alpha is just treated as a colour component anyway). It's no good having multiple particles all at the same rotation or alpha or anim frame, as it will look naff.


JoshK(Posted 2008) [#81]
I am not talking about multiple particles. I am talking about multiple emitters. It would work very easily in 2D.


MGE(Posted 2008) [#82]
I'm going to go on record and state my guess that any real-world use of vertex arrays/buffers for dynamic scenery (particles, sprites, animated tiles, etc.) isn't going to show any real speed increase in DX7 or OpenGL. (And if you're looking to render your background faster, just use larger textures. 1kx1k textures aren't that big a deal nowadays.)

It's not like it's going to take your existing 30fps game and all of a sudden make it render at a solid 60fps. If there is any increase in real world fps it will be marginal at best.

What IS needed for BlitzMax is a batched sprite render system that could be used for particles, sprites, tiles, etc. DX9 using the "Sprite Object" has new support for batching and it speeds things up dramatically.


tonyg(Posted 2008) [#83]
I am not talking about multiple particles. I am talking about multiple emitters. It would work very easily in 2D.

Blimey, I am understanding less and less.
If I understand correctly
- an emitter is a controlling object for particles.
- the vertex array for the particles for an emitter is sent to the API.
- the vertex buffer is used to keep this information for multiple uses of the emitter and all its particles (such as torches).
- the particles are likely to be scaled, rotated, coloured, alpha'd so the vertex data changes each cycle.
- the Vertex Array needs to be resent each frame so the vertex buffer gets changed each frame.
So, the use of Vertex Arrays still seems an obvious benefit (although those who have tested say it isn't that noticeable), but the use of Vertex Buffers, in the case of a 2D particle system, still seems pretty limited.


JoshK(Posted 2008) [#84]
The main savings you would get with instanced emitters would be the savings in CPU processing, since you only have one emitter. Vertex arrays could be used to draw instances, and you would still get your main optimization this way. Vertex buffers are just kind of a detail.

You would only use a vertex buffer if you wanted to draw multiple instances of an emitter; they can actually be slower than vertex arrays if the data is updated each frame. There would be a number of instances past which a vertex buffer would be faster than an array, more than 1 and possibly as few as 2.