Max number of particles possible

BlitzMax Forums/BlitzMax Programming/Max number of particles possible

BinaryBurst(Posted 2012) [#1]
How many particles can you update at 60 fps? O.o

A particle should take 12 bytes:
- 1b red
- 1b green
- 1b blue
- 2b x
- 2b y
- 2b sx (speed on the x axis)
- 2b sy (speed on the y axis)
(That's 11 bytes of fields, padded to 12.)

So that means that on a computer with 2GB of RAM you could store at most 178,956,970 particles, which is roughly 180 million.

So is it possible to update so many particles at 60 fps?
Just curious... :)

Oh, it doesn't have to run in Windows... (or be constrained by an OS at all)
No GPU please..

Last edited 2012


ima747(Posted 2012) [#2]
It would depend on a great many factors. Here are a few:
CPU/GPU: Are you talking drawing them with modern hardware (which will have to pass through the GPU eventually) or just store them?
bits and ram: Is it 64 bit or 32 bit OS? how much ram? (assuming you just want to store them)
Storage approach: How are you storing the values that make up a point? In an object, in lists, in arrays? What are the individual values: 4-byte standard ints or actual bytes? Are you using floats for the speed and/or the position? Any doubles/longs/other overhead? Don't forget the code running around all this data takes up some space and will have its own variables taking up more space etc. If you're using bytes for everything then you can't address more than 65536 positions, because that's how many pixels are in a 256x256 grid and a byte can't hold a coordinate any bigger (and you can't display sub-pixels). Different types of variables are also faster than others depending on the platform; e.g. ints are typically the fastest, so you'll get much more speed using ints, but they take up 4x the memory of bytes.
Processor speed/capacity: if you're updating them what are you changing, the position, the speed, the color, all? How computationally heavy are the updates, incrementing a color value or generating 3 new random numbers, etc.
Optimization approach: Assuming you're only changing the position and you have static color sets you could group them together which would reduce your overhead at the expense of potentially hitting list/array limits based on the architecture.
System and other overhead: Everything lives in or at least passes through RAM. If you have 2GB of physical RAM and you're running Windows 7, you'll have a massively different amount of that free vs. XP. What about other programs? VRAM and the paging that could bring about could have a HUUUUGE impact if you're pushing the hardware limits...


AdamRedwoods(Posted 2012) [#3]
180 million particles in 17ms? If they are not stationary, you can't do it without a GPU. Even then, their paths would have to be in a lookup table.

If they are stationary, you'll hit fill-rates:
each particle is 2 triangles = 360 million triangles every 17ms, i.e. roughly 21.6 BILLION tris/second. How fast are modern GPUs in triangles/second?

I think the Quadro FX 5000 can do ~1 billion tris/second, which isn't fast enough.

You would have to "cheat": camera frustum culling, distance occlusion, etc.
Even then the CPU overhead may be too much.

jkrankie and I got 100,000 sprites here:
http://www.blitzmax.com/Community/posts.php?topic=97398


BinaryBurst(Posted 2012) [#4]
Integer arithmetic is used, simple speed deceleration, fixed colors (just for simplicity), all cores used, ideally run without any OS (the particle system itself is the OS); arrays are fastest.

Code example:

For i = 1 To n
	x[i]:+sx[i]
	y[i]:+sy[i]
	sx[i]:-c1 'constant
	sy[i]:-c2 'constant
	'you can use c1,c2 to make the particle curve or run straight
Next


No GPU. The CPU writes directly to the VRAM (sending only the screen).
This may seem very choppy, but the result may be rewarding.

P.S. I personally like to think that the CPU is way faster than the GPU; even though the GPU is capable of making far more computations per frame, you can't use it as flexibly as the CPU, plus the CPU is much easier to optimize. And think about it... even if the GPU had infinite speed, it would never be able to work on more data than its VRAM supplies. :)

Last edited 2012


BinaryBurst(Posted 2012) [#5]
@AdamRedwoods:
The particles are in 2D (for simplicity's sake),
BUT 3D particles do indeed take more because you have to do the projection (which roughly doubles the work).
Oh, and I got 130,000 2D particles (60 fps) on an 800MHz CPU in BlitzMax. I don't have the code any more (I think...)

Last edited 2012


BinaryBurst(Posted 2012) [#6]
I did some research and, unfortunately, I learned something very sad:
when you process a large set of data (>10 MB) you hit memory latency instead of instruction latency -_-. It took me a while to figure that out, but finally
I got this piece of code:

Save it as "a.asm"
format ms coff

section "code" code
public update_particles as "_update_particles"
update_particles:	
		pusha
		
		mov  [t+ 0], dword pix
		mov  [t+ 4], dword pix
		mov  [t+ 8], dword pix
		mov  [t+12], dword pix
		
		movaps  xmm7, [mrow]
		movaps  xmm6, [scr]
		movaps  xmm5, [r100]
		movaps  xmm4, [t]
		
		mov ecx,[nb]
		sub ecx,16
l:				
		movaps  xmm0, [coor+ecx]	; get the x and y
		paddw   xmm0, [spee+ecx]	; add to x its sx (y its sy)	
		movaps  xmm2, xmm0
		
		movaps  xmm1, xmm6
		pmulhuw xmm0, xmm5			; divide x by 100 (y by 100)
		pcmpgtw xmm1, xmm0			; write 0 to x if x>=640 (0 to y if y>=480)
		pand    xmm0, xmm1			
		pmaddwd xmm0, xmm7			; transform from (x,y) to address (x*4+2560*y)
		paddd   xmm0, xmm4
		movaps  [t],  xmm0			

		mov eax,   [cols+ecx]		; get color
		mov edi,   [t+0]			; get address
		mov [edi], eax		 		; write color to address
		mov eax,   [cols+ecx+4]		;idem for second particle
		mov edi,   [t+4]
		mov [edi], eax
		mov eax,   [cols+ecx+8]		;idem for third particle
		mov edi,   [t+8]
		mov [edi], eax
		mov eax,   [cols+ecx+12]	;idem for fourth particle
		mov edi,   [t+12]
		mov [edi], eax

		movaps  [coor+ecx], xmm2

		sub  ecx,16
		jnz  l

		popa
		ret
		
section "data" data align 16
	
mrow:	dw 4,2560,4,2560,4,2560,4,2560
scr:	dw 640,480,640,480,640,480,640,480
r100:	dw 655,655,655,655,655,655,655,655
t:		times 16 db 0

public coor as "_coor"
public cols as "_cols"
public spee as "_spee"
public pix  as "_pix"
public nb   as "_nb"

coor:	times 20*1024*1024 db 0
spee:	times 20*1024*1024 db 0
cols:	times 20*1024*1024 db 0
pix:	times 4*640*480 db 0

nb:		dd 0


Save it as "a.bmx"
	Import "a.asm"
	Extern 
		Function update_particles() = "update_particles"
		Global coor:Short = "coor"
		Global spee:Short = "spee"
		Global cols:Short = "cols"
		Global pix :Byte  = "pix"
		Global nb:Int	  = "nb"
	EndExtern

	Global pixmap:TPixmap = New TPixmap
	pixmap.pixels   = Varptr pix
	pixmap.width    = 640
	pixmap.height   = 480
	pixmap.pitch    = 2560
	pixmap.format   = 6
	pixmap.capacity = 1228800
 
	Local cor:Short Ptr = Varptr coor
	Local spe:Short Ptr = Varptr spee
	Local col:Short Ptr = Varptr cols
	For i=0 To 20*1024*1024/2-2 Step 2
		cor[i+0] = ( 320+Rnd(-10,10) ) * 100
		cor[i+1] = ( 240+Rnd(-10,10) ) * 100
		
		l# = Rnd( 0.2 )
		a# = Rand( 360 )
		
		spe[i+0] = ( l*Cos(a) )     * 1000
		spe[i+1] = ( l*Sin(a) )     * 1000
		col[i+0] = Rand(65535)
		col[i+1] = Rand(65535)
	Next
	 
	t=MilliSecs()
	For i=1 To 1e9
	Next
	Print "Reported core speed: "+(1000.0/(MilliSecs()-t))+" GHz"
	HideMouse()
	timer:ttimer=CreateTimer(60)
	Graphics 640, 480 
	While Not KeyDown(key_escape)
		t=MilliSecs()
		
		ClearPixels( pixmap )
		nb:+2000 ; nb=nb/16*16 'MUST BE MULTIPLE OF 16!
		update_particles() 
		DrawPixmap( pixmap, 0, 0) 
		
		'SetColor(0,0,0)
		'DrawRect(0,0,100,30)
		'SetColor(255,255,255)
		'DrawText( (1000.0/ms), 0,0)
		'DrawText( (nb/4), 0,15)
		Flip(0)
		
		If 1000.0/ms<60 n:+1
		If n>100 ok=True;Exit
		'WaitTimer(timer)
		ms=MilliSecs()-t
	Wend 
	EndGraphics()
	If ok Print "Max number of particles: "+(nb/4)
	Print "Press enter to close..."
	Input("")


This is the worst case scenario, where you blend particles together and particles never disappear. Yeah, practically the bottleneck here is the memory speed, not the CPU speed. Anyway, I got this:
Reported core speed: 1.58478606
Max number of particles: 718500


So sad... ;( (Oh, the two files have to be in the same folder.)

Last edited 2012


col(Posted 2012) [#7]
Hiya,

Reported core speed: 2.45098042 GHz
Max number of particles: 1663500

SonyVaio VGN-FW31M


BinaryBurst(Posted 2012) [#8]
@col Hello to you too. Did you run it 4 times with all programs closed? Because if you didn't, you should get 50% more particles.

Last edited 2012


col(Posted 2012) [#9]
Yes,

I always run speed tests a number of times to get a fair result.

I'd personally optimize some of those Max2D commands too.
DrawPixmap could be optimized a large amount by removing the function call and making some kind of inline version.

Commenting out those DrawText commands alone can get another 60% again:-

Reported core speed: 2.41545892 GHz
Max number of particles: 2883000


BinaryBurst(Posted 2012) [#10]
:)) Yes, I know. But I had to put something on the screen to show how many particles there are, if you know what I mean. Nobody wants to read through the code to check it's real. Anyway, thanks ;)

Reported core speed: 1.58983515
Max number of particles: 747500


Last edited 2012


col(Posted 2012) [#11]
I can see it's real :)

Have you thought of inlining the DrawPixmap and moving the whole loop to asm?


BinaryBurst(Posted 2012) [#12]
Well, the loop doesn't take much processing (<0.1% of the whole program).
I don't yet know how to get rid of the memory latency. The actual calculation of the particles takes 2 cycles/particle.

Last edited 2012


col(Posted 2012) [#13]
I thought something was strange, as it takes a while to compile the .asm file.
Why would I be getting a 38.4MB a.asm.release.win32.x86.o file? Is it unrolling the loop?

Last edited 2012


BinaryBurst(Posted 2012) [#14]
No, it isn't. It just defines the blank data buffers that the program will access. Don't know why it stores all those zeros in the object file... but you can do some tricks to get around it.


BladeRunner(Posted 2012) [#15]
I don't get the point of the whole thing here.
You are asking how many particles can be drawn without giving any further info but the size of the particles' data.
No info on the CPU, RAM etc., which would alter the results massively depending on the machine running the stuff.
This is why you can't get a simple answer.
Furthermore you want to disregard factors like an OS, which complicates things too.

But I think the GPU would be the winner in this race, as it can compute lots of particles at once. Modern GPUs are optimized just for those purposes; I can't imagine a CPU beating that.

Also, you disregard the enormous size of the screen needed to display that many particles.
Otherwise it is completely senseless to go for such amounts of particles.


ImaginaryHuman(Posted 2012) [#16]
Definitely a GPU-based particle system will massively outperform anything the CPU can do, even with a quad-core. Using textures to store particle positions etc., with a vertex shader to reposition the geometry, you could easily do `millions` of particles.

The main problem with using the CPU is that there is now a separation between the CPU's main memory and the video RAM used to display the final image. Crossing that boundary is a relatively slow process (10 or more times slower than drawing straight from video RAM, probably 100s of times on a higher-end system). The CPU in old-school software-based blitting could probably do several hundred thousand particles, but getting them displayed is a big bottleneck.

Last edited 2012


GW(Posted 2012) [#17]
Did you try Point Sprites?

http://blitzbasic.com/Community/posts.php?topic=90633#1031313


BinaryBurst(Posted 2012) [#18]
@GW I got, for the first test, 5000 particles at 60 fps, and for the second (the one with the magnet), 8500 particles at 60 fps.

@ImaginaryHuman You seem not to take into account the ---> memory latency <--- which is the actual bottleneck for both CPU and GPU. In my example you can do 30 more calculations per particle and still get the same number of particles. The same goes for the GPU. If I remember right, the memory latency for the CPU is 1 cycle L1, 4 cycles L2, and for the GPU (NVidia GTX 680) it's 20 cycles L1, 162 cycles L2. Sooooo, THE POINT IS that it DOES NOT MATTER (in this type of test) how much processing power you have. It will NEVER be able to process more data than it can fetch.

Last edited 2012


Noobody(Posted 2012) [#19]
When you process a large set of data (>10 Mb) you hit memory latency instead of instruction latency -_-

If I understand your code correctly, you're basically updating a lot of particles in your asm code while writing them to a pixmap as you go along.

Of course the drawing-to-pixmap part will suffer from a massive memory bottleneck - writing a pixel to the pixmap at the particle position results in pretty much random access to the pixmap memory, leading to countless cache misses. Since your pixmap is too big to fit in the lower-level caches, you'll end up constantly evicting cache lines a later iteration of the loop would use, so the CPU will have to take the round trip to main memory a lot - which is slow!
Instruction reordering and instruction-level parallelism on today's CPUs will help reduce the performance impact a little, but performance-wise this is still not a very good solution.

What's really the problem here is that you're doing the actual rendering on the CPU and then drawing the results on screen using DrawPixmap (which is pretty much creating a new texture buffer from scratch, followed by a DMA from CPU to GPU memory - in each frame!). The GPU sits there idle while the CPU does all the work.

What you'd normally do to get a really fast particle system is to create a VBO, fill it with the initial particle positions and then hand it to the GPU. From then on, a vertex shader - either
1. rendering into an FBO-mapped VBO
-or-
2. using transform feedback
- will take care of the particle transformations, feed them into a fragment shader and then process the next frame. The data does not touch the CPU after the initialization; it is always kept in GPU memory, processed from there and fed back into it.
Particle systems are pretty much optimal for good rendering performance, since geometry is predictably uniform and simple, and in your case, does not even require texture lookups, which pose a bit of a bandwidth problem.

You seem to not take into account the ---> memory latency <--- which is the actual bottleneck for both CPU and GPU

You forget that geometry processing on the GPU is much much different from how you'd do software rendering on the CPU.
When you feed a bunch of polygons into the pipeline, the GPU will first read from the index buffer, determine the vertices that should be processed next, assign shader cores to handle them and proceed with the next indices. After the shader cores have finished executing the vertex shaders, the results are fed into a vertex cache, after which the geometry assembly unit will look at the indices and the vertices in cache, bundle them into a renderable primitive and send them to the rasterization stage (which consists of a whole lot of preemptive clipping/culling, hierarchical depth/stencil tests etc. - basically magic). The rasterizer will generate a whole lot of fragments, send them back to the shader cores, have them do the work and then finally write back to memory.

First of all: vertex processing for quads is cache-optimal. Memory access in the VBO is linear, there is no cache pollution, no false eviction, and the only cache misses you get are on the cold cache (i.e. when accessing the first vertex). After shader cores have been assigned, things get even better - each streaming multiprocessor (nvidia terminology, btw) has its own cache, the involved data (vertex attributes) are perfectly local and the results are not written back to memory; they are fed into another cache (fast!). If one vertex has a cache miss, it doesn't matter. All the other hundred cores (or however many your GPU has...) will keep working and process other vertices.
Each stage of the pipeline has its own specialized cache and the GPU will do its best to order accesses to memory to utilize full bandwidth.

Point is: All of this happens in parallel. It doesn't matter if processing a vertex/fragment causes a memory fetch (or a cache read), everything else keeps working. In your CPU example, the loop *will* stall if you end up thrashing your cache due to random accesses to the pixmap (although the effect, as said before, is slightly reduced thanks to instruction reordering and instruction level parallelism).
It's always important to remember that the GPU is optimized for throughput, not latency. If a pixel takes a bit longer to process, it really doesn't matter! As long as you can chuck out the 30-50 Gigapixels/s fillrate, you're alright.

I mean, we're already at realtime fluid simulations with several hundreds of thousands of particles. I'm sure that a few million particles simply moving across the screen in a linear fashion are feasible :)


BinaryBurst(Posted 2012) [#20]

we're already at realtime fluid simulations with several hundreds of thousands of particles



Thank you for detailing the workings of the GPU. Let me know if I got the quote right: "A modern GPU can process around 1000000 fluid particles"
If this is right, then the CPU example can be adapted to that kind of simulation without lowering the number of particles. My CPU, which is pretty darn old, can handle 750,000 realtime fluid particles. Why is that possible?

Well, take my example and add this to the asm code:
                movaps  xmm0, [coor+ecx]	; get the x and y
		paddw   xmm0, [spee+ecx]	; add to x its sx (y its sy)	
		movaps  xmm2, xmm0
;----------->>>>>>>>>> add this		
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                movaps  xmm1, xmm6
                
;------>>>>>>>>>>>
		movaps  xmm1, xmm6
		pmulhuw xmm0, xmm5			; divide x by 100 (y by 100)
		pcmpgtw xmm1, xmm0			; write 0 to x if x>=640 (0 to y if y>=480)
		pand    xmm0, xmm1			
		pmaddwd xmm0, xmm7			; transform from (x,y) to address (x*4+2560*y)
		paddd   xmm0, xmm4
		movaps  [t],  xmm0			

		mov eax,   [cols+ecx]		; get color
		mov edi,   [t+0]			; get address
		mov [edi], eax	

What do you get? The same number of particles.

Last edited 2012


Noobody(Posted 2012) [#21]
Let me know if I get the quote right: "A modern GPU can process around 1000000 fluid particles"

This is quoted correctly, yes (I'm referring to the sort-of-famous SPH demo on the GPU with 128k particles).

If this is right then the CPU example can be adapted to that kind of simulation without lowering the number of particles.

Could you elaborate on how you arrive at this deduction? If this general statement were true, then the entire idea of scientific computing using dedicated GPGPU architectures (such as Tesla) would be completely pointless.

My CPU which is pretty darn old can handle 750000 realtime fluid particles.

Since this is the same number you posted above as the result of your test, I assume we have a misunderstanding. By fluid particles, I don't mean the 2D straight-moving particles we have in this thread, but particles that behave like a fluid (which you might have guessed from the name). Something like this
I wrote a C++ SPH implementation about two years back, ran it multithreaded on 8 cores (on an i7) and got to about 10,000 particles at 10 FPS (rendered as points, mind you), whereas this video shows 128,000 particles with the same simulation scheme (SPH) running at ~30FPS with impressive rendering quality.

There is the difference in experience levels between me and an nvidia dev of course, but I hope you can see the differences in magnitude of computational capability.

What do you get? The same number of particles.

As expected, yes! In my lengthy post I explained why your code was memory-bound and not instruction-bound, and how the GPU circumvents this problem with specialized caches, controlled access patterns and parallelization. Again, this is why I strongly believe a good GPU implementation could render a whole lot more particles than this implementation on the CPU.