OpenGL Optimization

BlitzMax Forums/BlitzMax Programming/OpenGL Optimization

Tachyon(Posted 2007) [#1]
Greetings-

I was just contacted by a guy who profiled my game using OpenGL profiler, and he claims that I am hugely unoptimized with my rendering pipeline. The thing is, I am of course not communicating directly with the OpenGL layer so it's not really "my" pipeline- I load sprites (.png pixmaps) then DrawImage them to the screen in my UpdateScreen() function every loop, then flip it just like everyone else does. So, according to him, this is obscenely slow. Here is his message:

I have simply profiled your engine using OpenGL profiler, and I am a bit horrified. At first glance, it is pretty obvious why everything is going so slow. You are using immediate mode to render every object in the game... i.e. glBegin...glEnd blocks of code. This is incredibly slow, and it is the obvious problem.

You need to optimise the rendering path of your code... using vertex arrays. glBegin...glEnd style code is about 1000 times slower than an appropriate vertex array. If you don't believe me, benchmark rendering 100,000 triangles with immediate mode and then with retained mode.

The kind of configuration the above user has should be enough to render a few hundred thousand fully textured and blended triangles, so actually your explanation is not really acceptable.

Even just using a display list would be advisable as a bandage for this gushing wound...

Don't think of this as a problem, think of it as a task to increase the size of your customer base.

Immediate mode vs Retained mode:
http://www.cs.utk.edu/~huangj/CS594S06/oglPerfGraphicsArc.ppt

Vertex Arrays:
http://www.opengl.org/documentation/specs/version1.1/glspec1.1/node21.html

Every frame you are making approximately 15,000 OpenGL calls.

Sample of the trace:
0.05 µs glColor4ubv({0, 40, 0, 255});
0.27 µs glBegin(GL_QUADS);
0.05 µs glTexCoord2f(0, 0);
0.05 µs glVertex2f(442, 533);
0.05 µs glTexCoord2f(0.8125, 0);
0.05 µs glVertex2f(494, 533);
0.05 µs glTexCoord2f(0.8125, 0.8125);
0.05 µs glVertex2f(494, 559);
0.05 µs glTexCoord2f(0, 0.8125);
0.05 µs glVertex2f(442, 559);
0.33 µs glEnd();
0.05 µs glColor4ubv({0, 40, 0, 255});
0.22 µs glBegin(GL_QUADS);
0.05 µs glTexCoord2f(0, 0);
0.05 µs glVertex2f(494, 533);
0.05 µs glTexCoord2f(0.8125, 0);
0.05 µs glVertex2f(546, 533);
0.00 µs glTexCoord2f(0.8125, 0.8125);
0.05 µs glVertex2f(546, 559);
0.00 µs glTexCoord2f(0, 0.8125);
0.05 µs glVertex2f(494, 559);
0.22 µs glEnd();
0.05 µs glColor4ubv({0, 40, 0, 255});
2.17 µs glBindTexture(GL_TEXTURE_2D, 83);
54.31 µs glBegin(GL_QUADS);
0.22 µs glTexCoord2f(0, 0);
0.22 µs glVertex2f(546, 533);
0.00 µs glTexCoord2f(0.8125, 0);
0.05 µs glVertex2f(598, 533);
0.00 µs glTexCoord2f(0.8125, 0.8125);
0.05 µs glVertex2f(598, 559);
0.05 µs glTexCoord2f(0, 0.8125);
0.05 µs glVertex2f(546, 559);
0.76 µs glEnd();
0.16 µs glColor4ubv({0, 40, 0, 255});
0.27 µs glBegin(GL_QUADS);
0.05 µs glTexCoord2f(0, 0);
0.05 µs glVertex2f(598, 533);
0.05 µs glTexCoord2f(0.8125, 0);
0.05 µs glVertex2f(650, 533);
0.05 µs glTexCoord2f(0.8125, 0.8125);
0.05 µs glVertex2f(650, 559);
0.05 µs glTexCoord2f(0, 0.8125);
0.05 µs glVertex2f(598, 559);
0.22 µs glEnd();

You could probably replace this with about 20-30 calls using deferred rendering APIs... it would be soooo much faster. Even changing to 16-bit color shouldn't be needed - actually I was surprised to see this. It won't make much difference on modern hardware, as it is all geared for 32-bit data anyway.
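
For a sense of scale, the trace above can be turned into rough numbers. The sketch below (Python, purely illustrative) counts the calls issued per quad in the sample and combines that with the 15,000-call figure from the message. The timings are the profiler's per-call figures for the first quad only, so the per-frame total is an estimate of CPU-side call overhead under the assumption that every quad costs about the same. It is not a measurement, and it ignores outliers such as the 54.31 µs glBegin that follows the glBindTexture call.

```python
# Calls issued per quad in the sample trace:
# glColor4ubv, glBegin, four glTexCoord2f/glVertex2f pairs, glEnd.
CALLS_PER_QUAD = 1 + 1 + 4 * 2 + 1

# Per-call timings (in microseconds) of the first quad in the trace, in order.
first_quad_us = [0.05, 0.27, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.33]

calls_per_frame = 15_000  # figure quoted in the message
quads_per_frame = calls_per_frame // CALLS_PER_QUAD

# Assumption: every quad costs about as much as the first one.
quad_cost_us = sum(first_quad_us)
frame_cost_ms = quads_per_frame * quad_cost_us / 1000

print(f"calls per quad:  {CALLS_PER_QUAD}")
print(f"quads per frame: {quads_per_frame}")
print(f"call overhead:   {frame_cost_ms:.2f} ms per frame (rough estimate)")
```

Under these assumptions the frame contains roughly 1,360 quads, and the immediate-mode calls themselves account for about a millisecond and a half of CPU time per frame. The driver and GPU cost behind those calls is not visible in the per-call timings.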


My question would be- is there anything I could or should be doing to improve this? Is this a BRL issue, or even an issue at all?


Dreamora(Posted 2007) [#2]
It's no issue at all.
You are not creating a high-end 3D game, so a few microseconds make no difference.

There have been attempts at and benchmarks of glBegin/glEnd against batched rendering, and the result is that it isn't faster, due to the nature of the 2D single plane of BM.

If you really wanted to optimize it, you would need to write your own 2D driver which has true depth (z value); then it is possible to get away with vertex arrays.


If I didn't know better, I would say a Unix geek contacted you, as nobody else would have nothing better to do than profile other users' apps ;-)


ImaginaryHuman(Posted 2007) [#3]
I think immediate mode is slow when you get past a certain threshold. I found that you need to be drawing about 20 quads in immediate mode before a display list becomes a faster method - mainly to do with the overhead from function calls.

I have heard of increases of about 2-3 times throughput from switching to a vertex array, but the idea of it being thousands of times faster is totally impossible.


Snixx(Posted 2007) [#4]
BlitzMax by default uses a really bad way of "drawing": it basically just throws each "primitive" at the card one by one.

I'm using a 2D system I made for C++ that uses a BlitzMax-like syntax, but I wrote a new renderer (D3D9 and GL drivers) that gives a huge increase (3x+ on my hardware) using batching (primitive and texture batching). This is how BMax should have been done in the first place.


Fetze(Posted 2007) [#5]
Batching requires sorting by texture or by primitive type - if BlitzMax had to do this for you and still keep the order of your draw commands resulting in the same order on screen it'd have a lot more to do. Batching is, I think, something that shouldn't be implemented *generally* - but that should be possible to implement in certain situations.

DrawTexturedPolygon with an optional parameter for the primitive mode would do the job - you could do manual batching on both textures and primitives: create your own "software vertex array", pass it to the function, and it will simply set your texture, glBegin your primitive type and submit all the vertices. This would be very useful for implementing single-surface particle systems, for example.
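
One way to batch without disturbing the on-screen order is to merge only neighbouring draws that share a texture and primitive type. The following is a minimal sketch of that idea in Python, not BlitzMax or Max2D code: the draw list, the `quad` helper, and texture id 82 are made up for illustration (83 is the id seen in the trace), and no real GL calls are issued.

```python
from itertools import groupby

def quad(x, y, w=52, h=26):
    """Return the four corners of an axis-aligned quad (hypothetical helper)."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]

# Each draw is (texture_id, primitive, vertices), in submission order.
draws = [
    (82, "GL_QUADS", quad(442, 533)),
    (82, "GL_QUADS", quad(494, 533)),
    (83, "GL_QUADS", quad(546, 533)),
    (83, "GL_QUADS", quad(598, 533)),
    (82, "GL_QUADS", quad(650, 533)),
]

def batch_consecutive(draws):
    """Merge runs of consecutive draws that share texture and primitive.

    Only neighbours are merged, so the drawing order is unchanged.
    """
    batches = []
    for (texture, primitive), run in groupby(draws, key=lambda d: (d[0], d[1])):
        vertices = [v for _, _, verts in run for v in verts]
        batches.append((texture, primitive, vertices))
    return batches

batches = batch_consecutive(draws)
for texture, primitive, vertices in batches:
    print(texture, primitive, len(vertices), "vertices")
```

Here five draws collapse into three batches. The last quad stays in its own batch because merging it into the first would move it behind the texture-83 quads in the drawing order.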


Dreamora(Posted 2007) [#6]
Funny thing is, that does not make a difference.

I've created an extended driver, as I assumed it would make a difference as well if you could "lock" the drawing texture and just add geometry to one glBegin call.

But it didn't.
Even on my crappy GMA900 the difference was only 2 FPS ...

Do not ask me why it does not make a real difference. Perhaps I did something wrong, although I'm quite sure that I am capable of looping through an array and calling the correct glBegin/glEnd stuff.

That's why I came to the conclusion that it only makes sense if we introduced real depth and could just create "1 mesh" per texture and use that one ...


Jake L.(Posted 2007) [#7]
Dreamora is right, I did those tests as well and found no real difference, with a few exceptions:

Keeping the surface count (images) low and drawing only portions of a texture instead (by messing with the UVs) DOES make a difference.
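
Drawing a portion of a texture comes down to computing texture coordinates for a sub-rectangle. A minimal sketch in Python follows; `area_uvs` is a hypothetical helper, not a Max2D function. The 64x32 texture size in the first example is an assumption, chosen because a 52x26 image padded to power-of-two dimensions would give the 0.8125 coordinates seen in the trace earlier in the thread. The 256x256 atlas is likewise made up.

```python
def area_uvs(x, y, w, h, tex_w, tex_h):
    """Texture coordinates for a pixel rectangle inside a texture.

    Corners are returned in the same order as the quads in the trace:
    top-left, top-right, bottom-right, bottom-left.
    """
    u0, v0 = x / tex_w, y / tex_h
    u1, v1 = (x + w) / tex_w, (y + h) / tex_h
    return [(u0, v0), (u1, v0), (u1, v1), (u0, v1)]

# A 52x26 image stored in an (assumed) 64x32 texture.
whole_image = area_uvs(0, 0, 52, 26, 64, 32)

# One 64x64 cell, at column 2 and row 1, of a hypothetical 256x256 atlas.
atlas_cell = area_uvs(2 * 64, 1 * 64, 64, 64, 256, 256)

print(whole_image)
print(atlas_cell)
```

With many images packed into one atlas, consecutive draws differ only in their coordinates, so no texture switch is needed between them.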

Also, vertex buffers are faster if you don't update them very often, say for backdrops or static text.


Dreamora(Posted 2007) [#8]
Yup, the "DrawImageArea" approach is far superior to the approach of using vertex arrays and the like for dynamic things.

The only real usage of the "glBegin - glEnd" would be when you use a tilemap system combined with the DrawImageArea approach. In that case it would seriously boost the whole thing, I'm sure (make the vertex data visible tiles + 2 on all borders and you can keep it for quite some time)
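
The "visible tiles + 2 on all borders" idea can be sketched as follows. This is an illustrative Python sketch, not Max2D code: `tile_quads`, the 32-pixel tile size, and the 25x19 visible area are assumptions made for the example.

```python
def tile_quads(first_col, first_row, cols, rows, tile, border=2):
    """Build quad corners for the visible tiles plus a border on every side."""
    quads = []
    for row in range(first_row - border, first_row + rows + border):
        for col in range(first_col - border, first_col + cols + border):
            x, y = col * tile, row * tile
            quads.append(((x, y), (x + tile, y), (x + tile, y + tile), (x, y + tile)))
    return quads

TILE = 32           # assumed tile size in pixels
COLS, ROWS = 25, 19 # assumed number of visible tiles

quads = tile_quads(0, 0, COLS, ROWS, TILE)

# The same vertex data stays usable until the view has scrolled
# further than the border in some direction.
reuse_distance_px = 2 * TILE

print(len(quads), "quads, reusable for up to", reuse_distance_px, "pixels of scrolling")
```

The vertex data only has to be rebuilt when the view scrolls past the border, which is what lets it be kept "for quite some time".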

But not for those things where single surface / batching normally is used like particle systems etc


FlameDuck(Posted 2007) [#9]
"he claims that I am hugely unoptimized with my rendering pipeline."
He is correct.

"if BlitzMax had to do this for you and still keep the order of your draw commands resulting in the same order on screen it'd have a lot more to do."
Not really. You just need an insertion-ordered map (like the LinkedHashMap in Java).
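
As a rough illustration of the insertion-ordered map idea, here is a Python sketch; a Python dict keeps keys in insertion order, much like Java's LinkedHashMap. The draw list and texture ids are made up, and this is only one reading of the suggestion.

```python
def group_by_texture(draws):
    """Collect quads per texture, keeping textures in first-seen order."""
    groups = {}
    for texture, quad_name in draws:
        groups.setdefault(texture, []).append(quad_name)
    return groups

# (texture_id, quad) pairs in submission order; names are placeholders.
draws = [(82, "a"), (83, "b"), (82, "c"), (83, "d"), (84, "e")]

groups = group_by_texture(draws)
for texture, quads in groups.items():
    print(texture, quads)
```

Each texture is then bound once, in the order it was first used. Note that quad "c" now ends up before "b", so interleaved draws on different textures are reordered, which matters only where those quads overlap on screen.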

"Even on my crappy GMA900 the difference was only 2 FPS ..."
That's odd, because on my crappy GeForce 6800 Go I got more than a factor of 2 improvement on a similar test.


N(Posted 2007) [#10]
Your rendering pipeline is unoptimized- it could be worse, but in order for that to be true you would have to intentionally make it worse. You're pretty much at the bottom of the ladder here in performance. This is not necessarily your fault, however, seeing as how you didn't write the 2D code (I assume you didn't, anyway, since I'm doubtful that you wrote your own 2D renderer or Max2D implementation).

BRL are the ones who wrote the 2D code for BMax, and for some reason they neglected to think about any form of optimization. Writing a decent and basic 2D renderer (right now you just have basic, to point out the difference) is not difficult, and that BRL hasn't done it is really just laziness on their part, not because it's troublesome.


Chroma(Posted 2007) [#11]
Bye Noel. /wave


Dreamora(Posted 2007) [#12]
Noel: I'm not sure if it really was laziness and not just the attempt to make sure it's truly OGL 1.2 compliant and works even on the OGL 1.1 emulation on XP (which it in fact does!)

Optimizing it further would only have made sense if it added features that really benefited from it. But that would mean that things like render to texture, tilemaps, chunked textures, etc. would have been needed to really make use of it (similar to TGB).

Not while it is a pure Blitz 2D wrapper.


ImaginaryHuman(Posted 2007) [#13]
They could have done better even with GL 1.1 compatibility, for example by at least tracking textures and state, batching rendering, and using vertex arrays.
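
Tracking textures and state can be as simple as remembering the last value set and skipping the call when nothing changes. The sketch below is Python and only records which calls would be issued; `StateCache` is a hypothetical class, texture id 82 is made up, and the approach assumes nothing else changes GL state behind the cache's back.

```python
class StateCache:
    """Skip texture and color changes that would not alter the current state."""

    def __init__(self):
        self.texture = None
        self.color = None
        self.issued = []  # stand-in for real GL calls

    def bind_texture(self, texture):
        if texture != self.texture:
            self.issued.append(("glBindTexture", texture))
            self.texture = texture

    def set_color(self, rgba):
        if rgba != self.color:
            self.issued.append(("glColor4ubv", rgba))
            self.color = rgba

cache = StateCache()
requested = 0
# Four quads with the same color, as in the trace, on two textures.
for texture in (82, 82, 83, 83):
    cache.set_color((0, 40, 0, 255))
    cache.bind_texture(texture)
    requested += 2

print(requested, "state changes requested,", len(cache.issued), "issued")
```

In the trace earlier in the thread, glColor4ubv is called with the same value before every quad, which is the kind of redundant call this filters out.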


N(Posted 2007) [#14]
"Noel: I'm not sure if it really was laziness and not just the attempt to make sure it's truly OGL 1.2 compliant and works even on the OGL 1.1 emulation on XP (which it in fact does!)

"Optimizing it further would only have made sense if it added features that really benefited from it. But that would mean that things like render to texture, tilemaps, chunked textures, etc. would have been needed to really make use of it (similar to TGB).

"Not while it is a pure Blitz 2D wrapper."
Optimizing it further and maintaining even 1.1 compatibility would be incredibly easy. This is laziness; don't mistake it for something akin to a wise design decision that benefits you.