How's MT GC speed doing?

BlitzMax Forums/BlitzMax Programming/How's MT GC speed doing?

JoshK(Posted 2009) [#1]
Does anyone have some benchmarks for recent versions of BMX for MT GC vs. single-threaded GC speed? I understand the MT GC has gotten faster and I am curious to see how they compare.


xlsior(Posted 2009) [#2]
I don't have any benchmark numbers handy, but I can tell you that I definitely saw a huge speedup compared to the old MT GC.

The initial MT release was 10-20 times slower than single-threaded in my app; the current one is maybe one and a half times slower.
With the original MT GC, any speed increase from splitting into multiple threads was completely offset by the extra GC overhead -- that no longer appears to be the case.


ImaginaryHuman(Posted 2009) [#3]
I just did some threaded stuff with 4 threads, and the garbage collector seems to have much less impact on speed; I saw a better speedup from using threads. I am also no longer seeing the glitchy framerate stuttering that the old GC used to cause.


JoshK(Posted 2009) [#4]
BlitzMax 1.36
This is just running the default project that the engine's project wizard creates.

MT=336 FPS
ST=356 FPS

So in this test, MT mode was 94% of the speed of ST. Of course, I don't know whether that percentage will stay the same in a more complex environment.


JoshK(Posted 2009) [#5]
When I try this code:


MT=160
ST=200

In this case, MT is 80% as fast. That's not very encouraging. You might be able to offload some parallel problems onto other cores, but it's going to be a struggle if all your code automatically drops to 80% speed.


Azathoth(Posted 2009) [#6]
In this case, MT is 80% as fast. That's not very encouraging. You might be able to offload some parallel problems onto other cores, but it's going to be a struggle if all your code automatically drops to 80% speed.
If you haven't changed your code to take advantage of multiple threads, you are not going to see a speed increase. This isn't just a BlitzMax thing.


Brucey(Posted 2009) [#7]
If you haven't changed your code to take advantage of multiple threads, you are not going to see a speed increase.

I was going to mention that too, but it seemed rather obvious.


Tommo(Posted 2009) [#8]

If you haven't changed your code to take advantage of multiple threads, you are not going to see a speed increase. This isn't just a BlitzMax thing.


It's not about a speed increase, but a speed loss with the same source code.

I think there is something more Mark can do to improve his MT GC, although it's actually fast enough now.


ImaginaryHuman(Posted 2009) [#9]
The original 94% quoted is not correct because fps rates do not decrease linearly. ;-D .. just teasing.

A 20% drop does sound a bit alarming; 90-100% would be good. I would be happy with that, provided that by using other threads to do extra work I gain much more overall than 100% of the single-threaded speed. I found that with 4 threads on a 4-core machine a threaded routine was at least 2-3 times faster, and on a dual-core it was almost twice as fast.


Azathoth(Posted 2009) [#10]
It's not about speed increase, but speed loss with same source code.
Of course that's going to happen if you're using the thread-safe GC with code that's designed to run as one thread; it's doing redundant checks.


JoshK(Posted 2009) [#11]
If you haven't changed your code to take advantage of multiple threads, you are not going to see a speed increase. This isn't just a BlitzMax thing.

Well obviously, but when parallelizing code, you aren't just fighting the part of the code you are making parallel. You are fighting a 20% drop in speed across all code.

Let's say parallel programming results in 30% of the program loop being twice as fast. But there is that 20% reduction in speed across the board.

30 * 0.8 * 2.0 = 48

Now the remaining 70% of the code is 20% slower:

70 * 0.8 = 56

Add them together:

56 + 48 = 104

The net result is only a 4% increase in speed. So in this case, all your effort was basically wasted.


H&K(Posted 2009) [#12]
First, you have the maths wrong. (Not important, just FYI.)

70 * 1.2 = 84 -- what took 70 units of time now takes 84
30 * 0.5 * 1.2 = 18 -- what took 30 units of time now takes 18

84 + 18 = 102 percent of the original TIME, meaning 2% SLOWER.

My main question is: are you running this on a single-core machine?
If we can assume you are not, then we need much better speeds. If it is a single core, then nothing can be taken from your times.
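The time-based arithmetic above generalizes to a one-line formula (essentially Amdahl's law with a uniform overhead factor). A minimal Python sketch, using the thread's own example numbers (30% of the loop made 2x faster, 20% across-the-board GC penalty); the function name is illustrative:

```python
def relative_time(parallel_fraction, parallel_speedup, overhead):
    """Total run time relative to the original single-threaded build.

    overhead is the uniform slowdown factor from the thread-safe GC:
    1.2 means every part of the code takes 20% longer.
    """
    serial_part = (1.0 - parallel_fraction) * overhead
    parallel_part = (parallel_fraction / parallel_speedup) * overhead
    return serial_part + parallel_part

# The numbers from this thread: 30% of the loop made twice as fast,
# with a 20% GC penalty on everything.
print(round(relative_time(0.30, 2.0, 1.2), 4))  # slightly over 1.0, i.e. a net slowdown
```

With no GC penalty (overhead = 1.0) the same 30%/2x split would give 0.85, a 15% time saving, which is why the across-the-board overhead matters so much.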


JoshK(Posted 2009) [#13]
I have a quad core, but my app runs on all configurations.

The point is that even with a quad core, it is questionable whether any savings can be achieved in overall program speed. Realistically, I think we might be looking at a 10-15% improvement in the best-case scenario.


ImaginaryHuman(Posted 2009) [#14]
That's not true. Like I said, in my little app I have two pieces of CPU-intensive code: a large particle simulation and a fluid dynamics simulation. I do both in 4 threads. Together the two routines do maybe 80% of the app's work; the rest is graphics rendering. On a dual core I now see it using around 70% of the total CPU time, whereas it used to use only 50%. On a quad core it uses up to about 60% of the total CPU time. The end result is that the app runs considerably smoother -- I'd say somewhere between 170% and 250% faster.

We also have to remember that a) memory accesses can stall all of the CPUs because there is only so much bus bandwidth, b) memory accesses all have to share the same bus, so even though you can process `in parallel` once the data is in the CPU/cache, if it's not there is a fight for resources, and c) having 4 times as many cores doesn't give you a 400% speedup because the rest of the architecture is not 4 times faster or 4 times `as parallel`.


JoshK(Posted 2009) [#15]
That's good to hear.


Otus(Posted 2009) [#16]
We also have to remember that a) memory accesses can stall all of the CPUs because there is only so much bus bandwidth, b) memory accesses all have to share the same bus, so even though you can process `in parallel` once the data is in the CPU/cache, if it's not there is a fight for resources

On the other hand, a program that is limited by memory latency (as opposed to bandwidth) could scale even better by running two threads per core. With even DDR2 bandwidth well into the GB/s range, few programs need all the bandwidth. I have a program which I think is memory-latency limited (it accesses hundreds of MB in a nonlinear fashion), but I haven't gotten around to multi-threading it yet. However, since I now have a quad-core, it is on my todo list. I hope the new GC will work well.

having 4 times as many cores doesn't give you 400% speedup because the rest of the architecture is not 4 times faster or 4 times `as parallel`.

Yes, and a CPU with 4 times the frequency won't give a 400% speedup either (other things being constant). Realistically, one shouldn't expect multi-threaded performance to be a multiple of the old performance in most cases, but it's encouraging that it is in some situations.


ImaginaryHuman(Posted 2009) [#17]
I also think that if you go multithreaded you may want to take another look at optimizing your code to reduce memory accesses: use as many local variables as you can, design the algorithms/structures so that only a small subset of locals is needed at a time, and try pipelining some of your commands so that they do stuff with locals in between doing stuff with globals/memory variables. I think that helped in my case.

I also saw some forums elsewhere saying that it helps if all the threads run on the same data so that the caches get shared (I do my fluid simulation with an interleaved skip through the data by each thread), and that it may also help to make each thread's data accesses independent of the other threads' data (I do my particle thing that way: each thread has its own set of particles and state, completely separate from the other threads).

I would also try to avoid needing mutexes or any of the other OS thread API commands, as using them a lot will generally slow things down. I don't use any mutexes whatsoever in my app; having said that, it is not thread safe and there are some errors, but within the context of what I'm using it for you don't really notice the effects.

And finally, the more threads you have, the more time the OS has to spend on context switches, and threads use more memory for stacks and such, so more threads is not necessarily more efficient.
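The two work-splitting layouts mentioned above (an interleaved skip through shared data vs. each thread owning its own contiguous chunk) can be sketched as index partitions. A Python sketch, not BlitzMax; the function names are illustrative:

```python
def interleaved(n_items, n_threads, tid):
    """Thread tid takes every n_threads-th item, starting at its own offset.
    Suits threads that walk the same shared data (like the fluid simulation),
    since neighbouring items stay hot in the shared cache."""
    return list(range(tid, n_items, n_threads))

def blocked(n_items, n_threads, tid):
    """Thread tid takes one contiguous chunk it owns outright.
    Suits fully independent per-thread state (like each thread having
    its own set of particles), since threads never touch each other's data."""
    chunk = (n_items + n_threads - 1) // n_threads  # ceiling division
    start = tid * chunk
    return list(range(start, min(start + chunk, n_items)))

# Example: 10 items split across 4 threads.
for tid in range(4):
    print(tid, interleaved(10, 4, tid), blocked(10, 4, tid))
```

Either scheme covers every item exactly once with no locking, which is what lets the approach above avoid mutexes entirely.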