Threads and Performance

BlitzMax Forums/BlitzMax Programming/Threads and Performance

Otus(Posted 2009) [#1]
Ok, I took a first look at threads with the thought of using both cores of my Athlon X2 for a math-intensive task. I compiled the same program in threaded mode without actually using threads, and the task took 3x as long.

Am I right in assuming this is due to the other GC?

Thinking that is the case, I added GCSuspend/GCResume around the code. This cut down the time in both cases, but in the threaded build I got an error I haven't seen before: "Too many heap sections"
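For comparison, the same suspend-the-collector-around-a-hot-loop pattern can be sketched in Python's `gc` module (a stand-in for BlitzMax's GCSuspend/GCResume; the workload function here is made up for illustration):

```python
# Rough analogue of GCSuspend()/GCResume(): pause automatic collection
# around an allocation-heavy section, then resume it afterwards.
import gc

def busy_work(n):
    # Allocation-heavy loop: builds and discards many small lists.
    total = 0
    for i in range(n):
        total += sum([i, i + 1, i + 2])
    return total

gc.disable()            # roughly GCSuspend()
result = busy_work(10_000)
gc.enable()             # roughly GCResume()
```

The same caveat applies as in BlitzMax: while collection is suspended, memory allocated in the loop is not reclaimed, so heap usage can grow until the collector is resumed.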

So... Threads should only be used to make the UI responsive?


ImaginaryHuman(Posted 2009) [#2]
Well, I think you're on the right track. The threaded garbage collector is slower than the default one. But right now you're paying for the threaded garbage collector without gaining a second thread doing computations. Have you actually added a second thread? Try spawning two worker threads and letting the main thread yield CPU time (Delay()).

Generally speaking, even if there is some overhead from using threads (and there is, garbage collector or not), you should gain much more computation time than you lose. Also, are you compiling in debug mode? That'll be a lot slower than release mode.
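The suggested structure can be sketched like this (Python here, not BlitzMax, and the function names are made up; note that pure-math Python threads won't actually run in parallel because of the interpreter lock, so this only shows the shape of the split):

```python
# Split a math-heavy task across two worker threads while the main
# thread just waits for them -- the structure needed to actually use
# a second core, rather than only switching on the threaded build.
import threading

def partial_sum_squares(lo, hi, out, idx):
    # Each worker computes the sum of squares over its own sub-range.
    out[idx] = sum(i * i for i in range(lo, hi))

def threaded_sum_squares(n, workers=2):
    step = n // workers
    results = [0] * workers
    threads = []
    for w in range(workers):
        lo = w * step
        hi = n if w == workers - 1 else (w + 1) * step
        t = threading.Thread(target=partial_sum_squares,
                             args=(lo, hi, results, w))
        threads.append(t)
        t.start()
    for t in threads:   # main thread just waits, like Delay() in a loop
        t.join()
    return sum(results)
```

In BlitzMax the equivalent would be CreateThread for each worker and WaitThread from the main thread.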


xlsior(Posted 2009) [#3]
I'm definitely seeing a performance difference too on my computer:

My current program does a bunch of parsing on a large text file.
Doing a bunch of operations on an array of 10,000 strings takes 0.8 seconds in the default build, and if I switch over to the 'threaded build' (no code changes, just the other garbage collector) the same code takes 6.5 seconds to complete.

I guess it just goes to show that you need to be selective about whether or not it makes sense to go threaded. Splitting the code across two CPU cores doesn't make much sense if it still takes four times as long to complete.


ziggy(Posted 2009) [#4]
Code designed to run on a single thread is usually slower when compiled in threaded mode. When the difference in time is this big, it usually means the code has to be redesigned, or optimized. Obviously, if you write a threaded application from the beginning, you won't get these issues.

For instance, a decent multi-threaded parser would implement a string stream (instead of working with 10,000 separate strings!) to do the tokenizing. This string stream could be read from the same base string by different threads at the same time without any problems. That could make the parsing faster than a single brute-force thread reading byte by byte.
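A minimal Python stand-in for that idea (the helper names are invented): several threads reading disjoint slices of one shared, immutable string at the same time. Read-only access needs no locking; each thread counts the separator character in its own slice and the totals are combined afterwards.

```python
# Multiple threads scanning disjoint slices of one shared string.
import threading

def count_in_slice(text, lo, hi, ch, out, idx):
    # str.count(ch, lo, hi) only inspects text[lo:hi], so each worker
    # stays inside its own slice of the shared string.
    out[idx] = text.count(ch, lo, hi)

def parallel_count(text, ch, workers=4):
    step = max(1, len(text) // workers)
    results = [0] * workers
    threads = []
    for w in range(workers):
        lo = w * step
        hi = len(text) if w == workers - 1 else (w + 1) * step
        t = threading.Thread(target=count_in_slice,
                             args=(text, lo, hi, ch, results, w))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Counting single characters avoids the boundary problem; a real tokenizer would also have to handle tokens that straddle a slice boundary.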

I mean, making a threaded application is not just a matter of flipping a compiler switch; it means designing the application around multithread-friendly algorithms. If you're not going to use that kind of algorithm, you can expect all kinds of drawbacks and performance loss.


ImaginaryHuman(Posted 2009) [#5]
Yeah, you also need to consider things like cache behaviour: if two threads are competing for memory bandwidth and accessing the string data fairly randomly, that's going to be less efficient than one thread walking through contiguous memory.

I did personally notice some general slowdown when using a threaded version of an app. One thing to consider is that although you may now be using two or more CPU cores, they still have to *share* the same memory bus, so you aren't doubling your memory performance. You could throw 16 cores at it, but if they're all accessing the same memory, each thread will be held up by memory accesses. For operations that entail a lot of memory access, like working with strings, memory becomes the bottleneck and the CPUs spend time stalled waiting for the bus.

This is partly why the Cell processor gives local memory to each individual cpu core so they don't have to share bus accesses.


Otus(Posted 2009) [#6]
ziggy:
Code designed to run on a single thread is usually slower when compiled in threaded mode. When the difference in time is this big, it usually means the code has to be redesigned, or optimized. Obviously, if you write a threaded application from the beginning, you won't get these issues.


The thing is, I didn't bother trying to make it threaded. Even if I had managed to halve the time by using two threads (unlikely), it would still have been slower than the unthreaded build.

Note that my code was almost pure math and used one CPU almost 100%. For something like the parser you mention, there could be additional benefits from backgrounding memory access.

ImaginaryHuman:
You could throw 16 cores at it, but if they're all accessing the same memory, each thread will be held up by memory accesses. For operations that entail a lot of memory access, like working with strings, memory becomes the bottleneck and the CPUs spend time stalled waiting for the bus.


Isn't memory access rather fast these days? With DDR3 becoming popular, I don't think it will be such a huge bottleneck compared to using a single CPU core. CPU clock speeds probably won't increase much in the near future; more cores will be added instead. Also, modern multi-core chips have dedicated caches for each core, and the cache sizes keep growing.


ImaginaryHuman(Posted 2009) [#7]
It doesn't matter how fast the memory is: if both threads share the same memory and both are running memory-intensive tasks, they have to cooperate and share the bus, which means each gets less time, possibly only 50% as much as it would like. Even super-fast RAM has to split its bandwidth between two cores accessing it at the same time.


Otus(Posted 2009) [#8]
It doesn't matter how fast the memory is: if both threads share the same memory and both are running memory-intensive tasks, they have to cooperate and share the bus, which means each gets less time, possibly only 50% as much as it would like. Even super-fast RAM has to split its bandwidth between two cores accessing it at the same time.


Yes, but that's only an issue if the program is limited by memory access speed. With faster memory and a faster bus, that is less likely.


skn3(Posted 2009) [#9]
Unless of course you have dual-channel memory, as that would (supposedly) allow double the memory throughput!

Have you tried designing a "job" assignment thread setup? E.g.:

The main thread (or a new thread designated as "the boss") assigns "chunks" of data as individual jobs. Worker threads can request a job and take the chunk of memory associated with it. Each thread is then free to perform calculations on its job and return the results to the boss thread.

This would allow you to spawn multiple threads working on your mathematical problem.
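A minimal sketch of this boss/worker pattern (Python for illustration, not BlitzMax; the function names are invented): workers pull (index, chunk) jobs from a queue, process them, and push results back for the boss to collect.

```python
# Boss thread hands out chunk jobs via a queue; workers return results.
import queue
import threading

def worker(jobs, results):
    while True:
        job = jobs.get()
        if job is None:                  # sentinel: no more work
            break
        idx, chunk = job
        results.put((idx, sum(chunk)))   # the "calculation" on this chunk

def run_boss(data, chunk_size=4, workers=2):
    jobs, results = queue.Queue(), queue.Queue()
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for job in enumerate(chunks):        # boss assigns every chunk as a job
        jobs.put(job)
    for _ in threads:                    # one shutdown sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    partials = dict(results.get() for _ in chunks)
    return sum(partials[i] for i in range(len(chunks)))
```

The queue does the load balancing: a fast worker simply pulls more chunks, so no thread sits idle while work remains.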

As everyone else has said, you can't just expect a speed increase by moving some algorithm onto a separate thread. That's not what threading is for. If, however, you want to run these mathematical problems alongside, say, a graphical user interface or visual display, then threading can make a big difference.

Naturally you can expect some speed decrease, since the system has to do a whole lot more. I wouldn't have expected such a drastic one, though.


ImaginaryHuman(Posted 2009) [#10]
One thing I also noticed when I implemented a thread pool is that I got less performance than expected. What I didn't factor in was how the OS redistributes system threads. If I run a single-threaded app, it hogs all of the time on the core it runs on, and all other threads seem to get shifted to the other core (which has time to spare). So I measure the time it takes and figure I could double that by going multithreaded, but that's not true: the system threads and other threads still have to get their time. Running a second thread in the pool ended up adding more like 60-70% extra CPU time rather than a 100% speedup. That said, I expect that with three or more cores you'd start to see a bigger speedup for each core you add, since you aren't adding more system threads.


Space_guy(Posted 2009) [#11]
I pulled my hair out when I tried to make a game engine threaded. I just couldn't make it as smooth as it was with the non-threaded GC.
It seems no matter how I set up the garbage collection, it just stayed jerky: for example, every 2 seconds the FPS would drop to around 20 from an otherwise steady 60. All I could do was change a bit how long it took between these frame skips.

In the end I just decided it wasn't worth putting more time into until improvements are made to the GC or someone else figures out how to use it better.