Threading Performance (more primes)
BlitzMax Forums/BlitzMax Programming/Threading Performance (more primes)
| ||
Figured its that time again.. So i finally got a good reason (ie, new computer) to mess around a bit with multithreading in blitzmax. The best candidate for mt i had around was the prime sieve i wrote and posted on Nates thread months ago, and heres the resulting code: EDIT: new code posted further down. Last edited 2011 |
| ||
Times on my old system:Athlon 1900 @ 1,6ghz / 512mb ram / ubuntu 10-something test 1: 56,3330002 seconds test 2: 109 seconds / 2 threads New system: phenom 2 x6 @ 3,36ghz / 16gb ram / ubuntu 11.04 test 1: 5,11600018 seconds test 2: 2,17 seconds / 12 threads Last edited 2011 |
| ||
Me too, the speed of the garbage collector would be very important for a non-trivial game. |
| ||
Test 2:DONE! (3.13 s) Asus G73 - i7 740 Not bad for a laptop? :) There seems to be some threading problem in test2. Sometimes it crashes with access violation. |
| ||
@Mahan: Yeah, i didnt put too much effort into the rewrite :p Im not really sure why it happens, since it usually finishes with what seems to be correct results. Also, could you please post results for the first test? :) Last edited 2011 |
| ||
Sure. New timings. Came to think of that I got several IDEs and a VM started. Not optimal for speed tests so i closed them and turned on the laptops "turbo" gaming mode. Test 1: The 100000000th prime is 2038074743. ( total running time: 6.40999985 seconds ) Test 2: DONE! (2.78 s) Asus G73 - i7 740 in Asus Power2Gear turbo mode |
| ||
My experiences with threading: http://blitzmax.com/Community/posts.php?topic=93169#1064888 It seems passing objects to another thread is slow, so clumping data processing is ideal. |
| ||
@Mahan, thanks. I forgot to ask, but if you run that while looking at cpu monitor of choice, does it put full load on all cores? I noticed a pretty bad load of ~60% per core when i run the second sieve. Hopefully thats room for improvement and not the GC interfering.. @Adam, ill take a look on that :) Heres a little something, a benchmark of sorts that does a bunch of fibonacci sequences over and over, increasing the number of active threads by one for each iteration. It goes up to 16, which might take a few minutes. It has the same issue as the sieve, its not really generating much load on each core when the number of threads go up :/ Also it seems to be much faster to have static threads you feed new data instead of generating new threads along with new data. EDIT: Updated the code a bit. Also please take a look on the new sieve code further down :) My results: The number in the brackets is the number of threads activated, along with the percentage increase over the single thread run(s). It runs a bit choppy, and seem to stall occasionally for some reason.. Phenom 2 X6 (ie 6 cores) [1] 100% [2] 186% [3] 271% [4] 263% [5] 261% [6] 341% [7] 318% [8] 315% [9] 323% [10] 321% [11] 359% [12] 373% [13] 401% [14] 395% [15] 397% [16] 394% Last edited 2011 Last edited 2011 Last edited 2011 Last edited 2011 Last edited 2011 |
| ||
Test1: The 100000000th prime is 2038074743. ( total running time: 6.21400023 seconds ) Test2: I get random EAVs. Sometimes it gets to 800, sometimes 1600, sometimes immediately after stating 'STARTING MAIN SIEVE'. It never finishes. I set MaxThreads to 4 as I have a 2xCore. Sony VAIO VGN-FW31M Vista Home SP2. @AdamRedwoods Did you try passing in objects by ref? or passing the whole object in? |
| ||
Yeah sorry about that piece of code. I rewrote the second test and eliminated two possible causes of segfaults. Seems to be stable for me at least now. Speed is about the same, slightly more then twice as fast as the first test. |
| ||
Also, is there some special reason that we cant use (i assume this, it seemed to casue me a lot of trouble) local variables in the threaded code? This hurts performance a lot.. Using vars from an object instance instead of locals in function scope make things over twice as slow for me in a simple test. The second sieve will run in ~18s with just one thread and ~3s with a suitable amount of threads, which is a pretty ideal increase in performance. It bothers me however that the non-mt version does the same thing in ~6 seconds, which the mt version need three threads (with available cores) to match.. Last edited 2011 |
| ||
works every time now.... done! 9.435 seconds. found 100042089 primes. the 100000000th prime is 2038074743. |
| ||
well thats certainly odd :s i assume its the same computer as the times you posted before? |
| ||
Why can't you use local variables? The object reference itself will be stored in a local variable. |
| ||
Well im not sure, but every time i tried to rewrite the thread function to use locals instead of the instance fields i get segfaults without any obvious reason, just wondering if thats an error on my part or not, which i guess it is then. |
| ||
done! 4.546 seconds. found 100042089 primes. the 100000000th prime is 2038074743 (AMD II X4 640 Quad core 3GHz) |
| ||
AMD II X4 640, 4 cores [1] 100% [2] 189% [3] 265% [4] 250% [5] 268% [6] 264% [7] 287% [8] 293% [9] 296% [10] 299% [11] 313% [12] 298% [13] 309% [14] 326% [15] 320% [16] 328% |
| ||
Yep, same computer. Did you not alter that code then? Third test :- |
| ||
No the code is pretty much the same. The only thing i changed except a bug fix was to keep threads more persistent instead of spawning a new thread for every block of data. It didnt seem to affect anyone else :s You could try changing this stuff at the top of the code: Const SEGMENTSIZE:Int=2*3*5*7*11*13*17 Global ArrSkipValues:Int[]=[2,3,5,7,11,13,17] to Const SEGMENTSIZE:Int=2*3*5*7*11*13 Global ArrSkipValues:Int[]=[2,3,5,7,11,13] Other than that i cant think of anything right now :/ |
| ||
Seems to scale nicely:Sieving range 0..2038074744 in 1997 segments. ....... done! 1.570 seconds. found 100042089 primes. the 100000000th prime is 2038074743 |
| ||
@brucey: would you care to post your result on the first sieve and system specs? just out of curiosity :p |
| ||
First results were for 24 threads. Test 1 results : The 100000000th prime is 2038074743. ( total running time: 5.05499983 seconds ) Done! Best I've had for the above is about 5.02. Specs... I think it's 4 x 6 core Intel Xeon X5650 @ 2.67GHz ...and more RAM than you can throw a stick at... but our BlitzMax apps can only see a snippet of that, unfortunately... |
| ||
Ok thanks :p Was guessing a server-geared cpu. Actually what number of threads did you use? Id have expected that to run even faster. |
| ||
24 threads... but there's a lot of other stuff going on at the same time (2 Oracle instances, etc). CPU shows about 1100% on average. |
| ||
Ok, i figured out what i was doing wrong with the threaded function in the second sieve. The actual sieving code in both programs is now identical. This is probably a bit much to ask for, but ill give it a try. If you just want to see some times, then just run both pieces of code (adjust number of threads!) and post the results. If you got some spare time then please read these instructions: INSTRUCTIONS: start with the second sieve and read the comments at the top. When you have found a suitable segment size for your cpu, go to the first sieve and at the start of the TSoE object, there is the same calcuation being done. Modify this so both sieves has the same value. Now run the benchmarks and post some times :) threaded sieve: original sieve: Last edited 2011 Last edited 2011 |
| ||
phenom 2 x6 @ stock 2,8Ghz: original sieve: The 100000000th prime is 2038074743. ( total running time: 6.45400000 seconds ) threaded sieve: segment memory: 240 kb threads: 12 sieving range 0..2038074744 in 4242 segments................. done! 1.197 seconds. found 100005631 primes. the 100000000th prime is 2038074743 Process complete Thats some very nice scaling i must say :) 539% increase in performance. Last edited 2011 |
| ||
Intel Core 2 T6400 original sieve: The 100000000th prime is 2038074743. ( total running time: 9.16199970 seconds ) threaded sieve: segment memory: 240 kb threads: 4 sieving range 0..2038074744 in 4242 segments................. done! 5.585 seconds. found 100005631 primes. the 100000000th prime is 2038074743 164% increase, good enough i guess. |
| ||
T1: The 100000000th prime is 2038074743. ( total running time: 7.46299982 seconds ) T2: segment memory: 180 kb threads: 4 sieving range 0..2038074744 in 5656 segments....................... done! 4.218 seconds. I found that using a value of 6 gave 180kb of cache and was the fastest by .1 sec :p Other values slowed it slightly. It made the cooling fan come on too after several quick successions :D |
| ||
Intel i7-2600, using 8 threads and PRIMEFACTORBASE*16 segment memory: 480 kb threads: 8 sieving range 0..2038074744 in 2121 segments......... done! 1.070 seconds. found 100005631 primes. the 100000000th prime is 2038074743 NOTE: the time before changing 8 to 16 was 1.289 seconds. The original test with same change from 8 to 16: The 100000000th prime is 2038074743. ( total running time: 4.36199999 seconds ) This actually got a little slower! Time before changing 8 to 16 was 4.08400011 seconds. In fact the time dropped to 4.053 when I changed 8 to 4. |
| ||
Sounds very exciting, but for the life of it I can't compile it... compiled and recompiled modules, which have brl.threads amongst them, but I keep getting: Identifier 'Tmutex' not found I wonder, if I somehow missed something in that regard? Any suggestions would make me very happy. Sorry to bother you guys with it, too. WAAhhh...me soooo stupid. Never mind. Got it! Thread Build, yes,yes... sorry about that. As my punishment, here are the numbers I'm getting on my macpro 8core: [1] 100% [2] 199% [3] 296% [4] 383% [5] 382% [6] 386% [7] 356% [8] 399% [9] 425% [10] 441% [11] 431% [12] 490% [13] 482% [14] 435% [15] 441% [16] 462% Last edited 2011 |