Threading Performance (more primes)

BlitzMax Forums/BlitzMax Programming/Threading Performance (more primes)

zzz(Posted 2011) [#1]
Figured it's that time again.. So I finally got a good reason (i.e. a new computer) to mess around a bit with multithreading in BlitzMax. The best candidate for MT I had around was the prime sieve I wrote and posted on Nate's thread months ago, and here's the resulting code:

EDIT: new code posted further down.

Last edited 2011


zzz(Posted 2011) [#2]
Times on my old system:
Athlon 1900 @ 1.6GHz / 512MB RAM / Ubuntu 10-something

test 1: 56.3330002 seconds
test 2: 109 seconds / 2 threads


New system:
Phenom 2 X6 @ 3.36GHz / 16GB RAM / Ubuntu 11.04

test 1: 5.11600018 seconds
test 2: 2.17 seconds / 12 threads


Last edited 2011


Czar Flavius(Posted 2011) [#3]
Me too, the speed of the garbage collector would be very important for a non-trivial game.


Mahan(Posted 2011) [#4]
Test 2:
  DONE!						(3.13 s)


Asus G73 - i7 740

Not bad for a laptop? :)

There seems to be some threading problem in test 2. Sometimes it crashes with an access violation.


zzz(Posted 2011) [#5]
@Mahan: Yeah, I didn't put too much effort into the rewrite :p I'm not really sure why it happens, since it usually finishes with what seem to be correct results.

Also, could you please post results for the first test? :)

Last edited 2011


Mahan(Posted 2011) [#6]
Sure.

New timings. I realized I had several IDEs and a VM running, which is not optimal for speed tests, so I closed them and turned on the laptop's "turbo" gaming mode.

Test 1:
The 100000000th prime is 2038074743. ( total running time: 6.40999985 seconds )


Test 2:
  DONE!						(2.78 s)


Asus G73 - i7 740 in Asus Power2Gear turbo mode


AdamRedwoods(Posted 2011) [#7]
My experiences with threading:
http://blitzmax.com/Community/posts.php?topic=93169#1064888

It seems passing objects to another thread is slow, so clumping data processing is ideal.


zzz(Posted 2011) [#8]
@Mahan, thanks. I forgot to ask: if you run that while watching a CPU monitor of your choice, does it put full load on all cores? I noticed a pretty bad load of ~60% per core when I run the second sieve. Hopefully that's room for improvement and not the GC interfering..

@Adam, I'll take a look at that :)

Here's a little something: a benchmark of sorts that does a bunch of Fibonacci sequences over and over, increasing the number of active threads by one for each iteration. It goes up to 16, which might take a few minutes.

It has the same issue as the sieve: it's not really generating much load on each core when the number of threads goes up :/ Also, it seems to be much faster to have static threads you feed new data than to generate new threads along with new data.

EDIT: Updated the code a bit. Also, please take a look at the new sieve code further down :)



My results: the number in brackets is the number of active threads, and the percentage is performance relative to the single-thread run (100% = single-thread speed). It runs a bit choppy, and seems to stall occasionally for some reason..


Phenom 2 X6 (ie 6 cores)

[1] 100%
[2] 186%
[3] 271%
[4] 263%
[5] 261%
[6] 341%
[7] 318%
[8] 315%
[9] 323%
[10] 321%
[11] 359%
[12] 373%
[13] 401%
[14] 395%
[15] 397%
[16] 394%



Last edited 2011



col(Posted 2011) [#9]
Test1:

The 100000000th prime is 2038074743. ( total running time: 6.21400023 seconds )

Test2:
I get random EAVs. Sometimes it gets to 800, sometimes 1600, sometimes immediately after stating 'STARTING MAIN SIEVE'. It never finishes.
I set MaxThreads to 4 as I have a dual-core CPU.

Sony VAIO VGN-FW31M Vista Home SP2.

@AdamRedwoods
Did you try passing objects in by reference, or passing the whole object in?


zzz(Posted 2011) [#10]
Yeah, sorry about that piece of code. I rewrote the second test and eliminated two possible causes of segfaults. It seems to be stable for me now, at least. Speed is about the same, slightly more than twice as fast as the first test.


zzz(Posted 2011) [#11]
Also, is there some special reason that we can't use local variables in the threaded code? (I assume this is the case; it seemed to cause me a lot of trouble.) This hurts performance a lot.. Using vars from an object instance instead of locals in function scope makes things over twice as slow for me in a simple test.

The second sieve will run in ~18s with just one thread and ~3s with a suitable number of threads, which is a pretty ideal increase in performance. It bothers me, however, that the non-MT version does the same thing in ~6 seconds, which the MT version needs three threads (with available cores) to match..

Last edited 2011


col(Posted 2011) [#12]
Works every time now....

done! 9.435 seconds.

found 100042089 primes. the 100000000th prime is 2038074743.


zzz(Posted 2011) [#13]
Well, that's certainly odd :s I assume it's the same computer as for the times you posted before?


Czar Flavius(Posted 2011) [#14]
Why can't you use local variables? The object reference itself will be stored in a local variable.


zzz(Posted 2011) [#15]
Well, I'm not sure, but every time I tried to rewrite the thread function to use locals instead of the instance fields I got segfaults without any obvious reason. I was just wondering whether that's an error on my part or not, which I guess it is then.


xlsior(Posted 2011) [#16]
done! 4.546 seconds.

found 100042089 primes. the 100000000th prime is 2038074743

(AMD II X4 640 Quad core 3GHz)


xlsior(Posted 2011) [#17]
AMD II X4 640, 4 cores

[1] 100%
[2] 189%
[3] 265%
[4] 250%
[5] 268%
[6] 264%
[7] 287%
[8] 293%
[9] 296%
[10] 299%
[11] 313%
[12] 298%
[13] 309%
[14] 326%
[15] 320%
[16] 328%


col(Posted 2011) [#18]
Yep, same computer. Did you not alter that code then?

Third test :-




zzz(Posted 2011) [#19]
No, the code is pretty much the same. The only thing I changed besides a bug fix was to keep threads persistent instead of spawning a new thread for every block of data. It didn't seem to affect anyone else :s

You could try changing this stuff at the top of the code:
Const SEGMENTSIZE:Int=2*3*5*7*11*13*17
Global ArrSkipValues:Int[]=[2,3,5,7,11,13,17]

to
Const SEGMENTSIZE:Int=2*3*5*7*11*13
Global ArrSkipValues:Int[]=[2,3,5,7,11,13]


Other than that I can't think of anything right now :/


Brucey(Posted 2011) [#20]
Seems to scale nicely:
Sieving range 0..2038074744 in 1997 segments.
.......
done! 1.570 seconds.

found 100042089 primes. the 100000000th prime is 2038074743



zzz(Posted 2011) [#21]
@Brucey: would you care to post your result on the first sieve, and your system specs? Just out of curiosity :p


Brucey(Posted 2011) [#22]
First results were for 24 threads.

Test 1 results :
The 100000000th prime is 2038074743. ( total running time: 5.05499983 seconds )
Done!

Best I've had for the above is about 5.02.

Specs...
I think it's 4 x 6 core Intel Xeon X5650 @ 2.67GHz
...and more RAM than you can throw a stick at... but our BlitzMax apps can only see a snippet of that, unfortunately...


zzz(Posted 2011) [#23]
Ok, thanks :p I was guessing a server-geared CPU. What number of threads did you actually use? I'd have expected that to run even faster.


Brucey(Posted 2011) [#24]
24 threads... but there's a lot of other stuff going on at the same time (2 Oracle instances, etc).

CPU shows about 1100% on average.


zzz(Posted 2011) [#25]
Ok, I figured out what I was doing wrong with the threaded function in the second sieve. The actual sieving code in both programs is now identical.

This is probably a bit much to ask for, but I'll give it a try. If you just want to see some times, then just run both pieces of code (adjust the number of threads!) and post the results. If you have some spare time, please read these instructions:

INSTRUCTIONS: start with the second sieve and read the comments at the top. When you have found a suitable segment size for your CPU, go to the first sieve; at the start of the TSoE object, the same calculation is done. Modify it so both sieves have the same value. Then run the benchmarks and post some times :)

threaded sieve:


original sieve:


Last edited 2011



zzz(Posted 2011) [#26]
Phenom 2 X6 @ stock 2.8GHz:

original sieve:
The 100000000th prime is 2038074743. ( total running time: 6.45400000 seconds )

threaded sieve:
segment memory: 240 kb
threads: 12
sieving range 0..2038074744 in 4242 segments.................
done! 1.197 seconds.

found 100005631 primes. the 100000000th prime is 2038074743

Process complete


That's some very nice scaling, I must say :) 539% of the single-threaded speed.

Last edited 2011


zzz(Posted 2011) [#27]
Intel Core 2 T6400

original sieve:
The 100000000th prime is 2038074743. ( total running time: 9.16199970 seconds )


threaded sieve:
segment memory: 240 kb
threads: 4
sieving range 0..2038074744 in 4242 segments.................
done! 5.585 seconds.

found 100005631 primes. the 100000000th prime is 2038074743


164% of the single-thread speed, good enough I guess.


col(Posted 2011) [#28]
T1:
The 100000000th prime is 2038074743. ( total running time: 7.46299982 seconds )

T2:
segment memory: 180 kb
threads: 4
sieving range 0..2038074744 in 5656 segments.......................
done! 4.218 seconds.


I found that using a value of 6 gave 180kb of segment memory and was the fastest, by 0.1 sec :p Other values slowed it slightly. It also made the cooling fan come on after several runs in quick succession :D


Floyd(Posted 2011) [#29]
Intel i7-2600, using 8 threads and PRIMEFACTORBASE*16


segment memory: 480 kb
threads: 8
sieving range 0..2038074744 in 2121 segments.........
done! 1.070 seconds.

found 100005631 primes. the 100000000th prime is 2038074743

NOTE: the time before changing 8 to 16 was 1.289 seconds.




The original test with same change from 8 to 16:

The 100000000th prime is 2038074743. ( total running time: 4.36199999 seconds )

This actually got a little slower! Time before changing 8 to 16 was 4.08400011 seconds.

In fact the time dropped to 4.053 when I changed 8 to 4.


Taron(Posted 2011) [#30]
Sounds very exciting, but for the life of me I can't compile it... I compiled and recompiled the modules, brl.threads amongst them, but I keep getting: Identifier 'Tmutex' not found

I wonder, if I somehow missed something in that regard? Any suggestions would make me very happy. Sorry to bother you guys with it, too.

WAAhhh... me soooo stupid. Never mind. Got it! Threaded Build, yes, yes... sorry about that.

As my punishment, here are the numbers I'm getting on my Mac Pro 8-core:
[1] 100%
[2] 199%
[3] 296%
[4] 383%
[5] 382%
[6] 386%
[7] 356%
[8] 399%
[9] 425%
[10] 441%
[11] 431%
[12] 490%
[13] 482%
[14] 435%
[15] 441%
[16] 462%

Last edited 2011