Threading Performance (more primes)

BlitzMax Forums/BlitzMax Programming/Threading Performance (more primes)

zzz(Posted 2011) [#1]
Figured it's that time again.. So I finally got a good reason (i.e. a new computer) to mess around a bit with multithreading in BlitzMax. The best candidate for MT I had around was the prime sieve I wrote and posted on Nate's thread months ago, and here's the resulting code:

EDIT: new code posted further down.

Last edited 2011


zzz(Posted 2011) [#2]
Times on my old system:
Athlon 1900 @ 1.6GHz / 512MB RAM / Ubuntu 10-something

test 1: 56.3330002 seconds
test 2: 109 seconds / 2 threads


New system:
Phenom 2 X6 @ 3.36GHz / 16GB RAM / Ubuntu 11.04

test 1: 5.11600018 seconds
test 2: 2.17 seconds / 12 threads


Last edited 2011


Czar Flavius(Posted 2011) [#3]
Me too, the speed of the garbage collector would be very important for a non-trivial game.


Mahan(Posted 2011) [#4]
Test 2:
  DONE!						(3.13 s)


Asus G73 - i7 740

Not bad for a laptop? :)

There seems to be some threading problem in test 2. Sometimes it crashes with an access violation.


zzz(Posted 2011) [#5]
@Mahan: Yeah, I didn't put too much effort into the rewrite :p I'm not really sure why it happens, since it usually finishes with what seem to be correct results.

Also, could you please post results for the first test? :)

Last edited 2011


Mahan(Posted 2011) [#6]
Sure.

New timings. I realized I had several IDEs and a VM running, which is not optimal for speed tests, so I closed them and turned on the laptop's "turbo" gaming mode.

Test 1:
The 100000000th prime is 2038074743. ( total running time: 6.40999985 seconds )


Test 2:
  DONE!						(2.78 s)


Asus G73 - i7 740 in Asus Power2Gear turbo mode


AdamRedwoods(Posted 2011) [#7]
My experiences with threading:
http://blitzmax.com/Community/posts.php?topic=93169#1064888

It seems passing objects to another thread is slow, so clumping data processing is ideal.


zzz(Posted 2011) [#8]
@Mahan, thanks. I forgot to ask: if you run that while watching a CPU monitor of your choice, does it put full load on all cores? I noticed a pretty bad load of ~60% per core when I run the second sieve. Hopefully that's room for improvement and not the GC interfering..

@Adam, I'll take a look at that :)

Here's a little something: a benchmark of sorts that does a bunch of Fibonacci sequences over and over, increasing the number of active threads by one for each iteration. It goes up to 16, which might take a few minutes.

It has the same issue as the sieve: it's not really generating much load on each core when the number of threads goes up :/ Also, it seems to be much faster to have static threads you feed new data than to generate new threads along with new data.

EDIT: Updated the code a bit. Also, please take a look at the new sieve code further down :)



My results: the number in brackets is the number of active threads, and the percentage is performance relative to the single-thread run (100% = single-thread speed). It runs a bit choppy, and seems to stall occasionally for some reason..


Phenom 2 X6 (ie 6 cores)

[1] 100%
[2] 186%
[3] 271%
[4] 263%
[5] 261%
[6] 341%
[7] 318%
[8] 315%
[9] 323%
[10] 321%
[11] 359%
[12] 373%
[13] 401%
[14] 395%
[15] 397%
[16] 394%



Last edited 2011



col(Posted 2011) [#9]
Test1:

The 100000000th prime is 2038074743. ( total running time: 6.21400023 seconds )

Test2:
I get random EAVs. Sometimes it gets to 800, sometimes 1600, sometimes immediately after stating 'STARTING MAIN SIEVE'. It never finishes.
I set MaxThreads to 4 as I have a dual-core CPU.

Sony VAIO VGN-FW31M Vista Home SP2.

@AdamRedwoods
Did you try passing objects in by reference, or passing the whole object in?


zzz(Posted 2011) [#10]
Yeah, sorry about that piece of code. I rewrote the second test and eliminated two possible causes of segfaults. It seems to be stable for me now, at least. Speed is about the same, slightly more than twice as fast as the first test.


zzz(Posted 2011) [#11]
Also, is there some special reason that we can't use local variables in the threaded code? (I assume this is the case; it seemed to cause me a lot of trouble.) This hurts performance a lot.. Using vars from an object instance instead of locals in function scope makes things over twice as slow for me in a simple test.

The second sieve will run in ~18s with just one thread and ~3s with a suitable number of threads, which is a pretty ideal increase in performance. It bothers me, however, that the non-MT version does the same thing in ~6 seconds, which the MT version needs three threads (with available cores) to match..

Last edited 2011


col(Posted 2011) [#12]
Works every time now....

done! 9.435 seconds.

found 100042089 primes. the 100000000th prime is 2038074743.


zzz(Posted 2011) [#13]
Well, that's certainly odd :s I assume it's the same computer as for the times you posted before?


Czar Flavius(Posted 2011) [#14]
Why can't you use local variables? The object reference itself will be stored in a local variable.


zzz(Posted 2011) [#15]
Well, I'm not sure, but every time I tried to rewrite the thread function to use locals instead of the instance fields I got segfaults without any obvious reason. I was just wondering whether that's an error on my part or not, which I guess it is then.


xlsior(Posted 2011) [#16]
done! 4.546 seconds.

found 100042089 primes. the 100000000th prime is 2038074743

(AMD II X4 640 Quad core 3GHz)


xlsior(Posted 2011) [#17]
AMD II X4 640, 4 cores

[1] 100%
[2] 189%
[3] 265%
[4] 250%
[5] 268%
[6] 264%
[7] 287%
[8] 293%
[9] 296%
[10] 299%
[11] 313%
[12] 298%
[13] 309%
[14] 326%
[15] 320%
[16] 328%


col(Posted 2011) [#18]
Yep, same computer. Did you not alter that code then?

Third test :-




zzz(Posted 2011) [#19]
No, the code is pretty much the same. The only thing I changed besides a bug fix was to keep threads persistent instead of spawning a new thread for every block of data. It didn't seem to affect anyone else :s

You could try changing this stuff at the top of the code:
Const SEGMENTSIZE:Int=2*3*5*7*11*13*17
Global ArrSkipValues:Int[]=[2,3,5,7,11,13,17]

to
Const SEGMENTSIZE:Int=2*3*5*7*11*13
Global ArrSkipValues:Int[]=[2,3,5,7,11,13]


Other than that I can't think of anything right now :/


Brucey(Posted 2011) [#20]
Seems to scale nicely:
Sieving range 0..2038074744 in 1997 segments.
.......
done! 1.570 seconds.

found 100042089 primes. the 100000000th prime is 2038074743



zzz(Posted 2011) [#21]
@Brucey: would you care to post your result on the first sieve, and your system specs? Just out of curiosity :p


Brucey(Posted 2011) [#22]
First results were for 24 threads.

Test 1 results :
The 100000000th prime is 2038074743. ( total running time: 5.05499983 seconds )
Done!

Best I've had for the above is about 5.02.

Specs...
I think it's 4 x 6 core Intel Xeon X5650 @ 2.67GHz
...and more RAM than you can throw a stick at... but our BlitzMax apps can only see a snippet of that, unfortunately...


zzz(Posted 2011) [#23]
Ok, thanks :p I was guessing a server-geared CPU. What number of threads did you actually use? I'd have expected that to run even faster.


Brucey(Posted 2011) [#24]
24 threads... but there's a lot of other stuff going on at the same time (2 Oracle instances, etc).

CPU shows about 1100% on average.


zzz(Posted 2011) [#25]
Ok, I figured out what I was doing wrong with the threaded function in the second sieve. The actual sieving code in both programs is now identical.

This is probably a bit much to ask for, but I'll give it a try. If you just want to see some times, then just run both pieces of code (adjust the number of threads!) and post the results. If you have some spare time, please read these instructions:

INSTRUCTIONS: start with the second sieve and read the comments at the top. When you have found a suitable segment size for your CPU, go to the first sieve; at the start of the TSoE object, the same calculation is done. Modify it so both sieves have the same value. Then run the benchmarks and post some times :)

threaded sieve:


original sieve:


Last edited 2011



zzz(Posted 2011) [#26]
Phenom 2 X6 @ stock 2.8GHz:

original sieve:
The 100000000th prime is 2038074743. ( total running time: 6.45400000 seconds )

threaded sieve:
segment memory: 240 kb
threads: 12
sieving range 0..2038074744 in 4242 segments.................
done! 1.197 seconds.

found 100005631 primes. the 100000000th prime is 2038074743

Process complete


That's some very nice scaling, I must say :) 539% of the single-threaded speed.

Last edited 2011


zzz(Posted 2011) [#27]
Intel Core 2 T6400

original sieve:
The 100000000th prime is 2038074743. ( total running time: 9.16199970 seconds )


threaded sieve:
segment memory: 240 kb
threads: 4
sieving range 0..2038074744 in 4242 segments.................
done! 5.585 seconds.

found 100005631 primes. the 100000000th prime is 2038074743


164% of the single-thread speed, good enough I guess.


col(Posted 2011) [#28]
T1:
The 100000000th prime is 2038074743. ( total running time: 7.46299982 seconds )

T2:
segment memory: 180 kb
threads: 4
sieving range 0..2038074744 in 5656 segments.......................
done! 4.218 seconds.


I found that using a value of 6 gave 180kb of segment memory and was the fastest, by 0.1 sec :p Other values slowed it slightly. It also made the cooling fan come on after several runs in quick succession :D


Floyd(Posted 2011) [#29]
Intel i7-2600, using 8 threads and PRIMEFACTORBASE*16


segment memory: 480 kb
threads: 8
sieving range 0..2038074744 in 2121 segments.........
done! 1.070 seconds.

found 100005631 primes. the 100000000th prime is 2038074743

NOTE: the time before changing 8 to 16 was 1.289 seconds.




The original test with same change from 8 to 16:

The 100000000th prime is 2038074743. ( total running time: 4.36199999 seconds )

This actually got a little slower! Time before changing 8 to 16 was 4.08400011 seconds.

In fact the time dropped to 4.053 when I changed 8 to 4.


Taron(Posted 2011) [#30]
Sounds very exciting, but for the life of me I can't compile it... I compiled and recompiled the modules, brl.threads amongst them, but I keep getting: Identifier 'Tmutex' not found

I wonder, if I somehow missed something in that regard? Any suggestions would make me very happy. Sorry to bother you guys with it, too.

WAAhhh... me soooo stupid. Never mind. Got it! Threaded Build, yes, yes... sorry about that.

As my punishment, here are the numbers I'm getting on my Mac Pro 8-core:
[1] 100%
[2] 199%
[3] 296%
[4] 383%
[5] 382%
[6] 386%
[7] 356%
[8] 399%
[9] 425%
[10] 441%
[11] 431%
[12] 490%
[13] 482%
[14] 435%
[15] 441%
[16] 462%

Last edited 2011