Would this multithread well?

BlitzMax Forums/BlitzMax Beginners Area/Would this multithread well?

JBR(Posted 2016) [#1]
Hi, have a perlin noise maker which takes about 10ms to do a frame. I'm currently making 180 frames, which loop perfectly, and store them ready for plotting.

I only change the image every 4 vsyncs, so I have 4 * 16.667 ≈ 66.7ms per image.

What I'm asking is whether this kind of thing multithreads well in the background. The image is 512x512 and it uses maybe another 300kb of data.

Would I find it hindering my main code due to too much data?

Thanks, Jim.


col(Posted 2016) [#2]
Sounds like a viable candidate for a second thread to generate the image. Assuming that you're rendering the image, maybe use 2 images in a double-buffer type of approach: one image being worked on in the second thread and one being rendered in the main thread.


JBR(Posted 2016) [#3]
Hi Dave, what commands should I be looking at ... simplest approach. Jim


grable(Posted 2016) [#4]
Threading ain't exactly simple, especially when you have to get data back from one thread without halting the main thread.

Here's one way of doing it, using atomics:
SuperStrict

Type TWorker
	Field Result:TPixmap
	Field HasResult:Int
	
	Function func:Object( data:Object)
		Local worker:TWorker = TWorker(data)
		worker.Result = CreatePixmap( 512, 512, PF_RGBA8888)
		For Local i:Int = 0 Until worker.Result.Capacity Step 4
			worker.Result.Pixels[i+0] = Rand(0,255) ' Rand(256) would return 1..256, and 256 overflows a byte
			worker.Result.Pixels[i+1] = Rand(0,255)
			worker.Result.Pixels[i+2] = Rand(0,255)
			worker.Result.Pixels[i+3] = 255
		Next
		While Not CompareAndSwap( worker.HasResult, False, True)
			Print "spinning"
		Wend
	EndFunction
EndType

Local worker:TWorker = New TWorker
CreateThread( worker.func, worker).Detach()

While Not CompareAndSwap( worker.HasResult, True, False)
	Print "waiting"
	Delay 1
Wend

Print "got data"
Normally you shouldn't spin on the wait condition like this (you could have just used WaitThread() instead). Instead, check the flag as often as you can in your main loop.

And a better approach would also reuse the thread and swap between the current and next pixmaps to save system resources, ie not allocate them each time.
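grable's swap-instead-of-reallocate idea is language-agnostic; here is a minimal sketch of it in Python (all names are mine, `threading.Lock` stands in for a BlitzMax TMutex, and two bytearrays stand in for the two pixmaps):

```python
import threading

class DoubleBuffer:
    """Two reusable buffers: a worker fills 'back' while the main
    thread reads 'front'; swap() exchanges them under a lock."""
    def __init__(self, size):
        self.front = bytearray(size)
        self.back = bytearray(size)
        self.lock = threading.Lock()

    def swap(self):
        # The swap is the only shared step, so only it needs the lock.
        with self.lock:
            self.front, self.back = self.back, self.front

db = DoubleBuffer(4)
db.back[:] = b"\x01\x01\x01\x01"   # worker writes into the back buffer
db.swap()                          # main thread flips the buffers
print(bytes(db.front))             # b'\x01\x01\x01\x01'
```

Neither buffer is ever reallocated; the threads only ever exchange which one they point at.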


Kryzon(Posted 2016) [#5]
Read the introduction to the BRL.Threads module, it's got some information:
https://en.wikibooks.org/wiki/BlitzMax/Modules/System/Threads

But most of those functions have zero explanation if you're coming at this without ever having heard of them. You'll have to consult Google or StackOverflow ("what is atomic", "what is a semaphore", etc.).

I think another way to check on the state of a thread, without having to resort to idle spinning, is to use TryLockMutex and WaitSemaphore:



EDIT: Added an extra check so the main loop doesn't lock the result mutex consecutively.


JBR(Posted 2016) [#6]
Thanks guys, I'm liking the simplicity of grable's code.

I have a question.

Why the need for CompareAndSwap

could I not just check worker.HasResult directly.

Jim


grable(Posted 2016) [#7]
could I not just check worker.HasResult directly.
Because there is no way to guarantee that the read and the write happen in the order you expect.
In the simplest case, the write could be half done when you try to read.
And with all the different cache levels in a CPU, there is no telling what state memory is in when using threads.

This is a problem that has increased with multiple cores, where two or more threads can read or write to the same address at the same time.

CompareAndSwap, AtomicAdd and AtomicSwap are atomic operations, meaning they are synchronised at the CPU level.
It is guaranteed that reads and writes to the same memory address happen in a well-defined way.

I suggest reading up on Threading and Atomics as there are many many pitfalls that stand in the way of getting things working correctly ;)
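The guarantee grable describes can be emulated in any language. Python has no CAS builtin, so this sketch (all names are mine) fakes one with a lock that makes the read-compare-write step indivisible, then uses it as a handshake flag the way the code in #4 uses HasResult:

```python
import threading

class AtomicInt:
    """Compare-and-swap emulation: the lock makes the
    read-compare-write sequence a single indivisible step."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, old, new):
        with self._lock:
            if self._value == old:
                self._value = new
                return True
            return False

has_result = AtomicInt(0)

def worker():
    # ... produce the data, then publish the flag atomically
    has_result.compare_and_swap(0, 1)

t = threading.Thread(target=worker)
t.start()
t.join()
# Main thread: consume the flag (1 -> 0), like grable's wait loop.
assert has_result.compare_and_swap(1, 0)
```

The point is that the flag flips exactly once per handoff; a plain `has_result = 1` assignment gives no such ordering guarantee across threads.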


JBR(Posted 2016) [#8]
ok, but is the actual Result:TPixmap guaranteed to contain the correct values when I use it after the "got data" is printed?

Also, does .Detach() just remove the thread when it's done?

Sorry for my beginner questions. Jim.


Kryzon(Posted 2016) [#9]
I should add that when I was studying threading in BlitzMax I learned a lot from the user tutorials:

- http://www.blitzbasic.com/Community/posts.php?topic=91458
- http://www.blitzbasic.com/Community/posts.php?topic=80677

They answer questions like what does the Detach method do.


col(Posted 2016) [#10]
Sorry to come in late here guys...

You can read and write to the same variable without locking when you completely understand the nuances of the Read-Modify-Write paradigm. In the case of simply exiting a worker thread that's looping, you can decide whether it matters if the loop goes around once more because the read was made while the variable was being updated. The problem comes when modifying the contents of an address (ie a variable) isn't atomic but you are relying on it being so. This is where understanding the Read-Modify-Write process is invaluable, and the biggest advice I can give would be to truly understand it. You should also read up on and understand 'out of order execution' too. Google will be your best friend here.

In BlitzMax there are a couple of multithread primitives that can be used. Multithreading is anything but simple and has to be planned out well to be robust. You will always need to think about how many threads can access what data and at what times, and whether any variables need to be protected from being accessed by more than 1 thread. It's always best to plan ahead with threading and ALWAYS cater for something that is extremely unlikely to happen - believe me, if there's a chance that 2 or more threads could access the same data at the same time, no matter how slim that chance is, then don't take the chance.

Multithreaded solutions can be done in as many different ways as single-threaded ones.

Enough of the lecture :D on with something practical...
From what you describe I'd write something that's similar to this:

1. Create a pool of TImages. As long as you only DrawImage in the main thread, and never in the second thread, then using TImages in other threads is perfectly safe. TImages use a TPixmap for the underlying pixel data.
2. Create a multithread-safe queue that can be accessed by both threads. You'll want the thread that's creating the images to wait if the queue is full, but the thread taking images is allowed to carry on if the queue is empty.

The flow of code would be this...
You have a pool of existing empty TImages. It can be any number of images, but due to how I've written the code you need a number of images that is a power of 2. You allow only one image to be taken from the pool at a time, and you use a 'round robin' index to know which image is next, so that you can keep reusing the same images/pixmaps.
You have a multithread queue that holds the same number of items as there are TImages in the pool, so that you can take advantage of some multithread syncing constructs - in this case a semaphore.

In the main thread: query the queue to see if a TImage is available and ready. If no image is available then no image is returned (Null, for example); if an image is available then DrawImage that TImage. The main thread is free to carry on doing something else whether or not it gets a new image from the queue. You need to cater for both cases - image ready and image not ready. You never know if the OS will step in for some arcane reason and prevent your code running as fast as you're expecting - this is very unlikely, but these are the kinds of things that can trip up your code and break everything.

In the second thread you take an image from the image pool, do its image manipulation, and then put that image into the queue ready for the main thread to take when it wants it. When the queue is full you have that thread sleep so that it's not hogging a CPU core for nothing. This also allows the OS to run much more smoothly, and if the OS is running smoothly then so is your code.

Using that approach brings the issue down to making sure that two threads can't access the same data in the queue at the same time.

Here's a working example using 4 pre-existing images that the queue rotates between. You could move the components around, for example putting the image pool inside the queue, but this approach simply keeps things as they 'should be': a pool of images is a pool, and the queue is just a queue. The queue is modified from a regular queue in that the thread putting items in will wait when the queue is full, while the thread taking items out won't wait if no item is available.
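The pool-plus-queue flow described above can be sketched in Python (its `queue.Queue` already behaves like the queue described: `put()` blocks when full, `get_nowait()` never waits). All names here are mine, and stand-in bytearrays play the role of pooled TImages:

```python
import queue
import threading

POOL_SIZE = 4
FRAMES = 8
images = [bytearray(16) for _ in range(POOL_SIZE)]  # stand-in image pool
ready = queue.Queue(maxsize=POOL_SIZE)  # producer blocks when this is full

def producer():
    index = 0
    for frame in range(FRAMES):
        img = images[index]
        img[0] = frame                    # pretend to render into the pooled image
        ready.put(img)                    # sleeps (not spins) if the queue is full
        index = (index + 1) % POOL_SIZE   # round-robin through the pool

t = threading.Thread(target=producer)
t.start()

drawn = 0
while drawn < FRAMES:
    try:
        img = ready.get_nowait()          # never blocks the "main thread"
        drawn += 1                        # pretend to DrawImage here
    except queue.Empty:
        pass                              # no image ready: carry on with other work
t.join()
```

The queue's capacity is what bounds how many pooled images can be "in flight" at once; as in col's design, the consumer is still expected to be done with an image by the time the round-robin index comes back to it.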




So there are a couple of multithread 'constructs' in there.
A mutex is a special object that can restrict other threads from entering the same code path at the same time. This makes a section of code 'mutually exclusive', ie there can be only one - which refers to the number of threads allowed into that code path. People often talk of using a mutex to protect a variable; this is confusing, as it's not actually what a mutex does. While a mutex is locked, no other thread can lock it, and any attempt to do so will make that thread wait until the mutex is unlocked. So the real use of a mutex is to create a critical section of code that can only be entered by one thread at a time. You then take advantage of that to update the variables that you want to guarantee are updated correctly. There are ways to avoid the thread waiting if the mutex is already locked, for example the TryLockMutex function: TryLockMutex won't make the thread wait if the mutex is already locked, but that thread isn't allowed into the critical section either.
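The critical-section idea can be shown with a toy counter; Python's `threading.Lock` plays the role of a TMutex here (a sketch, not BlitzMax API):

```python
import threading

counter = 0
mutex = threading.Lock()

def add_many():
    global counter
    for _ in range(100_000):
        with mutex:        # only one thread may be inside this block at a time
            counter += 1   # the read-modify-write is now safe

threads = [threading.Thread(target=add_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000: no increments were lost
```

Without the lock, two threads can read the same old value and both write back old+1, losing increments - exactly the Read-Modify-Write hazard described above.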

There's also a semaphore used in there. A semaphore is like a counter: you create it with an integer value that is its initial count. Each time you call WaitSemaphore, if the counter is greater than zero it is decremented and the thread continues execution; if the counter is zero, the thread is put to sleep, waiting until the counter is no longer zero. When you PostSemaphore, the counter is incremented, and if it was zero then one single thread that was waiting will be woken and can carry on its execution. If multiple threads are waiting on a single semaphore, the OS decides which thread to wake. The updating of the internal counter is atomic, which is what allows you to call WaitSemaphore and PostSemaphore from any thread - careful planning is needed to take advantage of this construct, but used correctly it is very powerful.
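The counter behaviour just described can be demonstrated with Python's `threading.Semaphore` (`acquire` ≈ WaitSemaphore, `release` ≈ PostSemaphore; the timeouts are only there to keep the demo from sleeping forever):

```python
import threading

sem = threading.Semaphore(2)      # internal counter starts at 2

print(sem.acquire(timeout=0.1))   # True:  counter 2 -> 1
print(sem.acquire(timeout=0.1))   # True:  counter 1 -> 0
print(sem.acquire(timeout=0.1))   # False: counter is 0, the wait times out

sem.release()                     # counter 0 -> 1; would wake one sleeping waiter
print(sem.acquire(timeout=0.1))   # True:  counter 1 -> 0 again
```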


Silver_Knee(Posted 2016) [#11]
Hi,
I didn't understand: do you generate them once and use them later, or do you generate one (or all 180) every ~65ms?

If you want to generate them every ~65ms you could use the backbuffer/frontbuffer idea: two pixmaps that don't interfere with each other. You need 2 semaphores so that if the worker is fast it won't produce endless frames.

Type TWorker
  Field pixmap:TPixmap[2] 'back and front buffer
  Field usePixmap:Int 'defines if pixmap[0] or pixmap[1] is the backbuffer 
  Field isDone:Int 'style points
  Field producerSemaphore:TSemaphore=CreateSemaphore(0) 'start at 0: no frame ready yet
  Field consumerSemaphore:TSemaphore=CreateSemaphore(0)

  Method New()
    pixmap[0]=CreatePixmap( 512, 512, PF_RGBA8888)
    pixmap[1]=CreatePixmap( 512, 512, PF_RGBA8888)
  End Method

  Method GetNextPixmap:TPixmap()
    producerSemaphore.Wait() 'this will wait until the work is done
    
    Local returnPixmap:Int 'declared outside the loop so it survives past Until
    Repeat
      returnPixmap = usePixmap
    Until CompareAndSwap(usePixmap, returnPixmap, Not returnPixmap)
    
    consumerSemaphore.Post() 'release the worker
    
    Return pixmap[returnPixmap]
  End Method

  Function Work:Object(data:Object)
    Local worker:TWorker=TWorker(data)
    
    While Not worker.isDone
      Local p:TPixmap = worker.pixmap[worker.usePixmap]
      For Local i:Int = 0 Until p.capacity Step 4
        p.Pixels[i+0] = Rand(0,255)
        p.Pixels[i+1] = Rand(0,255)
        p.Pixels[i+2] = Rand(0,255)
        p.Pixels[i+3] = 255
      Next
      
      worker.producerSemaphore.Post() 'release GetNextPixmap
      worker.consumerSemaphore.Wait() 'now we wait for GetNextPixmap
    Wend
  End Function
End Type

Local worker:TWorker=New TWorker
CreateThread( TWorker.Work, worker )

'Main loop
Local currentPixmap:TPixmap
Local timeHasComeToGetANewPixmap:Byte=True
Repeat
  If timeHasComeToGetANewPixmap
    currentPixmap=worker.GetNextPixmap();
  EndIf
  '...
Forever


If you generate them once and use them later, you can wait for all 180 frames to finish with one semaphore.

Type TWorker
  Global semaphore:TSemaphore
  Field pixmap:TPixmap

  Method New()
    pixmap=CreatePixmap( 512, 512, PF_RGBA8888)
  End Method

  Method GetPixmap:TPixmap()
    Return pixmap
  End Method

  Function Work:Object(data:Object)
    Local worker:TWorker=TWorker(data)
    
    For Local i:Int = 0 Until worker.pixmap.capacity Step 4
      worker.pixmap.Pixels[i+0] = Rand(0,255)
      worker.pixmap.Pixels[i+1] = Rand(0,255)
      worker.pixmap.Pixels[i+2] = Rand(0,255)
      worker.pixmap.Pixels[i+3] = 255
    Next
      
    semaphore.Post()
  End Function
End Type

Local workerCount:Int=180
TWorker.semaphore=CreateSemaphore(0) 'start at 0; each worker Posts once when done
Local worker:TWorker[]=New TWorker[workerCount]

For Local i:Int=0 Until workerCount
  worker[i]=New TWorker
  CreateThread( TWorker.Work, worker[i] )
Next

For Local i:Int=0 Until workerCount
  TWorker.semaphore.Wait() 'one Wait per unit of work
Next


In general: you have a Producer/Consumer problem. Semaphores are just right for that: Post signals that one unit of work is done, and Wait waits until at least one unit is available - so to wait for all units, call Wait once per unit. The initial count passed to CreateSemaphore defines how many units are available to begin with. You will need one semaphore for each thread that should be waiting.
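The "wait for all units of work" usage can be sketched in Python (`Semaphore.release` ≈ Post, `acquire` ≈ Wait; all names are mine):

```python
import threading

N = 8
done = threading.Semaphore(0)   # start at 0: nothing finished yet
threads = []

def worker(i):
    # ... do one unit of work here ...
    done.release()              # Post: signal "one unit done"

for i in range(N):
    t = threading.Thread(target=worker, args=(i,))
    t.start()
    threads.append(t)

for _ in range(N):              # Wait once per unit of work
    done.acquire()
print("all", N, "workers finished")
for t in threads:
    t.join()
```

Because each worker Posts exactly once, the N acquires cannot all succeed until every worker has finished - a semaphore-based join.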

In the case you have a "this bit of code should only be executed by one thread" problem, mutexes are the better solution.

Also, using semaphores and mutexes is always a better solution than spinning in a loop checking for a variable to change. They are - in theory - low-level implementations that genuinely tell the processor that this thread will not be useful until the mutex is unlocked or the semaphore's counter rises above 0, so the scheduler can ignore the thread until then.
If the thing you are waiting for is a little more complicated and you really do need some checking code to decide whether your thread can continue, then CondVars are your tool. They let you Wait until someone that knows that CondVar wakes your thread up - like "hey, something happened". You can then check all your conditions and go back to sleep if they aren't met. Every waiting thread needs a mutex to use with the condvar. So you have low-level waiting again, and you only check your conditions when it matters rather than polling every 10 milliseconds.
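A minimal condvar round-trip, using Python's `threading.Condition` and following the wake-then-recheck pattern described above (names are mine):

```python
import threading

cond = threading.Condition()    # bundles the mutex and the condvar
frames_ready = 0

def producer():
    global frames_ready
    with cond:                  # take the mutex before touching shared state
        frames_ready += 1
        cond.notify()           # "hey, something happened"

t = threading.Thread(target=producer)
t.start()

with cond:
    while frames_ready < 1:     # always re-check the condition after a wake-up
        cond.wait()             # releases the mutex while sleeping
print("frames ready:", frames_ready)
t.join()
```

The `while` (not `if`) around `wait()` is the important part: a woken thread must verify its condition actually holds before carrying on.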

Greez

EDIT: Whoa, I wrote on that for an hour ^^ Well, something like this can't be retold enough.


Kryzon(Posted 2016) [#12]
When you only use semaphores you're establishing that the main thread (the consumer) will wait for the producer threads to finish - the operation will not be asynchronous.
The main thread could be doing other processing in the meantime if you're talking about a real-time game, or if the application has a GUI and you want it to be responsive and not "block" while the work is being done. If you're using MaxGUI and you block the main thread on a semaphore, the application looks frozen.

With both a semaphore and a mutex you can use TryLockMutex, which never blocks the caller. When you use this in the main thread (the consumer) you can let it process other things if the mutex is already locked by some other thread, as illustrated in post #5.
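The TryLockMutex idea in Python form: `Lock.acquire(blocking=False)` is the analogue, returning immediately instead of waiting (a sketch; names are mine):

```python
import threading

result_mutex = threading.Lock()

# Pretend the worker thread currently holds the mutex while it renders.
result_mutex.acquire()

# Main thread, once per frame: a TryLockMutex-style check that never blocks.
got_it = result_mutex.acquire(blocking=False)
print(got_it)           # False: the worker is still busy, carry on with other work

result_mutex.release()  # the worker finishes and unlocks

got_it = result_mutex.acquire(blocking=False)
print(got_it)           # True: the result is free to take this frame
result_mutex.release()
```

As col found in #13 below, a successful try-lock only proves nobody holds the mutex right now - it says nothing about whether the work has actually been done, which is why a separate "state" flag is still needed.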


col(Posted 2016) [#13]
@Kryzon,

I was just experimenting with your code,
If you put a delay of, say, 2000 in the 2nd thread (say at line 47) then you still get an immediate message that the pixmap is ready, even though the pixmap can't be ready until the delay has passed. Am I missing something?


JBR(Posted 2016) [#14]
I'm afraid I'm all at sea atm.

Would it help if I published my animated perlin routine?

Jim


col(Posted 2016) [#15]
It certainly wouldn't hurt :-)


Kryzon(Posted 2016) [#16]
Hey col, good catch.
I think the problem comes from the fact that the code assumes that if the main thread (the consumer) can successfully lock the pixmap mutex of the producer then a pixmap will be ready, but that's not necessarily true.

So I imagine a better way to do it would be to have a "state" value that can be verified:




Derron(Posted 2016) [#17]
@Kryzon
Your code in #5 does not fail because the code is able to "(Try)LockMutex()" a locked mutex... if that didn't work, I'd assume the threading code of BlitzMax was borked.

Without testing, I'm more inclined to say that this line is the culprit:
If semaphoreCount = 0 Then Return TryLockMutex( resultMutex )


You already recognized, that you need a custom "state" variable. Why?

This is your logic run:

a) create a worker and assign a function
b) loop until worker is finished


Now the important part:
a) creates a worker with an _unlocked_ mutex and detaches the thread
b) is run _parallel_ to the thread (no sequential chain anymore)!

CPUTick 000: a)
CPUTick 500: b) isReady()? worker.func() was not called yet
CPUTick 502: a1) worker.func() with LockMutex()

So instead of a "finished" state, you could even use a "WhileLoopCount" or whatever (whileRun > 0 ...)


I use that TMutex just to see whether I am able to modify the "mutexed" variable right now. The mutex cannot tell me whether something has modified the variable at least _once_. It is just a "currently in use" property.


bye
Ron


JBR(Posted 2016) [#18]
Here is the animated perlin noise code. I should mention I adapted sswift's code from the archive, 2003 I think, and enhanced.

I'd like to see it run in the background and change the image every 4 frames.

Jim.




dw817(Posted 2016) [#19]
This is beautiful, JBR! Here is how I would do it. And yes, I can SORT OF make a Perlin image, I did so on the Commodore Amiga. Easier to just nab this image tho:




Brucey(Posted 2016) [#20]
Here is the animated perlin noise code

You can never have too many Globals :-p


grable(Posted 2016) [#21]
. double post


grable(Posted 2016) [#22]
Here you go :)

The generation time under power saving was lowest=49, highest=72, avg=60; updated figures are 12/13, though not very stable.
And it only has to wait 1 more frame whenever it exceeds the time needed.
I'm running on a 6700K though, your mileage may vary.

EDIT: Fixed bug that made it wait the full 4 frames when time exceeded.
EDIT2: Fixed another bug in Terminate() not terminating.




dw817(Posted 2016) [#23]
Just out of curiosity guyz, what minimal code for Blitzmax could there be to make a perfect seamless Perlin Noise screen, not animated, just a drawn field, say 512x512 pixels ?


Brucey(Posted 2016) [#24]
Stats are interesting here (using grable's code).

For Windows (7, running in a VM on my Mac - VM has 2 cores and 6GB), I get generation at 15/16 (min/max) for legacy BlitzMax. For 64-bit NG, I'm getting 4/5 (min/max). That's quite a bit faster.


Brucey(Posted 2016) [#25]
After some optimisations, I get 14 avg on legacy, and a constant 4 on NG 64-bit.
Optimisations generally involve working with pointers to the arrays rather than the arrays themselves - array access has some overhead.
Ideally one could also consider tweaking the code to assist the CPU cache.




grable(Posted 2016) [#26]
Forgot i was running in powersaving mode ;) Updated my figures above, 12/13.

Your sample though jumps wildly for me, Brucey: 10/48. I wonder why that is.
(I was running the wrong sample at first, so both of them jump wildly. I wonder why it's so stable on your end...)
Update: your sample, Brucey, runs at 10/12. And with bmx-ng x86 I get 3/4, and x64 3/3. That is pretty impressive :)

I did some tweaking of my own, but could only reduce it to 10/13 - removing compares and simplifying the NoiseMap selection and the transfer to the pixmap.
I worked on the original though, before I ported it over to the threaded one. There it shaved off 25-30ms, under power saving mode.

Probably has something to do with cache locality, yeah. The thread doesn't have to share its part of the cache with the main thread, which may be why it's hard to optimize the threaded version. Except for bmx-ng of course - gcc optimizes well compared to vanilla bmx ;)

EDIT: Juggling too many files at once leads to running the wrong thing :(
EDIT2: Damn windows and its power modes, double :(

original

threaded



grable(Posted 2016) [#27]
I figured out why it was jumping so wildly: it seems even the Balanced power mode in Windows likes to jump like crazy between the lowest and max MHz of the CPU.
Only High Performance gives a stable time, same as Brucey.
Of course the CPU then runs at max MHz all the time, so it will generate more heat. But this explains the wild jumping at least.

And here I was under the impression that under heavy load it would stay at max MHz (under Balanced). Either Windows sucks at it, or running that sample wasn't heavy enough for it to care :/


Brucey(Posted 2016) [#28]
Of course the CPU then runs at max mhz all the time

Sticking a Delay 1 in here may help :
While Not CompareAndSwap( worker.HasResult, False, True); Wend

Otherwise it will run at 100% until hasResult is reset (which could be for 4 * 16ms or so?).


grable(Posted 2016) [#29]
Otherwise it will run at 100% until hasResult is reset (which could be for 4 * 16ms or so?).
It's pretty much guaranteed to be False, since one has to GetResult() before RequestResult().

I was talking about the CPU in general, meaning High Performance makes the CPU stay at 4000MHz (max without turbo) no matter the load.
But after running like that for a while it only increased 4 degrees (C) when idling, so not a big deal after all ;)

I don't have the mightiest of coolers, and under heavy load for several hours it can hit 68.
Which was why I was a bit worried... I should really get a better cooler though, one with a way larger heatsink hehe.


Brucey(Posted 2016) [#30]
Its pretty much guaranteed to be False

Oh yeah. Never mind ;-)

My 6500k is buried behind a 5k screen. I've never even heard the fan yet... although after some more months gathering dust I expect to hear it more often.


Derron(Posted 2016) [#31]
Brucey:
I've never even heard the fan yet

You know - the older, the less you hear.
Uhm ... seriously: you just wanted to use the chance to mention that 5k screen ... :-p



I assume GCC optimizes away these local variables (as each is only needed once, which means we could write them all as one big formula):
						Local ICx# = 1.0 - cosine#[ Int(Ix# * 1024.0 ) ]					'Local ICx# = 1.0 - ((Cos(Ix#*180.0) + 1.0) / 2.0)

						Local Na# = N1#*(1.0-ICx#)
						Local Nb# = N2#*ICx#
						Local Nc# = N3#*(1.0-ICx#)
						Local Nd# = N4#*ICx#
						
						HeightMap#[Hx+Height_X +  (Hy+Height_Y)*513] :+ (Na#+Nb#)*(1.0-ICy#) + (Nc#+Nd#)*ICy#

So this might shave off some nanoseconds too (in vanilla, when done as 1 formula).


bye
Ron


col(Posted 2016) [#32]
@grable
I've always found the power options too misleading for profiling too so now I have that power profile set to 'performance' and adjust the power saving options within that profile.



And my lonesome take on it using a multithread queue...

intel i5 3570 3.4ghz
gfx 750

Vanilla 14ms
NG 5ms




Derron(Posted 2016) [#33]
Col's code:
Vanilla: ~44
NG: ~30

PC: AMD (LLano quadcore, some years old :-))
OS: Linux Mint 64bit
GCC: 4.8.4


@Print
Doesn't "print" delay things a bit ?

Also: you should avoid "Print" from within threads - the calls might interfere with other Print calls. In the example above it's not used "critically", but I just want to make you aware of this.


bye
Ron


col(Posted 2016) [#34]
@Print

As you know, not in this example, as it's outside the code path being timed and there's plenty of time left in that thread. But yes, writing to a real console window is dog slow and not thread safe; within the Max editor it seems to be very fast here. Your point is very valid.


Brucey(Posted 2016) [#35]

Vanilla: ~44
NG: ~30


As you can see, GCC's optimisations work better on more modern architectures. In this example an average of x3, which is not insignificant.


col(Posted 2016) [#36]
I've updated my example to show the image generation time taken in the 2nd thread, and also the cost to the main thread for that image - ultimately the main thread can carry on about its business with as-good-as zero cost.


As you can see, GCC's optimisations work better on more modern architectures. In this example an average of x3, which is not insignificant.


On top of which gcc is also outputting some simd vector optimized code in places allowing it to blow vanilla bmax out of the water - nice :-)


Brucey(Posted 2016) [#37]
gcc is also outputting some simd vector optimized code in places

NG's array data is now 16-byte aligned. Dunno if that helps any?


JBR(Posted 2016) [#38]
Thanks for the help!

Looks like I should be writing code in 'NG' - what is it?

Newbie question - why does the time taken vary so much from 11 to 46 on my PC?

Jim


grable(Posted 2016) [#39]
why does the time taken vary so much from 11 to 46 on my PC?
See my previous posts concerning power saving in Windows. Might be related.


JBR(Posted 2016) [#40]
Hi,

Seems like NG is a bit tricky to set up!

I decided to use col code as it was a bit neater in the main code bit.

It's nice seeing it run without any extra processing in main loop.

Thanks everyone who helped.

Jim.