Threads and Cores

BlitzMax Forums/BlitzMax Programming/Threads and Cores

Brucey

(Posted 2007) [#1]

A curiosity question while I've been looking at multi-threading and BlitzMax....

If I had a multi-core/processor architecture, would threads run on a single core or be intelligent enough to use other/idle cores?
Or does this level of utilization require extra programming on top of simply creating and running threads (ie. "hmmm, core 3 isn't busy at the moment, so I'll run thread X on that one") ?

:-)

ImaginaryHuman

(Posted 2007) [#2]

If the cpu has more than one core, the operating system should be designed to support it and it is up to the o/s to decide which core would be used for which threads to run. It's not quite the same as symmetric multi processing where you have more than one physical CPU but it's pretty similar. So long as you are then using the o/s's thread system, ie they are `official` threads, then the o/s should manage them for you and give them time on both cores simultaneously. That is, provided some other tasks are not hogging one or both of the cores in some random fashion. I doubt there is anything you can do to deliberately ask for something to be run on a specific core/cpu but maybe there is. I would be very surprised if you as the programmer ever had to deal with how to allot execution time to a given core or cpu. The o/s is meant to handle it.

ImaginaryHuman

(Posted 2007) [#3]

FlameDuck

(Posted 2007) [#4]

Like AngelDaniel said, if the scheduler supports "kernel threads" (as opposed to "user threads") it should be able to utilize the second core/HyperThreading/extra CPU.

Winni

(Posted 2007) [#5]

Several OSs also support "thread affinity", that means you can assign a thread to a specific processor. The Windows NT family is among those systems, and as the system administrator you can actually set the "affinity" of a process in the Windows task manager.

Usuallly all modern operating systems handle the load balancing for you, which means the system tries to distribute the work load evenly over the available processors/cores. It is nothing a programmer should have to care about directly.

However, all of this is only valid for symmetrical multiprocessor architectures, where the CPUs actually share the system bus AND memory. When you distribute the work load over multiple independent machines, the whole story becomes much more complex and requires a different software design with more synchronization and communication overhead.

The question is: When will BlitzMax support multithreading?

Azathoth

(Posted 2007) [#6]

The biggest problem I think is the GC isn't thread safe.

Brucey

(Posted 2007) [#7]

I don't see why it can't already?

I've been reading up on threading and the likes, and as far as I can tell, if you follow the "rules", there's no reason why it wouldn't work...

Unless Mark knows otherwise ? ;-)

Just spent a couple of hours coding a threading module, and it seems to run okay given my test app.

Executing:test_01
pi = 3.1415926535921401
count = 1000000000
time = 15396
threaded....
pi = 3.1415926535900098
count = 1000000000
time = 14077

Process complete

I think the difference is calculation is down to me.. so :-p
The threaded bit is running in 4 threads, on a single-core Athlon something or other, in Linux.
I'm hoping I can drop the exe on a 4-processor Box and it shows a bit better oomf.

Still learning about threading tho... first experience of it, so I'm sure I'm making plenty of mistakes.
Here's the test app.

SuperStrict

Framework BaH.Threads
Import BRL.StandardIO

Type config
	Field start:Int
	Field size:Int
End Type

Local numSteps:Int = 1000000000

Global aStep:Double = 1.0 / numSteps

Global sum:Double
Global count:Int

Local startTime:Long = MilliSecs()

For Local i:Int = 0 Until numSteps

	Local x:Double = (i + 0.5) * aStep
	
	sum:+ 4.0 / (1.0 + x * x)
	count:+1
Next

Local myPi:Double = aStep * sum

Print "pi = " + myPi
Print "count = " + count
Print "time = " + (MilliSecs() - starttime)

' Threaded....
Print "threaded...."

sum = 0
count = 0

Local threadCount:Int = 4

Global mutex:TThreadMutex = TThreadMutex.Create()

Local attr:TThreadAttr = TThreadAttr.Create()
attr.setDetachState(PTHREAD_CREATE_JOINABLE)

Local threads:TThread[] = New TThread[threadCount]

startTime = MilliSecs()

For Local i:Int = 0 Until threadCount

	Local tConfig:config = New config
	tConfig.start = i * (numSteps / threadCount)
	tConfig.size = numSteps / threadCount

	threads[i] = TThread.Create(attr, doCalculation, tConfig)

Next

attr.free()

For Local i:Int = 0 Until threadCount
	threads[i].join()
Next

myPi = aStep * sum

Print "pi = " + myPi
Print "count = " + count
Print "time = " + (MilliSecs() - starttime)

End

Function doCalculation:Object(arg:Object)

	Local tConfig:config = config(arg)
	Local mySum:Double
	Local x:Double
	Local myCount:Int

	For Local i:Int = tConfig.start Until tConfig.start + tConfig.size

		x = (i + 0.5) * aStep
	
		mySum:+ 4.0 / (1.0 + x * x)
		myCount:+1
	Next
	
	mutex.lock()
	sum:+ mySum
	count:+ myCount
	mutex.unlock()

	TThread.done()

End Function

Apologies for the mess.. but it's just a basic test.

Brucey

(Posted 2007) [#8]

Oooh.. it runs on OS X too... without any changes..

Executing:test_01
pi = 3.1415926535904264
count = 100000000
time = 15311
threaded....
pi = 3.1415926535896825
count = 100000000
time = 12083

...except that I removed a zero from the loop, cuz it's not a fast CPU and I didn't want to wait all night for it to finish.
Threading is much faster on this G4 in relation to the Athlon. Or it's an OS X thing...

FlameDuck

(Posted 2007) [#9]

Just spent a couple of hours coding a threading module, and it seems to run okay given my test app.

Only with manual Garbage collection, the garbage collector AFAIK can only see objects in the "main" thread - or on a single core CPU where there is no chance of concurrent access to the critical sector.

Threading is much faster on this G4 in relation to the Athlon.

Yes. Obtaining a mutex (in assembly) is much more costly on Intel architecture. RISC - it was the future in the 90's, and for some reason it still is.

Brucey

(Posted 2007) [#10]

no chance of concurrent access to the critical sector

But isn't that what mutex lock/unlock is all about? Stopping concurrent access?

Surely if you build your app properly for threading, the GC shouldn't be an issue?

But I'm no expert, so I'm likely to be talking out of my arse... ;-)

FlameDuck

(Posted 2007) [#11]

The BlitzMAX compiler does not generate code that obtains mutexes, and from what I understand, the GC is unable to track objects created in a thread that isn't the main thread (leading to in the best case weird memory leaks, in the worst case random and arbitrary crashes). Also the debugger is probably going to throw up a few fits of its own.

Surely if you build your app properly for threading, the GC shouldn't be an issue?

Well that depends on how you define "properly". In languages like Java and C# you have a mechanism (a monitor) that lets you define what part of your code is a "critical sector" - wait and notify in Java, lock and unlock (IIRC) in C#. A similar mechanism does not exist in BlitzMAX, and I don't see how such a mechanism can be bolted on (if you'll pardon the expression).

Brucey

(Posted 2007) [#12]

I see what you mean. I guess some thorough testing is in order to see where things go awry.

Interestingly, here are the results for Win32, using the same source :

Executing:test_01.exe
pi = 3.1415926535921401
count = 1000000000
time = 18527
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 16165

Process complete

...on a P4.

If someone with a multi-core Windows could test this, I'm curious to see if it makes any difference : TEST_01 (40k)

Space_guy

(Posted 2007) [#13]

I tried this app on 3 different computers here.

1: Athlon 64 3200
time=18527 threaded=16165

2: Intel P4 3.00Ghz
time=16554 threaded=16492

3: Intel T2500 (early dualcore) 2.00Ghz
time=22216 threaded=15290

Perturbatio

(Posted 2007) [#14]

I ran it three times on my system, it was slower each time for the threaded version:

pi = 3.1415926535921401
count = 1000000000
time = 10578
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 11047


pi = 3.1415926535921401
count = 1000000000
time = 10578
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 11062

pi = 3.1415926535921401
count = 1000000000
time = 10578
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 11125

Also, is there a reason pi is different between each version?

Brucey

(Posted 2007) [#15]

Also, is there a reason pi is different between each version?

I imagine, because the non-threaded code is in one loop, and the other is effectively 4 separate loops, from which one incurs a rounding error. Probably.

Wasn't too fussed about the calculation tho... more with the concept :-)

splinux

(Posted 2007) [#16]

Nice, i got:
pi=3.1415926535921401
count..
time=12519
threaded....
pi=3.1415926535898211
count..
time=5982

AMD X2 4200+ (dual core) with 1.5Gb of ram.

ImaginaryHuman

(Posted 2007) [#17]

So where we're seeing faster times for threaded, that means the two threads are running on separate cores simultaneously?

Brucey

(Posted 2007) [#18]

It has to be, especially with splinux's example. Otherwise I think the times would have to be fairly similar - having to be allocated shared time on a single core...

splinux

(Posted 2007) [#19]

Threads, if not too many, can improve performance even on a single core.
However, i got about 2x speed, which seems to be exactly due to the double core architecture.

Torrente

(Posted 2007) [#20]

The application closes right after it is finished... too fast for me to see the threaded result.

xlsior

(Posted 2007) [#21]

The application closes right after it is finished... too fast for me to see the threaded result.

Launch it in a DOS box, and the results will remain on the screen.

My results, on a Core 2 Duo E6600 (2.4GHz):

pi = 3.1415926535921401
count = 1000000000
time = 15528
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 12735

So slightly faster, but not by much. :-?
and it looks like whatever calculations are happening here is much faster on the AMD, looking at the results for the 4200+ mentioned above.

Something a bit odd:

If I open the task manager and view the CPU usage history, it's next to nothing before running the program. then in single-threaded mode, XP shows about 50% usage on both cores rather than the expected 100% usage on a single core.
Once the second stage kicks in, both cores go to 100%.

splinux

(Posted 2007) [#22]

AMDs are way better than Intels.
I buy only them.
:)

splinux

(Posted 2007) [#23]

wrong post..

xlsior

(Posted 2007) [#24]

AMDs are way better than Intels.
I buy only them.

Before the core 2 Duo's came out: yes, AMD beat the pants off of Intel. These days, though in most situations the intels beat AMD again. Not always as this test shows, but in general Intel is in the lead.

FlameDuck

(Posted 2007) [#25]

BTW Brucey, if you're dead serious about this, I recommend you read up on some of the issues with multi-threading, including but not limited to the Dining Philosophers Problem.

Dreamora

(Posted 2007) [#26]

If you want faster speed, recompile with current MingW / GCC.
The calculation speed will raise drastically

BMs sub C speed is caused due to the totally outdated compiler and the missing processor capability support of P3 and newer.

xlsior

(Posted 2007) [#27]

If you want faster speed, recompile with current MingW / GCC. The calculation speed will raise drastically

Any pointers on how one goes at doing that?

FlameDuck

(Posted 2007) [#28]

Just download and install the newest MingW release from their website and rebuild all modules. Then wave goodbye to MaxGUI.

xlsior

(Posted 2007) [#29]

Just download and install the newest MingW release from their website and rebuild all modules. Then wave goodbye to MaxGUI.

Hm.. Guess that isn't a real option at this time, then. Hopefully BRL will make the switch on their end at some point. :-/

Dreamora

(Posted 2007) [#30]

You don't need to wave goodbye to maxgui. Just do not recompile it. So simply move out, recompile, move in again.

xlsior

(Posted 2007) [#31]

You don't need to wave goodbye to maxgui. Just do not recompile it. So simply move out, recompile, move in again.

OK, I might give that a shot.

I do wonder though, why BRL is still tied into the old MinGW version.

splinux

(Posted 2007) [#32]

@ Brucey:
The example you posted uses the BaH.threads module.
Where could i get it?
(I coded a thread module based on PThreads too, but I had some problems with mutexes)

Brucey

(Posted 2007) [#33]

It's just a demonstration at the moment. Something I threw together in a couple of hours to see for myself how threads might fit into the BlitzMax framework.

There are issues with the Max GC that I'm considering having a go at - namely that the GC uses a lot of static variables that would need wrapped up in locks/unlocks so that counts are correct all the time, etc...

Still, if you want to thread c/c++ stuff underneath BlitzMax, I'm sure that could work...

splinux

(Posted 2007) [#34]

Ok.

Dreamora

(Posted 2007) [#35]

why BM uses outdated MingW: Because MaxGUI seems to use the same classes for GUI as MingW itself.
thats the reason you get the name dublication error on compile ...

But we can hope that this changes as skid is rewritting maxgui on windows and hopefully will rename the gadget classes so we are not fixed anymore to unoptimized compilers.

DStastny

(Posted 2007) [#36]

Hey Brucey,

There are issues with the Max GC that I'm considering having a go at - namely that the GC uses a lot of static variables that would need wrapped up in locks/unlocks so that counts are correct all the time, etc...

I looked at this almost a year ago but found that even if you fixed the GC code it still wont work. The reference count manipulation on global variables is emitted by the compiler and it is not accessible to create locking mechanisms around the reference manipulations. The other problem is local variables live on the stack context of the thread. So the stack manipulations have to be thread aware. That is more solvable than the reference counting.

I think Mark would have done this long ago if it wasn't such a systemic problem. You are looking at massive performance hit with all the locking. In fact there is a fragment in the GC_Retain macro that does interlock increment/decrement on i386.

This is one of the reasons i have currently shelved BMAX. If I wanted to use threads, and i do, I had to drop to C/C++.

Maybe if someday it gets a thread safe implementation and a real debugger... DEBUGSTOP???? in 2007. Just to start my laundry list :)

Anyway not to take your thread off track you have implemented some great stuff.

I have used in C/C++ this GC as a replacement as it doesnt use reference counting so the code emitted by compiler could be ignored. http://www.hpl.hp.com/personal/Hans_Boehm/gc/

This is the one used by the D language as well as Mono the Open source .net implementation. Pretty fast but I think Max will have trouble with MAX programming style needing to take into count a thread safe GC. Marks implementation is tuned for specific programming style.

Good luck
Doug

FlameDuck

(Posted 2007) [#37]

You also need a method of preventing deadlocks.

degac

(Posted 2008) [#38]

@brucey

Just jumped in this thread looking for something different, but your module about threading is still in developement?

JoshK

(Posted 2008) [#39]

Is it possible to perform a call GCSuspend() during thread execution, and then call GCCollect() when the threads sync?

Kurator

(Posted 2008) [#40]

As far as I unterstand, you can do that, but as mentioned above - gc only notices objects that are created in the mainthread. So if any of your child-threads instances some objects, gc will never see them and you have a memory leak.

Ian Thompson

(Posted 2008) [#41]

Simple programs like this may work but eventually the underlying thread unsafe nature of the BM compiler would probably bring down larger programs, any automatically generated/destroyed data storage that BM compiler generates for example would need automatic locking to work between threads.

I would love to see threads but I fear there is only one guy who can implement them safely.

Muttley

(Posted 2008) [#42]

For the sheer hell of it (I know this is an old thread), here are the results from my Quad Core machine running Vista:

pi = 3.1415926535921401
count = 1000000000
time = 16083
threaded....
pi = 3.1415926535898211
count = 1000000000
time = 6817

The threaded version maxed out all 4 cores when it was running, the unthreaded didn't even max out one.

jtfrench

(Posted 2008) [#43]

Have you guys heard of OpenCL? It will be coming out with Snow Leopard (the latest version of Max OS X) and will address the issue of delegating threads throughout various cores and even GPUs. Apple is proposing it as an open standard, so if gets adopted, we could see this as a cross-platform solution relatively soon.

Brucey

(Posted 2008) [#44]

Yeah, it sounds great, and hopefully will be fairly transparent too - ie. you don't need to specifically code for it, and it falls back onto the CPU if the system can't support it.