Crash on mac and windows in memcpy with GC?


ima747(Posted 2010) [#1]
There's been a gremlin in my code for a long time that I can't ignore any longer.

When a memcpy occurs (most notably when converting a pixmap's format) I sometimes get a crash. This seems to happen only with a multithreaded compile: before I moved my project to MT I didn't have this problem, and nothing related to the texture creation has changed. However, it doesn't just happen in child threads; the crash will happen on the primary thread as well. Here's a snippet of an example crash log:

Thread 0 Crashed: Dispatch queue: com.apple.main-thread
0 libSystem.B.dylib 0xffff1250 __longcopy + 80
1 libSystem.B.dylib 0xffff0876 __memcpy + 214
2 libGLImage.dylib 0x93dcfc93 glgProcessPixelsWithProcessor + 725
3 GLEngine 0x1368cd0a gleTextureImagePut + 1433
4 GLEngine 0x1368a490 glTexImage2D_Exec + 1427
5 libGL.dylib 0x914c245f glTexImage2D + 87
...

That crash occurred while a texture was being generated from a pixmap. Similar crashes occur when converting a pixmap's format.

It seems largely connected to the garbage collector, as if I put a GCCollect right before the copy it tends to crash more frequently. Additionally when I wrap the function that the copy will happen in with GCSuspend and GCResume it tends to happen less... but doesn't stop completely (perhaps a collect is already running when the suspend is called which doesn't get interrupted?).

I tried switching the garbage collector to manual mode, but then I started getting hangs...

I'm rather confused and am pretty much out of ideas. Any thoughts or suggestions?

This also seems to happen most often with pixmaps I get from Brucey's FreeImage mod, but I can't confirm that it's just those pixmaps (and once they're BlitzMax pixmaps the source shouldn't matter anyway...)
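
For anyone wanting to try the GCSuspend/GCResume wrapping mentioned above, here is a minimal sketch. The helper name ConvertSuspended and the test pixmap are made up for illustration; GCSuspend, GCResume, CreatePixmap and TPixmap.Convert are the standard BlitzMax calls. As noted above, this only made the crash less frequent, so treat it as an experiment rather than a fix.

```blitzmax
SuperStrict

' Hypothetical helper (not from the project): run a pixmap conversion
' with the garbage collector suspended for the duration of the copy.
Function ConvertSuspended:TPixmap(src:TPixmap, fmt:Int)
	GCSuspend() ' hold off collections while the pixel data is copied
	Local result:TPixmap = src.Convert(fmt)
	GCResume() ' allow collections again
	Return result
End Function

' Illustrative usage with a blank pixmap instead of a loaded picture.
Local pixm:TPixmap = CreatePixmap(256, 256, PF_RGB888)
Local converted:TPixmap = ConvertSuspended(pixm, PF_RGBA8888)
If converted.format = PF_RGBA8888 Then Print "converted to PF_RGBA8888"
```

A collection that has already started on another thread when GCSuspend is called may not be affected, which would be consistent with the crash becoming rarer but not disappearing.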


slenkar(Posted 2010) [#2]
Are you doing the memcpy in the main thread?
(Just an idea, I don't know if it really affects anything.)


ima747(Posted 2010) [#3]
Doesn't matter where it happens, main or child. That crash above is specifically in the main (I figured it would be easier to manage things if it's in the main)


markcw(Posted 2010) [#4]
I would revert the project to single-threaded if possible...

It seems like a rare MT bug. It would be best if you could post simplified code that reproduces it.


ima747(Posted 2010) [#5]
Working on finding the time to punch up a simplified example, but haven't found it yet. It's especially difficult since it's not an every-time type of bug, but a when-the-stars-align-and-therefore-the-memory-doesn't type...

Doing more testing on the PC, I can confirm it's exactly the same crash; specifically, it's in ConvertPixelsToStdFormat in ConvertPixels in Convert on a pixmap. Interestingly, running it on the Mac with the new 1.40 release and the MT debugger, I get an array out of bounds exception when it crashes. Combined with where it crashes (pixel.bmx, line 107), this appears to support my suspicion that under some circumstances the garbage collector (or something) shifts a memory block while it's being copied, which in turn puts the array out of whack and boom, crash.

Once again this happens on the primary as well as child threads on a MT app.

I can't revert the project to single threaded as there are some things that just aren't practical in a single thread and they're critical to my program (specifically background loading of pictures which can take a long time for a single large picture and I need to churn through LOTS while doing other things...)

I still suspect the garbage collector since it's the most likely thing to be causing a block of memory to get shuffled about...

I will try to punch up a simplified example and post in the bug reports but until then if anyone has any ideas I'd love to give them a shot...


Brucey(Posted 2010) [#6]
I would also suspect the GC... is it also possible that some gfx memory is being GC'd causing the GL memcopy to crash on occasion?
I vaguely remember there were some issues with the GC and OGL in places... can't remember if they were fixed - or if a particular fix has a knock-on effect.


ima747(Posted 2010) [#7]
I think the OGL connection is likely just coincidental, as I get the same crash with a straight TPixmap conversion or copy. In the posted crash the memory just happens to be copied to OpenGL rather than to another pixmap.

That said I would be interested in the GC/OGL connection as perhaps there's something that can be gleaned related to this...


ima747(Posted 2010) [#8]
Here's a little sample, it's not exactly the same crash I'm seeing, but I think it's probably the same root cause... This is crashing on my mac as soon as I launch it.

SuperStrict


Function ConvertPicture:Object(in:Object) ' function to be spawned in a child thread
	Local pixm:TPixmap = LoadPixmap("sample.jpg") ' load a pixmap, the larger the picture the better
	If(pixm.format = PF_RGBA8888) Then Print "already PF_RGBA8888"
	Local anotherpixm:tpixmap = pixm.Convert(PF_RGBA8888) ' do the format conversion, crash could happen in here...
	Local yetanotherpixm:TPixmap = anotherpixm.copy() ' do a copy, this could also crash. This uses up more memory for yet more cleanup
	
	Return yetanotherpixm ' return value to let it be stored in ram for a bit
End Function



Local onConversion:Int = 1
Local convertThread:TThread = CreateThread(ConvertPicture, Null)
Print "Starting first conversion"

' loop forever (there is no exit condition; stop the program manually)
While(True)
	Local aPixm:object = ConvertPicture(Null) ' do a copy on the main thread as well for some memory retention and more ram thrashing
	
	Print GCMemAlloced() + " collected " + GCCollect() ' thrash the garbage collector to try to provoke a crash
	If(Not ThreadRunning(convertThread)) ' if the thread is done
		convertThread = CreateThread(ConvertPicture, Null) ' start it again
		onConversion:+1
		Print "conversion " + onConversion
	End if
Wend


Specifically, it crashes when the main thread also loads the picture; without that it ran for a while without incident, but I will comment that out and let it run longer to see if I can get the exact same crash.


ima747(Posted 2010) [#9]
Had a power failure which set back the testing a bit. After recovering, if I run with the main thread's ConvertPicture and GCCollect calls removed, it crashes right away in debug mode... the main thread is doing a GCResume for some reason and the child is creating a new pixmap... however, in non-debug mode it seems to run just fine...

still very confusing

update:
if you call GCCollect too fast it seems like a mutex that blocks simultaneous GCCollect calls gets stuck and the app will just idle out... definitely something wacky going on with the garbage collector in MT


ima747(Posted 2010) [#10]
With the debugger enabled I get a recursive GC collect that seems to lock up the memory system. Doesn't happen without debug on... there's definitely some issues with the MT garbage collector.


ima747(Posted 2010) [#11]
I've opened a bug report thread at http://www.blitzbasic.com/Community/posts.php?topic=91117 in the hopes of getting some exposure to someone more intimately familiar with the threading and GC systems, as from my perspective at least they're turning into quite a rat's nest as I dig in.

Still desperate for any ideas or suggestions of things to try.

Also curious can anyone else reproduce crashing or hanging on the sample in debug or regular mode? At this point I just want to know if I've gone totally insane or just partially.


jondecker76(Posted 2010) [#12]
I'm successfully using MT in my applications and may be able to help

The sample you provided seems oddly formed to me and not a very good real-world example. For example, your "thread" is continually called like a function, which doesn't really provide a big advantage. I also find it odd that both your thread and main thread are constantly calling the same block of code; again, not a very good real-world scenario.
I'd be interested in seeing a better example that more closely resembles what is happening in your real application.

On a side note, I have noticed some odd crashes with MT in cases where the existence of the thread was very short, or the life of a locked mutex was extremely short. Maybe try putting a small delay of 20ms or so at the end of the thread function and see if it improves.


ima747(Posted 2010) [#13]
The example is merely to demonstrate that there's an underlying problem, not to illustrate my usage. The reason the same block is called from the thread and the main thread is simply to abuse the memory faster, and I didn't want to write 2 functions. I've done a lot of playing with the example as well (such as putting the load outside the child thread and just doing converts, or making the child thread just loop converts forever so it's not constantly being relaunched, removing the main thread function call, etc.). Sometimes things work, and then I'll run the same example with debug on and it will crash. Also, if you move the GCCollect call around you will get different results. There's a fundamental problem, since various more or less appropriate applications of multithreading will trigger it.

Regarding the delay, I can get crashes when the main and child threads have both been running for extended periods. However, under some circumstances I can create a hang when 2 things appear to be racing to free at the same time. I believe this is related to the garbage collector taking an application lock, perhaps when the application is already busy locking for a free... This is why I started the support thread: there's a lot of locking of various things in the core of the GC and it's all tangled up, and on top of that I think there's a problem like the one you mentioned with locking/unlocking too fast.

The real-world scenario (I haven't made a simplified example yet as it's VERY embedded in my program's flow) is that a display starts, and a child thread is spawned to load pictures for use in the display (using FreeImage, to be precise, so no, it's not related to the graphics system only being accessible from the main thread). Sometimes everything works flawlessly. Sometimes it will crash right away, sometimes it will crash after processing 50 pictures, etc. It's very random...

Thank you for the feedback, I'll try peppering some things with delays and see if that has any effect.


jondecker76(Posted 2010) [#14]
I'm at work right now, but now that I think about it, I also have a piece of code involving some pixmap manipulation that I can get to run great, or to crash randomly, depending on where I lock and unlock a mutex. I'll look at that piece of code tonight when I get home and see if we have some similarities.


ima747(Posted 2010) [#15]
I will be in your debt just for looking, Jon. I've got a serious case of the crazies from this and it's pretty vital I get it sorted out...

Here's a process sample from when I get what I suspect is the double lock. I sent a different one to Brucey the other day to have a look at, and I believe there are some differences between the 2 (which again would imply that randomly too many/too fast locks = problems).

Call graph:
    2435 Thread_100498   DispatchQueue_1: com.apple.main-thread  (serial)
      2435 start
        2435 _start
          2435 main
            2435 -[NSApplication run]
              2435 -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
                2435 _DPSNextEvent
                  2435 AEProcessAppleEvent
                    2435 aeProcessAppleEvent
                      2435 dispatchEventAndSendReply(AEDesc const*, AEDesc*)
                        2435 aeDispatchAppleEvent(AEDesc const*, AEDesc*, unsigned long, unsigned char*)
                          2435 _NSAppleEventManagerGenericHandler
                            2435 -[NSAppleEventManager dispatchRawAppleEvent:withRawReply:handlerRefCon:]
                              2435 -[NSApplication(NSAppleEventHandling) _handleCoreEvent:withReplyEvent:]
                                2435 -[NSApplication(NSAppleEventHandling) _handleAEOpen:]
                                  2435 -[NSApplication _sendFinishLaunchingNotification]
                                    2435 -[NSApplication _postDidFinishNotification]
                                      2435 -[NSNotificationCenter postNotificationName:object:]
                                        2435 -[NSNotificationCenter postNotificationName:object:userInfo:]
                                          2435 _CFXNotificationPostNotification
                                            2435 __CFXNotificationPost
                                              2435 _nsnote_callback
                                                2435 run
                                                  2435 4
                                                    2435 415
                                                      2435 802
                                                        2435 639
                                                          2435 132
                                                            2435 54
                                                              2435 _brl_system_TMacOSSystemDriver_Poll
                                                                2435 updateEvents
                                                                  2435 -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
                                                                    2435 _DPSNextEvent
                                                                      2435 BlockUntilNextEventMatchingListInMode
                                                                        2435 ReceiveNextEventCommon
                                                                          2435 RunCurrentEventLoopInMode
                                                                            2435 CFRunLoopRunInMode
                                                                              2435 CFRunLoopRunSpecific
                                                                                2435 __CFRunLoopRun
                                                                                  2435 __CFRunLoopDoObservers
                                                                                    2435 CFQSortArray
                                                                                      2435 CFSortIndexes
                                                                                        2435 malloc_zone_memalign
                                                                                          2435 szone_memalign
                                                                                            2435 szone_malloc_should_clear
                                                                                              2435 tiny_malloc_from_free_list
                                                                                                2435 tiny_free_list_add_ptr
                                                                                                  2435 _sigtramp
                                                                                                    2435 semaphore_wait_trap
    2435 Thread_100499   DispatchQueue_2: com.apple.libdispatch-manager  (serial)
      2435 start_wqthread
        2435 _pthread_wqthread
          2435 _dispatch_worker_thread2
            2435 _dispatch_queue_invoke
              2435 _dispatch_mgr_invoke
                2435 kevent
    2435 Thread_100503
      2435 thread_start
        2435 _pthread_start
          2435 threadProc
            2435 _brl_threads_TThread__EntryStub
              2435 bb_ThreadedPrepareElements
                2435 191
                  2435 532
                    2435 141
                      2435 bbGCCollect
                        2435 collectMem
                          2435 343
                            2435 842
                              2435 bmx_freeimage_delete
                                2435 free
                                  2435 __spin_lock


Thread 1 seems to be handling the event queue, and locking and freeing junk as a result of mucking about.
Thread 2 you always get in threaded apps; it seems to be the thread manager as best as I can tell...
Thread 3 is my child thread (note, just 1 child thread at this point) trying to do cleanup after it's done with a FreeImage. The FreeImage is in its delete method, which calls free on its allocated memory block. That's halting (I assume) to wait for the main thread to get done freeing things... which it won't, because (again I assume) it's been confused by the child thread trying to free things.

And yet again, just for the record, this is just one manifestation in one program.


ima747(Posted 2010) [#16]
I literally COVERED the suspected problem areas with Delay(20)'s and it seems to not hang (usual disclaimer with randomish crashes etc.)... I think you're very much on to something with the high speed lock/unlock causing problems, and that feeds back to my theory that the GC problem could actually be a thread control issue (i.e. the threads locking/unlocking)...

Hope! there is hope!


jondecker76(Posted 2010) [#17]
It was the same case in a project of mine. I purposely had to make my Lock/Unlock take longer than it should. If I remember right, here is what I did: (pseudo)

lockMutex(imageMutex)
thisPixmap=GetAPixmap()'external function
unlockMutex(imageMutex)
thisImage=LockPixmap(thisPixmap)
'The above code would randomly crash from 30 seconds to 2 minutes into running



Then, to force the time between LockMutex and UnlockMutex to be longer,
I simply kept the mutex locked until thisImage was created...
lockMutex(imageMutex)
thisPixmap=GetAPixmap()'external function
thisImage=LockPixmap(thisPixmap)
unlockMutex(imageMutex)
'This time, the above code works crash-free (and I've even let it run overnight)
'and the only difference is the location of UnlockMutex


Anyways, the above example is how I got my code to run absolutely crash free


ima747(Posted 2010) [#18]
Thanks! So far so good with a Delay(20) added before a manual GCCollect() call, which itself comes after resuming the garbage collector (I had problems with the collector running while doing some of the copies, specifically in child threads). I think this is also preventing too many lock/unlock cycles on some mutexes... I'll need more poking and testing to verify, but this is the first positive progress I've seen on this problem in a long time so I'm quite optimistic!


ima747(Posted 2010) [#19]
another sample

SuperStrict



Global theMutex:TMutex = CreateMutex()
Global counter:Int = 0

Function tfunc:Object(in:Object)
	While(True)
		LockMutex(theMutex)
		counter:+1
		Local pixm:TPixmap = CreatePixmap(2048, 2048, PF_RGBA8888)
		UnlockMutex(theMutex)
	Wend
End Function

CreateThread(tfunc, Null)

Print "starting"
While(True)
	LockMutex(theMutex)
	counter:+1
	UnlockMutex(theMutex)
	If(counter >= 10000000)
		Print MilliSecs()
		counter = 0
	End If
Wend

Tossed that up on my PC while trying some stuff; it crashes right away on the CreatePixmap in the child thread with an access violation while trying to alloc the memory.


jondecker76(Posted 2010) [#20]
Compiled on Linux, your example above also crashes with a segmentation fault. But to further prove a point, add a simple delay in the thread and presto!
SuperStrict



Global theMutex:TMutex = CreateMutex()
Global counter:Int = 0

Function tfunc:Object(in:Object)
	While(True)
		LockMutex(theMutex)
		counter=counter+1
		
		Local pixm:TPixmap = CreatePixmap(2048, 2048, PF_RGBA8888)
		UnlockMutex(theMutex)
		Delay(100)
	Wend
End Function

CreateThread(tfunc, Null)

Print "starting"
While(True)
	LockMutex(theMutex)
	counter=counter+1
	UnlockMutex(theMutex)
	If(counter >= 10000000)
		Print MilliSecs()
		counter = 0
	End If
Wend




ima747(Posted 2010) [#21]
I'm having great success with a bunch of delays peppered around. No more hangs and no crashes, however it does cause the application to leak like a sieve... it did this some other times when messing around with auto vs/manual GC... I'm not sure where it comes from but it's related as the memory is totally fine without delays but it will either crash or hang sooner or later. With delays no crash or hang but it will leak and leak until it chokes...

At this point I'll take the leaks over the crashing but still something to get worked out...

Still grinding


jondecker76(Posted 2010) [#22]
I have no problems with auto GC in my threaded applications. I remember you mentioned that you modified the GC code and now run it manually. Now that you have injected some delays in your thread, you may find that restoring the original GC code works just fine for you and gets rid of your memory leak.


ima747(Posted 2010) [#23]
I restored the GC code before starting with the delays (on the theory that by that point I'm sure I'd broken something). I've noticed the leaking in the past under certain circumstances. I think I may try modifying the GC again to see if that cleans up some of the leaking.


jondecker76(Posted 2010) [#24]
You aren't by chance using MaxGUI in your thread, are you? I only mention this because you could create a memory leak by not calling FreeGadget()...


ima747(Posted 2010) [#25]
MaxGUI is used earlier in my program, but not in any child threads, and is totally shut down by the time I get to the part that runs for a while and leaks.

I'm going to look back over my code and see if I can narrow down what object(s) are leaking, maybe there's a free that's getting missed somewhere due to my structure.


marksibly(Posted 2010) [#26]
Hi,

I've found one issue to do with allocating lots of large un-GCed memory - eg: the way pixmap does.

Can you give this a try - it at least fixes the above!

http://www.blitzbasic.com/tmp/blitz.mod.zip

Replace your existing mod/brl.mod/blitz.mod folder with this 'un.


ima747(Posted 2010) [#27]
I've been making lots of workarounds, I'll pull as many out as I can and give this a go right now. Thanks mark!


ima747(Posted 2010) [#28]
So far so good on Mac and PC. I am noticing the occasional slight delay (half a second or so), sometimes right about when I would expect a large free to be happening (such as right about when I would expect my program to release all contact with a large pixmap). Is this likely to be a result of the new changes or just my imagination? It's not a deal breaker (I mean, I am dealing with LARGE chunks of memory so I should expect some things to take a little time), just curious if that's a sign of the new code kicking in.


ima747(Posted 2010) [#29]
Seems better than before, however it will still crash or hang if 2 allocs happen at the same time, and possibly one triggers the collector...

Related: I've been toying with turning off the auto collector so I can control when the collects happen (so I know an alloc isn't taking place). Whenever allocs will happen I lock a mutex, I then call GCCollect() whenever the mutex isn't locked in my main loop. This seems to work from a stability standpoint (as long as I don't miss any allocs with my mutex lock) but it creates a pause that grows in duration (especially on PC, but mac as well) the longer my program runs. I further set it so it only ran a GCCollect() once per second in the main loop, if the mutex wasn't locked, and it was perfectly smooth on the PC to start, I came back about 20 minutes later and there was about a 1/4 second pause once per second...
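
Roughly, the arrangement described above looks like the sketch below. It is a cut-down illustration, not my actual code: the names allocMutex and AllocWorker, the pixmap size and the loop counts are made up, and it needs a threaded build. The worker locks the mutex around its allocations, and the main loop only collects, at most once per second, when TryLockMutex succeeds.

```blitzmax
SuperStrict

GCSetMode(2) ' mode 2 = manual collection

' Hypothetical name: every allocation-heavy section locks this mutex.
Global allocMutex:TMutex = CreateMutex()

Function AllocWorker:Object(in:Object)
	For Local i:Int = 1 To 20
		LockMutex(allocMutex)
		Local pixm:TPixmap = CreatePixmap(1024, 1024, PF_RGBA8888) ' large allocation
		UnlockMutex(allocMutex)
		Delay(10)
	Next
	Return Null
End Function

Local loader:TThread = CreateThread(AllocWorker, Null)
Local lastCollect:Int = MilliSecs()

While ThreadRunning(loader)
	If MilliSecs() - lastCollect >= 1000 ' collect at most once per second
		If TryLockMutex(allocMutex) ' only when no guarded alloc is in progress
			GCCollect()
			UnlockMutex(allocMutex)
			lastCollect = MilliSecs()
		End If
	End If
	Delay(1)
Wend
WaitThread(loader)
```

Note that this only protects allocations that are actually wrapped in the mutex; anything allocated outside it (including by the main loop itself) is not covered.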

[Update]
Here's a sample of my application locking up due to 2 allocs at the same time... The main thread is trying to alloc an object, which triggers a GCCollect, which tries to alloc an object in the collection process, and ends in a spin lock. Thread 2 is trying to alloc an object, which causes the GC to try to lock the collector mutex and wait.

Call graph:
    2367 Thread_179469   DispatchQueue_1: com.apple.main-thread  (serial)
      2367 start
        2367 _start
          2367 main
            2367 -[NSApplication run]
              2367 -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
                2367 _DPSNextEvent
                  2367 AEProcessAppleEvent
                    2367 aeProcessAppleEvent
                      2367 dispatchEventAndSendReply(AEDesc const*, AEDesc*)
                        2367 aeDispatchAppleEvent(AEDesc const*, AEDesc*, unsigned long, unsigned char*)
                          2367 _NSAppleEventManagerGenericHandler
                            2367 -[NSAppleEventManager dispatchRawAppleEvent:withRawReply:handlerRefCon:]
                              2367 -[NSApplication(NSAppleEventHandling) _handleCoreEvent:withReplyEvent:]
                                2367 -[NSApplication(NSAppleEventHandling) _handleAEOpen:]
                                  2367 -[NSApplication _sendFinishLaunchingNotification]
                                    2367 -[NSApplication _postDidFinishNotification]
                                      2367 -[NSNotificationCenter postNotificationName:object:]
                                        2367 -[NSNotificationCenter postNotificationName:object:userInfo:]
                                          2367 _CFXNotificationPostNotification
                                            2367 __CFXNotificationPost
                                              2367 _nsnote_callback
                                                2367 run
                                                  2367 4
                                                    2367 1322
                                                      2367 2422
                                                        2367 77
                                                          2367 666
                                                            2367 809
                                                              2367 278
                                                                2367 _sidesign_minib3d_TEntity_MoveEntity
                                                                  2367 bbObjectNew
                                                                    2367 bbGCAllocObject
                                                                      2367 allocMem
                                                                        2367 collectMem
                                                                          2367 353
                                                                            2367 876
                                                                              2367 _bah_freeimage_TBPHolder_Create
                                                                                2367 bbObjectNew
                                                                                  2367 bbGCAllocObject
                                                                                    2367 __spin_lock
    2367 Thread_179470   DispatchQueue_2: com.apple.libdispatch-manager  (serial)
      2367 start_wqthread
        2367 _pthread_wqthread
          2367 _dispatch_worker_thread2
            2367 _dispatch_queue_invoke
              2367 _dispatch_mgr_invoke
                2367 kevent
    2367 Thread_179481
      2367 thread_start
        2367 _pthread_start
          2367 threadProc
            2367 _brl_threads_TThread__EntryStub
              2367 bb_ThreadedPrepareElements
                2367 190
                  2367 539
                    2367 brl_filesystem_StripDir
                      2367 bbStringSlice
                        2367 bbStringNew
                          2367 bbGCAllocObject
                            2367 pthread_mutex_lock
                              2367 new_sem_from_pool
                                2367 _sigtramp
                                  2367 semaphore_wait_trap



ima747(Posted 2010) [#30]
I'm a bit confused by this now... seems to be the last lingering problem with my current structure.

The garbage collector is in mode 2 (manual). The main thread has locked a mutex through TryLockMutex() that controls if the garbage collector is allowed to be called. Since it succeeded, it calls GCCollect() (translates to bbGCCollect) and that calls collectmem, then something, then it calls pthread_detach, which calls pthread_join, and then a spin lock...

The child thread is waiting for the garbage collector mutex to unlock so it can continue with its task, and seems to be waiting patiently like it should...

What's up with the detach and joins?

Call graph:
    2315 Thread_323551   DispatchQueue_1: com.apple.main-thread  (serial)
      2315 start
        2315 _start
          2315 main
            2315 -[NSApplication run]
              2315 -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
                2315 _DPSNextEvent
                  2315 AEProcessAppleEvent
                    2315 aeProcessAppleEvent
                      2315 dispatchEventAndSendReply(AEDesc const*, AEDesc*)
                        2315 aeDispatchAppleEvent(AEDesc const*, AEDesc*, unsigned long, unsigned char*)
                          2315 _NSAppleEventManagerGenericHandler
                            2315 -[NSAppleEventManager dispatchRawAppleEvent:withRawReply:handlerRefCon:]
                              2315 -[NSApplication(NSAppleEventHandling) _handleCoreEvent:withReplyEvent:]
                                2315 -[NSApplication(NSAppleEventHandling) _handleAEOpen:]
                                  2315 -[NSApplication _sendFinishLaunchingNotification]
                                    2315 -[NSApplication _postDidFinishNotification]
                                      2315 -[NSNotificationCenter postNotificationName:object:]
                                        2315 -[NSNotificationCenter postNotificationName:object:userInfo:]
                                          2315 _CFXNotificationPostNotification
                                            2315 __CFXNotificationPost
                                              2315 _nsnote_callback
                                                2315 run
                                                  2315 4
                                                    2315 1322
                                                      2315 2422
                                                        2315 77
                                                          2315 bbGCCollect
                                                            2315 collectMem
                                                              2315 244
                                                                2315 pthread_detach
                                                                  2315 pthread_join$NOCANCEL$UNIX2003
                                                                    2315 __spin_lock
    2315 Thread_323552   DispatchQueue_2: com.apple.libdispatch-manager  (serial)
      2315 start_wqthread
        2315 _pthread_wqthread
          2315 _dispatch_worker_thread2
            2315 _dispatch_queue_invoke
              2315 _dispatch_mgr_invoke
                2315 kevent
    2315 Thread_323610
      2315 thread_start
        2315 _pthread_start
          2315 threadProc
            2315 _brl_threads_TThread__EntryStub
              2315 bb_ThreadedPrepareElements
                2315 183
                  2315 549
                    2315 _bb_TElement_init
                      2315 135
                        2315 brl_threads_LockMutex
                          2315 _brl_threads_TMutex_Lock
                            2315 pthread_mutex_lock
                              2315 new_sem_from_pool
                                2315 _sigtramp
                                  2315 semaphore_wait_trap



marksibly(Posted 2010) [#31]
Hi,

Unless you post some more runnable code, I'm afraid there's not much I can do - stack traces aren't particularly useful in these cases, as with threading the problem may have already occurred long before the crash.

Have you tried running the app with plain old auto-GC enabled?

There's a chance that if you've disabled GC and the app needs to allocate memory and can't it'll just fail and BANG - esp. with large allocations as I suspect your app is using.


ima747(Posted 2010) [#32]
Auto GC causes many, many more crashes, as it will quite often fire while something is allocating, and then it dies. The reason I've switched back to manual GC is that I can control when the collect happens, and therefore be sure that no child threads are busy allocating anything (through the use of a mutex).

I'm still working on trying to punch up an example, but without much success, as even in my sprawling project it doesn't happen reliably so it's very hard to narrow down what/where/when/how/why something is going wrong. The only commonality I notice (as illustrated by the traces) is that problems are always within an alloc or free, and are much much much more prevalent if memory is being handled in 2 places at once (such as an alloc in the main and child threads at the same time).

I was experiencing some problems with semaphores a while ago as well, which caused me to abandon them as a means of restricting simultaneous access. I'll see if I can re-create that problem with some sample code, as perhaps that will be easier than my current flow.

I don't think there's an allocation space issue: if I disable the collector altogether (just to see), it will run up to around 1 GB allocated before anything bad starts to happen, whereas it usually runs at around 60-260 MB with manual collection, and on auto it will sometimes spike up to about 400 MB before collecting. So there should be plenty of headroom. I tend to collect roughly every tenth of a second (assuming nothing is blocking the collect), so the pool never grows; it collects after every large alloc/free (not guaranteed due to timing, but it should never get past 2 large alloc/frees). It runs in a loop with the same content, usually for hours (6+) without any problems, and sometimes it will choke and die within minutes.

Will try to get more sample code for you, just particularly curious what the "2315 pthread_join$NOCANCEL$UNIX2003" trace meant, and also why it's detaching/joining in the collect cycle.
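
For anyone following along, the manual-collection scheme described in this post could be sketched roughly as below. This is only an illustration under assumptions, not the actual project code: allocMutex, Worker, the buffer size and the loop counts are all made-up names and values, and it needs a threaded build. It also only protects allocations that are explicitly wrapped in the mutex; anything that allocates implicitly (strings, arrays, objects created inside library calls) is not covered, so it may reduce the problem without removing it.

```blitzmax
SuperStrict

' Hypothetical sketch: allocMutex, Worker and all counts are illustrative only.
Global allocMutex:TMutex = CreateMutex()

' Worker thread: holds the mutex while doing a large allocation.
Function Worker:Object(data:Object)
	For Local i:Int = 0 Until 100
		LockMutex(allocMutex)
		Local buf:Byte[] = New Byte[1024 * 1024] ' stand-in for a large pixmap allocation
		UnlockMutex(allocMutex)
		Delay 1
	Next
	Return Null
End Function

GCSetMode 2 ' 2 = manual collection

Local thread:TThread = CreateThread(Worker, Null)

For Local frame:Int = 0 Until 100
	' Collect only when no worker currently holds the allocation mutex.
	If TryLockMutex(allocMutex)
		GCCollect()
		UnlockMutex(allocMutex)
	EndIf
	Delay 10
Next

WaitThread(thread)
Print "done"
```

Using TryLockMutex instead of LockMutex means the main loop skips a collect rather than stalling when a worker is busy, which matches the "assuming there's nothing blocking the collect" behaviour described above.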


jondecker76(Posted 2010) [#33]
I am also still having problems in my threaded app that also deals with pixmaps. It will randomly hang (not a full crash per se). I have tried the modified blitz.mod posted by Mark, but I'm still having problems.


ima747(Posted 2010) [#34]
Here's an interesting dump I got from a tester. Still no code, I know; still working on that...

Call graph:
    2882 Thread_1175   DispatchQueue_1: com.apple.main-thread  (serial)
      2882 start
        2882 main
          2882 launchd_runtime
            2882 mach_msg
              2882 mach_msg_trap
    2882 Thread_1176
      2882 thread_start
        2882 _pthread_start
          2882 kqueue_demand_loop
            2882 select$DARWIN_EXTSN

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        mach_msg_trap        2882
        select$DARWIN_EXTSN        2882
Sample analysis of process 217 written to file /dev/stdout


This time thread 1 (not my thread; the one bmax runs, I assume, to trap events) seems to have found something more interesting to occupy its time....

Will keep trying to get a good example of some form of this hanging/crashing. It keeps manifesting in such different ways it's quite annoying.


jondecker76(Posted 2010) [#35]
Just a follow up:
Now running BMX v1.41

My MT code is now rock solid - but not all due to BMX 1.41.

In my case, it came back to the fact that OpenGL isn't 100% thread safe. My random crashes appear to have come from the fact that I was locking/unlocking mutexes around Max2D commands (mainly DrawImage, which turned out to be the biggest culprit).

My before code (pseudo) that would crash:
(Notice that I'm locking a mutex around an external c function, and around drawImage)
(Note that the Update() method happens in its own thread, and the Draw() method happens in the main thread)
Type TWebCam
	Field image:TImage
	Field pixmap:TPixmap
	...
	...

	Method Update()
		LockMutex(pixmapMutex)
			Self.pixmap.pixels=grab_frame() 'grab_frame is an external c function
		UnlockMutex(pixmapMutex)
		LockMutex(imageMutex)
			Self.image=LoadImage(Self.pixmap)
		UnlockMutex(imageMutex)
	End Method
	
	Method Draw(x:Int,y:Int)
		LockMutex(imageMutex)
			DrawImage(Self.image,x,y)
		UnlockMutex(imageMutex)
	End Method

End Type





AFTER:
Since the webcam image is returned as a pixmap, and I only need a TImage when it's drawn, I make one on the fly in my Draw method. Also notice that I no longer lock a mutex around the external c function, or the Max2D DrawImage() function...
Type TWebCam
	Field pixmap:TPixmap
	...
	...

	Method Update()
		Local grabbedPixmap:TPixmap=CreatePixmap(640,480)
	
		grabbedPixmap.pixels=grab_frame() 'grab_frame is an external c function
		
		LockMutex(pixmapMutex)
			Self.pixmap=grabbedPixmap
		UnlockMutex(pixmapMutex)
	End Method
	
	Method Draw(x:Int,y:Int)
		Local thisImage:TImage
		
	
		LockMutex(pixmapMutex)
			thisImage=LoadImage(Self.pixmap)
		UnlockMutex(pixmapMutex)
		
		DrawImage(thisImage,x,y)
	
	End Method
End Type




These simple changes have made my application 100% stable.

Ima747: Look for similar things in your MT code, and find a way around locking/unlocking mutexes around Max2D functions and external c functions. Then you will either fix your problem, or rule out the possibility that something you are threading isn't really thread safe...