Max crashes after a while!

Boiled Sweets(Posted 2006) [#1]
I've seen reports that Max crashes if left running for a while. This seems to be the case, as my MAX screen saver crashes after some hours.

WHY IS THIS?

WILL IT BE FIXED?


Space_guy(Posted 2006) [#2]
Well, I certainly do hope it will be fixed. This bug really stopped all the programming I did at work :(


Scott Shaver(Posted 2006) [#3]
I'd bet it has something to do with the garbage collection.


Dreamora(Posted 2006) [#4]
I think it is a similar issue to the hertz problem ...
I posted something on that on the bug thread but no one ever reacted.

The GC code has a passage where it raises a variable over and over again, by 500 each time. This variable is never lowered ... and the problem is that this happens in automatic mode, each time memory is allocated. That would be a potential crash reason, as the number is used for other stuff, so if it flips to -maxint ... you know what will happen.


kenshin(Posted 2006) [#5]
Sounds like they need to cap that variable or do the Abs thing. What variable are you referring to?


Boiled Sweets(Posted 2006) [#6]
Well this bug renders MAX useless for screen savers.

Why this product was ever released in this buggy state I'll never know, but it must have put a LOT of people off, me included!

I'll go back to Blitz3d until this is a stable and 'finished' product


kenshin(Posted 2006) [#7]
I see it. Look at this taken from blitz_gc.c
BBGCMem *bbGCAlloc( int sz,BBGCPool *pool ){
	if( (gc_mode & 3)==1 ){
		static int alloced;
		static int rate=500;
		alloced+=sz;
		if( alloced>(1024*1024) || buf_put-buf_base>rate ){
			collectMem(0);
			rate+=500-gc_objsfreed;
			alloced=0;
		}
	}else if( (gc_mode & 3)==3 ){
		collectMem(0);
	}
	BBGCMem *q=(BBGCMem*)bbGCMemAlloc( sz );
	q->pool=pool;
	q->refs=0;
	setMemBit( q );
	bbGCFree( q );
	return q;
}



Boiled Sweets(Posted 2006) [#8]
FIX IT BRL PLEASE


FlameDuck(Posted 2006) [#9]
FIX IT BRL PLEASE
How? Wave a magic wand? This isn't Harry Potter.

If you want something fixed, you're going to have to provide a lot more than vague and ambiguous directions and third-party speculation.

For starters, how about a manageable bit of source code that reproduces the issue? How about a little more detail about the bug? How about posting it in the correct forum?

Personally my immediate thoughts are that it's an end-user issue. Why? Because we've had a BlitzMAX server running for weeks at a time, and it only crashed because of flaws in our logic. I'm not saying you haven't found a real bug, but the least you can do is prove it.


Hambone(Posted 2006) [#10]
You can tell Flameduck is a "real" programmer.

User has problem with software not working correctly. Programmer says "prove it". User says "How, I didn't write it and am not a developer."

Programmer says "You are not worthy, go away and stop bothering me."

User goes away and finds software that works as expected from a programmer who realizes users are not beta testers unless they specifically volunteer and then the SW is usually provided for free.

My two cents.

Allan


Boiled Sweets(Posted 2006) [#11]
Hambone,

I am both, a user and a programmer, so I fully appreciate your thoughts.

As I understood it, there IS an existing *known* problem with MAX crashing after a period of time.

If you want me to prove that it happens, FlameDuck, run the 'safe' code of the screen saver framework for some hours...

Beyond that I have NO time to prove that MAX is not ready for public consumption -- all I know is it has a lot of problems and until sorted it is not usable.


GfK(Posted 2006) [#12]
I'm amazed that nobody on the BRL test team spotted this. All you have to do is leave a program running overnight and see what happens - standard practice in the games industry.

For starters, how about a manageable bit of source code that reproduces the issue? How about a little more detail about the bug?
Eikon has posted code in another thread which reproduces the error. That thread is now in the bugbin, amid claims from Mark that he's fixed it (when in reality it seems the cause remains but the effect has changed). I guess he didn't think that actually testing it out was a wise use of his time and resources, either.

Beyond that I have NO time to prove that MAX is not ready for public consumption -- all I know is it has a lot of problems and until sorted it is not usable.
Perhaps a little harsh, but as bugs go, this one is just about as bad as they get.


TomToad(Posted 2006) [#13]
GfK, that thread you point to describes a problem where Flip no longer syncs properly when left running for a while.
The problem Boiled Sweets is having is more similar to this:
http://www.blitzbasic.com/Community/posts.php?topic=58297
I don't see anywhere in that thread where Mark said he found the reason or fixed the problem. Tonyg mentioned in that thread that the two problems might be related, so I guess that's why everyone thought it was fixed.


Space_guy(Posted 2006) [#14]
Well, I believe (not yet tested) that this piece of code will crash Blitz if kept running for a long time.


Repeat
	Delay 1
Until KeyDown(key_escape)




Josepho(Posted 2006) [#15]
I thought this was fixed in version 1.20:

+ (BRL.Graphics) Fixed soft synced Flip so it doesn't overflow after 4+ hours.


Dreamora(Posted 2006) [#16]
That's the flip fix.
But the GC itself has a similar counter issue (see the code block above from _gc.c) that will rise above the maximum size, because it rises by 500 minus a small amount each time memory is allocated ... and many commands allocate memory, anything working with strings, for example.


TomToad(Posted 2006) [#17]
Dreamora, not true. It decreases every time that gc_objsfreed is more than 500, which is apparently a lot. I rewrote the modules so that I can keep an eye on what value rate is. So far I've run a test program for about 30 minutes and rate remains at about 500. Occasionally it will go up to over 1000, and occasionally down to about 400, but it is at 499-500-501 99% of the time.
I'm also doing some string functions at the same time so the GC is getting a workout.
Obviously this is not what's causing the slowdown.


Dreamora(Posted 2006) [#18]
Haven't gotten it over 500

You can call GCSetMode with 5 and you will see what value it had on the last collect run (the third-to-last value is objsFreed).

You would need to have 500 obj per collection run that went out of scope to be above that ... which theoretically can be several 10k obj ... and if you have that many going out of ref ...


TomToad(Posted 2006) [#19]
Well, my test program ran for 90 minutes and rate always stayed around 499-501. Occasionally it would spike to over 1000, but then drop back down again. So gc_objsfreed would have to be over 500 at some time. Could it be that mode 5 displays the average objects freed over several GCCollect calls?


TomToad(Posted 2006) [#20]
OK, after using GCSetMode 5, these are the first 10 lines printed:
GC collectMem: memFreed=1198, time=0ms, objsFreed=29, objsScanned=121, objsLive=34
GC collectMem: memFreed=16932, time=0ms, objsFreed=423, objsScanned=64, objsLive=2
GC collectMem: memFreed=30342, time=0ms, objsFreed=957, objsScanned=64, objsLive=2
GC collectMem: memFreed=16830, time=0ms, objsFreed=574, objsScanned=73, objsLive=11
GC collectMem: memFreed=14716, time=0ms, objsFreed=517, objsScanned=73, objsLive=12
GC collectMem: memFreed=14332, time=0ms, objsFreed=492, objsScanned=41, objsLive=4
GC collectMem: memFreed=14146, time=0ms, objsFreed=492, objsScanned=43, objsLive=4
GC collectMem: memFreed=14936, time=0ms, objsFreed=512, objsScanned=41, objsLive=4
GC collectMem: memFreed=14664, time=0ms, objsFreed=504, objsScanned=37, objsLive=1
GC collectMem: memFreed=14244, time=0ms, objsFreed=499, objsScanned=43, objsLive=4

As you can see, there are several times objsFreed was over 500. Let's step through it.
rate = 500
rate :+ 500 - 29 = 971
rate :+ 500 - 423 = 1048
rate :+ 500 - 957 = 591
rate :+ 500 - 574 = 517
rate :+ 500 - 517 = 500
rate :+ 500 - 492 = 508
rate :+ 500 - 492 = 516
rate :+ 500 - 512 = 504
rate :+ 500 - 504 = 500
rate :+ 500 - 499 = 501

After that, objsFreed seems to settle down and is either 499, 500, or 501 each time, keeping rate at around 500 every GCCollect.
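
The step-through above can be replayed mechanically. Here is a minimal C sketch, assuming only the update rule from the blitz_gc.c excerpt (rate += 500 - gc_objsfreed) and the ten objsFreed values from the log. The names simulate_rate and logged_objs_freed are made up for illustration and are not part of BlitzMax.

```c
#include <stddef.h>

/* objsFreed values copied from the ten GC log lines quoted above. */
static const int logged_objs_freed[10] = {
    29, 423, 957, 574, 517, 492, 492, 512, 504, 499
};

/* Applies "rate += 500 - objs_freed" once per collection.
   If history is non-NULL, the rate after each step is stored in it.
   Returns the final rate. (Hypothetical helper, for illustration only.) */
int simulate_rate(int rate, const int *objs_freed, size_t n, int *history)
{
    for (size_t i = 0; i < n; ++i) {
        rate += 500 - objs_freed[i];
        if (history != NULL) {
            history[i] = rate;
        }
    }
    return rate;
}
```

Calling simulate_rate(500, logged_objs_freed, 10, history) walks through the same values as the list above and ends at 501, so on this data rate hovers around 500 rather than climbing.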


Dreamora(Posted 2006) [#21]
Theory always looks good until you prove that it does not hold in reality:



This code, at least on WinXP Pro, will run until it reaches the point where the list should be replaced.
At that point BM locks completely but the CPU usage stays at 90%+.

If you replace the 100mb with 10mb, it won't lock but will hang for a few moments when reaching the border, then start with a negative rate again and fill up until it reaches the lock point again, and the game restarts. (I've added the functionality to output the rate in the regular GC debug output.) So you are right that the rate does not seem to have problems with negative values ... but it seems like the crashes are a GC problem when freeing, as the issue I have is quite similar to what was mentioned to be the bug problem (100% CPU usage but not doing anything anymore).


TomToad(Posted 2006) [#22]
Totally unrelated. You are creating 100 megs of objects for the list. The GC won't delete it because the objects are still referenced within the list. Each object is at least 4 bytes, plus each link in the list is going to need at least 12 bytes. So you are trying to allocate at least 1.5 GIGS of memory to hold the list! Most computers do not have that kind of memory; if they do, Windows most definitely won't leave that much for a program. When memory fills up, everything gets shoved to virtual memory and everything comes to a crawl.
If you notice, the bug that we're talking about happens even when nothing is being allocated. You can keep an eye on the memory in the task manager and see that memory usage does not increase over time.

Edit: I just reread your post and noticed that you said that negatives are not a problem, but I thought I'd leave this in because I feel it's still relevant.

As for rate becoming negative, this is not from an overflow problem. It's from such a huge amount of memory finally being freed that rate+=500-gc_objsfreed will actually return a negative value and is expected behavior.
Notice these two lines
GC collectMem: memFreed=10212300, time=223ms, objsFreed=512371, rate=22364, objsScanned=54, objsLive=15
GC collectMem: memFreed=0, time=0ms, objsFreed=0, rate=-489507, objsScanned=54, objsLive=15

Notice how 512371 objects are being freed? Now do the math: 22364 + 500 - 512371 = -489507, which is the rate in the next line. Also notice that the values get nowhere close to the overflow range, which for a signed 32-bit value begins past 2147483647. As for the slight delay when the objects are freed, that's to be expected. It takes time for the GC to do its job, and the more garbage it needs to clean up, the more time it will take. You are freeing 512371 objects, 10 megs of memory. The GC must scan each object individually and see if it's being referenced anywhere in your program, then free it from memory. Reminds me of the C64 days when BASIC programs would seem to freeze for half a minute while the GC did its job.

Now I'm not saying that it isn't a GC related problem. I don't have a clue, but it obviously has nothing to do with rate overflowing, nor with the example you are showing which is being caused by more memory being allocated than Windows can handle.
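
The arithmetic on those two log lines is easy to check by machine. A small C sketch, assuming rate is a 32-bit signed int as in the blitz_gc.c excerpt; next_rate and fits_in_int32 are hypothetical helper names, and 64-bit math is used so the check itself cannot overflow.

```c
#include <stdint.h>

/* One application of the update rule "rate += 500 - gc_objsfreed",
   done in 64-bit arithmetic so the result can be inspected safely. */
int64_t next_rate(int64_t rate, int64_t objs_freed)
{
    return rate + 500 - objs_freed;
}

/* Returns 1 if v is representable as a signed 32-bit integer, else 0. */
int fits_in_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

With the logged values, next_rate(22364, 512371) gives -489507, and fits_in_int32 confirms that this is well inside the 32-bit range, so the negative rate comes from one big collection and not from a wraparound.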


Dreamora(Posted 2006) [#23]
Seems like you are right ...
It just needs forever to free 20mb (on the first run after that it takes 0.5 to clean it ... really strange)


FlameDuck(Posted 2006) [#24]
If you want me to prove that it happens, FlameDuck, run the 'safe' code of the screen saver framework for some hours...
Got a link?

Programmer says "You are not worthy, go away and stop bothering me."
Or programmer says "If you can't tell me how it's broken, how do you expect me to fix it?"


Dreamora(Posted 2006) [#25]
FD: There have been several example codes that break after 5-7 hours so it was clearly shown how it is broken. the postings were not like "it breaks" and nothing else ... perhaps you missed that.
None of them used graphics, which means no flip sync problem either.


FlameDuck(Posted 2006) [#26]
FD: There have been several example codes that break after 5-7 hours so it was clearly shown how it is broken. the postings were not like "it breaks" and nothing else ... perhaps you missed that.
Perhaps I did. Since you may have suspected this, why didn't you post a link?

I don't think asking for a link to source code that demonstrates it is unreasonable. Like I've already said, we've had our source code running for weeks without slowdowns or crashes (that weren't our fault). In fact I've had code run for more than 10 hours that DID have a serious flaw in it, that didn't crash or slow down in that amount of time (of course it was pretty slow to begin with).

If this mystery slowdown really exists, I'd be as psyched as the other guy to nail it. Without code that reproduces the problem, and is demonstrably free of logic flaws, I don't see how that's possible.


Dreamora(Posted 2006) [#27]
The thread in the bugboard (there is one since 1.16, refreshed for 1.18) has 2 sources or even more that show the prob, so I didn't think an additional link would be needed.

Even in this thread there is one...
So far it seems like everything is able to break it after 5-7 hours ... (depending on the systems speed even longer) Most just don't let the stuff run that long.

The thread in bugboard has simple loops with peekevent(), the one here is a simple forever loop with a delay ...

The ideas on the reason so far have mostly been focused around a "flooded event queue" problem that lets BM loop forever ... which is somewhat reasonable, as some parts of BM changed between 1.12 and 1.16, like the internal event hook to handle user input and the like. That change did not (if I haven't missed it) introduce an automatic drop of events on the queue that have gone unused (say after a given amount of time), which perhaps might be needed to prevent the queue from being filled up and breaking. (I don't know how Windows handles events OS side ... but if it is a buffer, then it might fill up and create problems if the BM app did not remove the processed events from there.)


ImaginaryHuman(Posted 2006) [#28]
When you get to the end of the frame, two frames after having flipped the display and processed the relevant events, why not just wipe out the remaining old queued events by calling PeekEvent and PollEvent until PeekEvent returns Null?


FlameDuck(Posted 2006) [#29]
The ideas on the reason so far have mostly been focused around a "flooded event queue" problem that lets BM loop forever.
So what you're saying is that it is in fact a user error?


TomToad(Posted 2006) [#30]
I think I might have possibly found the problem. Won't know for sure until I do some testing. In the module brlmod/eventqueue.mod/eventqueue.bmx there are two variables that keep track of the queue: queue_put and queue_get. The first keeps track of where the next event is placed in the queue and the second controls where the next event is to be read, creating a FIFO queue of events. Now, the variables are incremented each time one of the poll functions is called (pollevent, waitevent, peekevent, etc...). The next space on the queue is determined through masking with 255, basically doing a fast queue_get Mod 256, so it will always point somewhere within the queue array. The problem is that these variables are only incremented and never decremented. Eventually they will overflow and become negative. When that happens, basically the queue will start to be read backwards instead of forwards. In theory, that is; I won't know for sure until I do some tests.
The solution to the problem is whenever the queue is empty, to reset both pointers to 0. So I did a find for queue_get=queue_put and put the lines
queue_get = 0
queue_put = 0
immediately after. Now I'll go ahead and let a program run overnight and see if it's crashed in the morning.
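
For readers who have not opened eventqueue.bmx, the scheme described above can be sketched in C. This is only an illustration under stated assumptions, not the BRL source: the EventQueue type, queue_post and queue_poll are hypothetical names, events are plain ints, a full queue simply rejects new events, and the counters are unsigned so that wraparound is well defined in C (the BlitzMax variables are signed Ints). The reset in queue_poll corresponds to the patch being tried here.

```c
#include <stdint.h>

#define QUEUE_SIZE 256u   /* power of two, so (n & 255) equals n Mod 256 */

typedef struct {
    int      events[QUEUE_SIZE];
    uint32_t put;   /* counts every event ever posted */
    uint32_t get;   /* counts every event ever read   */
} EventQueue;

/* Stores ev in the next free slot. Returns 1 on success, 0 if full. */
int queue_post(EventQueue *q, int ev)
{
    if (q->put - q->get == QUEUE_SIZE) {
        return 0;
    }
    q->events[q->put & (QUEUE_SIZE - 1u)] = ev;
    q->put++;
    return 1;
}

/* Reads the oldest event into *ev. Returns 1 on success, 0 if empty.
   When the queue is empty, both counters are reset to 0 (the patch). */
int queue_poll(EventQueue *q, int *ev)
{
    if (q->get == q->put) {
        q->get = 0;
        q->put = 0;
        return 0;
    }
    *ev = q->events[q->get & (QUEUE_SIZE - 1u)];
    q->get++;
    return 1;
}
```

Only the masked value is ever used as an array index, so the counters themselves can grow freely between resets.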


Grey Alien(Posted 2006) [#31]
TomToad: I look forward to your results.


Space_guy(Posted 2006) [#32]
As do I.


GfK(Posted 2006) [#33]
Me too...


Boiled Sweets(Posted 2006) [#34]
ME TOO! I didn't realise that this thread would run and run but I'm pleased it has...


TomToad(Posted 2006) [#35]
Test failed :(
After a while I began to realize that negative numbers are stored in two's complement, so that value & 255 should continue to give correct results. But it was worth a try anyway.
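
The two's complement point can be shown with a short C sketch. It assumes the counter behaves like a 32-bit signed integer that wraps on overflow; slot_for and wrapping_increment are hypothetical names. The increment is done on uint32_t because signed overflow is undefined behaviour in C, and the cast back to int32_t relies on the usual wrapping conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* Slot chosen by "counter & 255". int32_t is two's complement by
   definition, so this works for negative counters as well. */
int slot_for(int32_t counter)
{
    return (int)(counter & 255);
}

/* Adds 1 with wraparound, so INT32_MAX becomes INT32_MIN.
   The addition is performed on the unsigned type to avoid undefined
   behaviour; the cast back is implementation-defined before C23. */
int32_t wrapping_increment(int32_t counter)
{
    return (int32_t)((uint32_t)counter + 1u);
}
```

INT32_MAX selects slot 255 and the wrapped value INT32_MIN selects slot 0, so the slot sequence continues 254, 255, 0, 1 straight across the overflow and the queue is not read backwards.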


GfK(Posted 2006) [#36]
Oh tittybiscuits.... :/


Grey Alien(Posted 2006) [#37]
bah ;-)


Dreamora(Posted 2006) [#38]
Seems like we will really need to extend the GC and other brl.blitz components to the max with debug output to find out how / why it breaks ...


GfK(Posted 2006) [#39]
Either that or we can stop clutching at straws and hope somebody from BRL at least acknowledges this is a huge problem.

I simply wouldn't release a game knowing it was going to crash after a couple of hours.


Dreamora(Posted 2006) [#40]
Me as well.
I'm really shocked that the core components never were stability tested before they added functionality ...
The only thing possible now is let it run with frameworks and add framework after framework till it crashes (and if it crashes on framework brl.blitz we know that it is the worst bug ever unfixed since Intel Floatingcalc bug as it is a main core bug)


Amon(Posted 2006) [#41]
I'd like to know what the deal with this is.

Is it a bug? Is it a user error?

To test things out I left BoiledSweets chilli screensaver running overnight and when I went back to the computer in the morning everything was fine.

I'm leaning towards it being a bug with user code.


Boiled Sweets(Posted 2006) [#42]
That's interesting, Amon. I'll do the same again tonight.


Dreamora(Posted 2006) [#43]
Amon: If you find time, could you please test it with a non max2d app or even one of those mentioned in the bugthread? (simple peekevent apps)
*I will see if I can find a night to test that again myself ... I only have a notebook that I carry with me, which is why it normally only gets 4-5 hours of running unused*


TomToad(Posted 2006) [#44]
You know, it might end up being a problem internal to the compiler. Which means you won't find the solution in one of the modules.


assari(Posted 2006) [#45]
I can do some testing. Can someone provide a link or code to test?


TomToad(Posted 2006) [#46]
This is the original thread http://www.blitzbasic.com/Community/posts.php?topic=58297
The code in the OP crashes my system sometime after 5 hours.


assari(Posted 2006) [#47]
thanks. I will let it run overnight


marksibly(Posted 2006) [#48]
Hi,

Yes we are aware of these reports and are looking into it.

We have already fixed one 'soft syncing' bug and are currently trying to reproduce the 'PollEvent' example linked to above.

The PollEvent example is strange in that there is actually very little Blitz code executing. Since there is no Window, there are no events to post so there is not even any GC activity occurring!

And please note that we have tested running Max programs for long periods of time - just not *all* Max programs, and perhaps not that recently...!


Boiled Sweets(Posted 2006) [#49]
Ran the screen saver framework for 17 hours - no crash.


marksibly(Posted 2006) [#50]
Hi,

How does your screen saver differ from your screen saver framework?!?


Boiled Sweets(Posted 2006) [#51]
Mark,

not too sure what you mean. I used the screen saver framework from the archives...

http://www.blitzbasic.com/codearcs/codearcs.php?code=1677

Now I know I started the thread saying Max crashes after a while and did see that my screen saver (HEAVILY based on the framework) wasn't running in the morning (the MAX application icon was still in the task bar but the app wasn't visible). I was certain some other people had reported a crash after several hours, so I posted this thread.

I have since run it overnight and it has not crashed. This is inconclusive I guess, but there still seems to be a known issue.

Sorry for any confusion, keep up the good work!


assari(Posted 2006) [#52]
I ran the code posted here www.blitzbasic.com/Community/posts.php?topic=58297 for about 7 hrs with the following results:-

Windows XP SP 2 1GB RAM, 1.7GHz Intel Radeon 9800 Pro (BlitzMax 1.20)
I tried both debug and non-debug builds. Both times the machine hung.

Windows XP MCE 1GB RAM, Athlon X2 64 3800+ Nvidia 7600GT(BlitzMax 1.20)
CPU was at 50%. Machine did not hang


Mark Tiffany(Posted 2006) [#53]
CPU was at 50%. Machine did not hang

But presumably only because you've got 2 CPUs. Max was still thrashing the one CPU it was running on, and the other could handle anything else you did. Hence 50%...


FlameDuck(Posted 2006) [#54]
Interesting. Maybe it has something to do with the windows task scheduler screwing up in some way? I can't really think of other reasons it doesn't crash on dual core / hyperthreading CPU.

Maybe it's related to this?


Dreamora(Posted 2006) [#55]
It crashes on DC / HT as well, but because they have 2 CPUs (either real or virtual), you won't notice it directly until you use something that would use both cores, which will then run at half speed compared to normal.


RocketGnome(Posted 2006) [#56]
Ok.. this might be silly to say...

But here goes...

Windows sent down an update a few days ago.

I know at work, I've had issues with auto-update hanging apps, and sometimes locking the PC.

When you reboot, Windows applies the update, so it's not immediately apparent why the lock-up occurred.

Is it possible that the lock-up occurred during the recent windows update deployment?

Or is there some other auto-updating software such as anti-virus, etc. on the PC?


GfK(Posted 2006) [#57]
Windows sent down an update a few days ago.

I know at work, I've had issues with auto-update hanging apps, and sometimes locking the PC.

When you reboot, Windows applies the update, so it's not immediately apparent why the lock-up occurred.

Is it possible that the lock-up occurred during the recent windows update deployment?
Not in my case. I don't let Windows install updates by itself - it just tells me when there are some.

Or is there some other auto-updating software such as anti-virus, etc. on the PC?
Actually, yes. I have AVG running, and it does a complete system scan daily at 8am.

Quite sure that isn't the cause, but I'll run my code overnight again tonight, without AVG running.


WendellM(Posted 2006) [#58]
I have AVG running

Do you have its Resident Shield feature enabled? I ask because at work a while back our PC would sometimes (rarely) not save a file when it should, and the saving app would crash. It turned out that the problem was having McAfee's virus scanner set up to scan files when they were read/written (which sounds like what AVG's resident shield does). Turning off that "feature" of the McAfee antivirus software stopped the problem.

This might not be applicable in your case (or with AVG) at all, but thought I'd mention it just in case.


*(Posted 2006) [#59]
Hopefully this can be fixed otherwise writing server/client apps for a 24/7 server would be next to useless in max as on some machines it would hang.

I hope this gets fixed soon :)


(tu) ENAY(Posted 2006) [#60]
Alternatively, you could put in the instructional manual of your game.

"When using this product please rest for 5-10 minutes for every hour of play (plus a reboot of your PC wouldn't hurt..)"


GfK(Posted 2006) [#61]
OK, game is now running on my other PC and will remain so all night. I've disabled AVG, will report what happens in the morning.

I'm not feeling very optimistic, but we'll see....


GfK(Posted 2006) [#62]
Well, it's not AVG that's affecting it. Slowed right down again.


skidracer(Posted 2006) [#63]
We've been able to reproduce the problem in the office, which is, I suspect, halfway to finding the cause.


Grey Alien(Posted 2006) [#64]
aha well that's progress, good.


GfK(Posted 2006) [#65]
Yup.

If you wanna try out various bits of code to try to eliminate/identify the problem then I'm sure there are plenty of people here who would be happy to let their PCs run all night (myself included).


Boiled Sweets(Posted 2006) [#66]
Me too


Space_guy(Posted 2006) [#67]
That's great news. This bug has been haunting me.


skidracer(Posted 2006) [#68]
OK, have hopefully fixed a problem in freeaudio where the playback position calculation was overflowing. Please syncmods and let us know how you get on with another soak test.


GfK(Posted 2006) [#69]
Running a test now, will report back later.


GfK(Posted 2006) [#70]
Game's been running for 8.5 hours now, no slowdown to report.

Think you've nailed this one at last!!

:)


GfK(Posted 2006) [#71]
10 hours now. All's good. :)


Grey Alien(Posted 2006) [#72]
This seems positive. We just need more tests now.


Space_guy(Posted 2006) [#73]
It failed here even after the freeaudio update, although I'll give it another try.


GfK(Posted 2006) [#74]
Failed how?


Space_guy(Posted 2006) [#75]
In the same way it always fails: it eats all the processor power it can after x hours.

I will however test again, hopefully tomorrow.


GfK(Posted 2006) [#76]
Hm... try rebuilding all modules and turning off Quickbuild.


GfK(Posted 2006) [#77]
I'm doing another soaktest tonight to double check.

Got my BlitzMax game + BlitzMax screensaver running on one PC, and just the screensaver running on the other.

Hope all's well with both PCs in the morning...


Space_guy(Posted 2006) [#78]
Good luck :) I will start my tests in a few hours.


GfK(Posted 2006) [#79]
Everything's fine here again. No problems to report.


assari(Posted 2006) [#80]
I re-ran the test code on 1 PC last night and did not experience the crash as before. So the fix appears to have worked.

Will try on another PC later.


Space_guy(Posted 2006) [#81]
No crashes here!

:)

Thank you so much for the fix. Finally I can go back to working on my work project!


Grisu(Posted 2006) [#82]
Ahm, something connected to this.

Will a timer (CreateTimer) crash an app when running for hours?


Space_guy(Posted 2006) [#83]
Not according to my test. I have a timer in it.


Grisu(Posted 2006) [#84]
Thank you!