Debug build runs 2x faster!!?
BlitzMax Forums/BlitzMax Programming/Debug build runs 2x faster!!?
| ||
I'm working on my first commercial application in many years. I'm about 75% complete, so today I decided to turn the debug build option off to see how much faster it would run, and how much smaller the executable would be. To my surprise, with debug turned off, my application runs about half as fast as it does with debug on! (about 35 FPS with debug on, and only 15 FPS with debug off) I'm building on a Linux box (Ubuntu 10.04) and my application is threaded. My initial thoughts are that perhaps there is a bug with Threads in linux in that they only work with debug enabled... (I'm guessing this because my application ran at about 15 FPS before threading with debug enabled, and 35 FPS with threading with debug enabled.... Now that debug is disabled, it almost acts like it lost the performance gains of threading) EDIT: Just did some more verification... Threading appears to act different in Linux depending on whether or not Debug build is enabled. Can mod move this to bug reports? I can't post the code, as this is a commercial project - though I would send it to mark himself to verify FYI: This is a Photobooth application, and I threaded only the part which grabs the images from the attached camera. Since the camera runs at 15 FPS, it caused my entire program to run at 15FPS if that part isn't threaded... This is the reason for the huge performance gain using threads in my application |
| ||
Ok, whipped up a quick example illustrating that there is indeed a difference in threading with and without debug enabled... (although in this example, threading seems to do nothing with Debug build enabled, but in my other application, the effect is opposite!!) Could others run this, and post their results for: A) THREADED=False debug disabled B) THREADED=False debug enabled C) THREADED=True debug disabled D) THREADED=True debug enabled I'll go first... A) 5 fps B) 3 FPS C) 36 FPS D) 6 FPS ...Here is the test code... 'Threaded build bug demonstration 'Shows that threads behave differently depending on whether or not Debug build is enabled! 'As a side effect, shows how to make your program so that it can be run threaded or non-threaded! SuperStrict 'Use threads or no?? Change this to see that: 'A) The program will run SLOW while THREADED=false 'B) The program will run fast (with slow pixmap updates) while THREADED=true and Dubug build is on 'C) The program will run SLOW AGAIN while THREADED=true and Debug build is off Const THREADED:Int = False Global updateThread:TThread Global updateMutex:TMutex If THREADED updateMutex=CreateMutex() EndIf 'This is the target pixmap that will be "randomized" by the UpdatePixmap() function Global pixmap:TPixmap=CreatePixmap(640,480,PF_RGB888) 'Here is a function that is designed to opdate a pixmap at about 5-10 FPS 'If put inline, the entire program will run at about 5-10 FPS! 'If threaded, then the pixmap will still only update at 5-10FPS, but the rest of the 'program in the main thread will run as fast as possible. Function UpdatePixmap:Object(data:Object) Local r:Int,g:Int,b:Int Local rgb:Int Local x:Int,y:Int Repeat If updateMutex Then LockMutex(updateMutex) For x = 0 To PixmapWidth(pixmap)-1 For y = 0 To PixmapHeight(pixmap)-1 r=Rnd(255) g=Rnd(255) b=Rnd(255) rgb=r*$10000 + g*$100 + b WritePixel(pixmap,x,y,rgb) Next Next If updateMutex Then UnlockMutex(updateMutex) Delay(100) Until String(data)<>"" 'The way we will call the function, is pass "" if we are threaded, and anything else if we are not threaded so we fall through! End Function Graphics 640,480 'We are THREADED, so lets thart the update thread! If THREADED updateThread=CreateThread(UpdatePixmap,"") EndIf While Not KeyDown(KEY_ESCAPE) Cls 'We are NOT THREADED, so lets call UpdatePixmap every iteration! (bogus string parameter is necessary to tell the function that we are not being called as a thread!) If Not THREADED UpdatePixmap("blah") EndIf 'Lets draw the pixmap - make sure we lock/unlock our Mutex if we are threaded! If updateMutex Then LockMutex(updateMutex) DrawPixmap(pixmap,0,0) If updateMutex Then UnlockMutex(updateMutex) 'Now lets do something else so that we can see the framerate difference! SetColor 255,255,255 DrawOval(320+(Sin(MilliSecs()/10)*100.0),240,20,20) UpdateFrameCounter() 'FPS calculations.... DisplayInfo() Flip End While 'FPS Code Global FramesPerSec:Int Global FrameCounter:Int Global fpsStart:Int Function UpdateFrameCounter() FrameCounter:+1 If MilliSecs() > fpsStart + 1000 fpsStart = MilliSecs() FramesPerSec = FrameCounter FrameCounter = 0 EndIf End Function Function DisplayInfo() Local isThreaded:String Local isDebug:String If THREADED isThreaded="TRUE" Else isThreaded="FALSE" EndIf isDebug="FALSE" ?debug isDebug="TRUE" ? SetColor 0,0,0 DrawRect 0,0,150,70 SetColor 255,255,255 DrawText "FPS: " + FramesPerSec,10,10 DrawText "Threaded: " + isThreaded,10,30 DrawText "Debug: " + isDebug,10,50 End Function |
| ||
Test machine: Windows 7 Ultimate (64bit) on a Lenovo Z61p, Intel Core2 2GHz CPU, 3GB RAM A) THREADED=False debug disabled - 5 FPS B) THREADED=False debug enabled - 2 FPS C) THREADED=True debug disabled - 38 FPS D) THREADED=True debug enabled - 2 FPS |
| ||
wow, 38FPS down to 2FPS just by enabling debug on a threaded application! This surely has to be some kind of bug! (I'm still scratching my head as to why on my real world application, its actually faster in dubug, and slows to a crawl when debug is disabled...) |
| ||
not really a bug at least not from this example that primarily is a good collection of DONT and NEVER DO in relation to threading. |
| ||
I still think its a bug, debug build simply should not incur a 15x performance loss. On top of that, my sample above does show that threaded applications behave quite differently depending on whether debug build is set or not (I mean how can we do a debug build to test our development if it drags it down to 2FPS???). Even on top of that, in an application I am currently writing, Debug build runs 2x faster than non-debug on a threaded application (though I haven't been able to do a simple example that reproduces this exact behavior) - this again shows that something weird is happening with threaded apps depending on whether debug build is set or not. Again, I'd like to have a mod move this to Bug Reports (as I'd rather not cross-post) thanks |
| ||
@ dreamora what is wrong with his code? in other words, what did he do that goes under dont and never do? |
| ||
Debug will always run more slowly than Release... That is the nature of having all the extra debug program flow. If you have LOTS of inefficient loops, debug will appear to run much more slowly than release. |
| ||
yes, but not 15x slower (30+ FPS down to 2 FPS) on a SIMPLE example. Also, the example I posted isn't filled with inefficient loops. This is common sense that something isn't right, seriously. And, once again I will say that I also have a program I am working on where the opposite is true (dubug runs 2x faster with another threaded app). The point is that it is inconsistent, and behaves differently depending on whether Debug build is set or not. |
| ||
Also, the example I posted isn't filled with inefficient loops. I beg to differ... Your writepixel loop is very inefficient... in fact, it takes much longer than 100 millisecs to run on my PC... In debug mode, you will end up with something along the lines of : Basic execution : 307,200 times debugs to determine pixmap height + width : 614,400 times Debug per statement : at least 1 call each (it's much more in most cases), of 5 statements... 1,536,000 Minimal total per call : 2,457,600 Also, Rnd() is not the most efficient function in the world. A slightly modified example of the function : Local w:Int = PixmapWidth(pixmap) Local h:Int = PixmapHeight(pixmap) Repeat Local mil:Int = MilliSecs() If updateMutex Then LockMutex(updateMutex) For x = 0 Until w For y = 0 Until h 'r=Rnd(255) 'g=Rnd(255) 'b=Rnd(255) 'rgb=r*$10000 + g*$100 + b 'WritePixel(pixmap,x,y,rgb) Next Next If updateMutex Then UnlockMutex(updateMutex) Local dif:Int = MilliSecs() - mil If dif <= 100 Then DebugLog "delay..." Delay(dif) End If jumps from 2 fps to 9, (in debug with thread off) with the regular output of "delay" on the console to indicate that in fact the loop took less than 100 milliseconds. Debug binaries are for debugging - therefore there is a lot more going on behind the scenes. Release binaries are pure, "your code only" binaries, without all the extra variable/stack/etc tracking. |
| ||
Anyway, I rewrote it because your design is poor... As dreamora alludes to, you aren't doing yourself any favours using mutexes like that... and I get these results : A) THREADED=False debug disabled - 7 FPS B) THREADED=False debug enabled - 2 FPS C) THREADED=True debug disabled - solid 60 FPS :-) D) THREADED=True debug enabled - 4-5 FPS variable Here's the revised code : Note that we aren't using a shared pixmap now (oh, the very thought of it!) Instead I've implemented a basic buffering-type. Even with the bit of overhead for creating/freeing of pixmaps, it runs significantly faster. Of course it can be done much better than my basic example, but it shows how an improved design can result in improved performance... |
| ||
I think you're missing the point. The write pixel loop is SUPPOSED to be inefficient in this example (to get it to run at a slow rate) - hence why I even added a delay at the end of the loop - to force it to be even slower. This way I am able to show that sometimes your program simply has to deal with the fact that some other code may be inefficient - for example, in my Photobooth application, camera frames are returned at about 15FPS - beyond my control. I deal with this "Inefficiency" by threading that portion of the code that reads from the camera so that my entire application doesn't run at 15 FPS. And it works great with DEBUG ON - when I go to release mode, it stops working so great and I'm back to 15FPS. This demonstration was supposed to show that, but didn't - but I still found it odd that a 15x performance loss was experienced. I'm still looking for a reason for my original problem of debug running 2x faster than release on my photobooth application. (As a note, my photobooth application in no way resembles the example - the only loop in the photobooth application happens in c code where it converts from YUV to RGB24 before it is even touched in BMX, so there are no loops at all in my real-world program on the BlitzMax side of things) EDIT: Also, I dont' want to sound thankless - thank you very much for taking the time to respond. Your suggestions on using a separate pixmap (buffered or otherwise) and only locking the Mutex when assigning it to the main pixmap's handle is very good and indeed much more efficient. I will be sure to double check my Photobooth code for these types of efficiencies as I get to the point where I'm optimizing things! |
| ||
release your app with debug on... |
| ||
I'm still looking for a reason for my original problem of debug running 2x faster than release on my photobooth application Well, my first assumption would be that there are bugs in your code. |
| ||
I'm building on a Linux box (Ubuntu 10.04) can you build on windows or mac?? |
| ||
ok guys, I think I found my problem. For reasons unknown to me, I ported function which converts YUV to RGB23 from c to BMX and that fixed the oddity of debug running faster, though I'm still not sure why... Also, by doing so I went from 15 FPS to 60 FPS with debug off, and 36 FPS to 40 FPS with debug on. Strange! Thanks for the help, it really is appreciated |
| ||
Bugs, eh? :-) |
| ||
could very well be, but it was a 100% direct port of the code. The only difference is that before, I was grabbing the raw pixel information like:lockmutex(thisMutex) self.pixmap.pixels=grab_Frame() unlockmutex(thisMutex) where grab_frame was in a c file, and its self called a c function YUV2RGB24() on the raw data. Now, I do: tempPixmap:TPixmap=createpixmap(dev_Width,dev_height,PF_RGB888) tempPixmap.pixels=grab_frame() '(the c function grab_frame no longer does the YUV2RGB24 conversion) tempPixmap=YUV2RGB24(tempPixmap) lockmutex(thisMutex) self.pixmap=tempPixmap unlockmutex(thisMutex) so it turns out to be that the Mutex is locked probably too long while the YUV_>RGB24 conversion was done in C within the grab_frame function, slowing everything up. The speedup wasn't due to the port of the conversion to BMX per se, it was due more to the fact that i'm locking the mutex for a far shorter period of time. It doesn't really explain the weird results I was getting between debug/release modes, but has cured the problem none the less. Either way, your suggestions were most helpful, as you can see I incorporated your advice that it would be faster to incur the overhead of creating a new pixmap and locking only while copying the pixmap over, than to keep the mutex locked during the entire conversion. |
| ||
the Mutex is locked probably too long ... , slowing everything up. yes, I figured as much... When I first ran your code - without paying the source much attention - I was puzzled why it would be slower to run with threading enabled... My changes weren't really about optimizing things, but rather fixing the basic design. |