Debug build runs 2x faster!!?

BlitzMax Forums/BlitzMax Programming/Debug build runs 2x faster!!?

jondecker76(Posted 2010) [#1]
I'm working on my first commercial application in many years. I'm about 75% complete, so today I decided to turn the debug build option off to see how much faster it would run, and how much smaller the executable would be.

To my surprise, with debug turned off, my application runs about half as fast as it does with debug on! (about 35 FPS with debug on, and only 15 FPS with debug off)

I'm building on a Linux box (Ubuntu 10.04) and my application is threaded. My initial thoughts are that perhaps there is a bug with Threads in linux in that they only work with debug enabled... (I'm guessing this because my application ran at about 15 FPS before threading with debug enabled, and 35 FPS with threading with debug enabled.... Now that debug is disabled, it almost acts like it lost the performance gains of threading)


EDIT:
Just did some more verification... Threading appears to act different in Linux depending on whether or not Debug build is enabled. Can mod move this to bug reports? I can't post the code, as this is a commercial project - though I would send it to mark himself to verify

FYI: This is a Photobooth application, and I threaded only the part which grabs the images from the attached camera. Since the camera runs at 15 FPS, it caused my entire program to run at 15FPS if that part isn't threaded... This is the reason for the huge performance gain using threads in my application


jondecker76(Posted 2010) [#2]
Ok, whipped up a quick example illustrating that there is indeed a difference in threading with and without debug enabled... (although in this example, threading seems to do nothing with Debug build enabled, but in my other application, the effect is opposite!!)

Could others run this, and post their results for:
A) THREADED=False debug disabled
B) THREADED=False debug enabled
C) THREADED=True debug disabled
D) THREADED=True debug enabled

I'll go first...
A) 5 fps
B) 3 FPS
C) 36 FPS
D) 6 FPS


...Here is the test code...
'Threaded build bug demonstration
'Shows that threads behave differently depending on whether or not Debug build is enabled!
'As a side effect, shows how to make your program so that it can be run threaded or non-threaded!
SuperStrict

'Use threads or no?? Change this to see that:
'A) The program will run SLOW while THREADED=false
'B) The program will run fast (with slow pixmap updates) while THREADED=true and Dubug build is on
'C) The program will run SLOW AGAIN while THREADED=true and Debug build is off
Const THREADED:Int = False

Global updateThread:TThread
Global updateMutex:TMutex
If THREADED
	updateMutex=CreateMutex()
EndIf


'This is the target pixmap that will be "randomized" by the UpdatePixmap() function
Global pixmap:TPixmap=CreatePixmap(640,480,PF_RGB888)

'Here is a function that is designed to opdate a pixmap at about 5-10 FPS
'If put inline, the entire program will run at about 5-10 FPS!
'If threaded, then the pixmap will still only update at 5-10FPS, but the rest of the
'program in the main thread will run as fast as possible.
Function UpdatePixmap:Object(data:Object)
	Local r:Int,g:Int,b:Int
	Local rgb:Int
	Local x:Int,y:Int

	Repeat
		If updateMutex Then LockMutex(updateMutex)
		
		For x = 0 To PixmapWidth(pixmap)-1
			For y = 0 To PixmapHeight(pixmap)-1
				r=Rnd(255)
				g=Rnd(255)
				b=Rnd(255)
				rgb=r*$10000 + g*$100 + b
				WritePixel(pixmap,x,y,rgb)
			Next
		Next
		If updateMutex Then UnlockMutex(updateMutex)
	
		Delay(100)
	Until String(data)<>"" 'The way we will call the function, is pass "" if we are threaded, and anything else if we are not threaded so we fall through!
End Function


Graphics 640,480

'We are THREADED, so lets thart the update thread!
If THREADED
	updateThread=CreateThread(UpdatePixmap,"")
EndIf

While Not KeyDown(KEY_ESCAPE)
	Cls
	
	'We are NOT THREADED, so lets call UpdatePixmap every iteration! (bogus string parameter is necessary to tell the function that we are not being called as a thread!)
	If Not THREADED
		UpdatePixmap("blah")
	EndIf
	
	'Lets draw the pixmap - make sure we lock/unlock our Mutex if we are threaded!
	If updateMutex Then LockMutex(updateMutex)
	DrawPixmap(pixmap,0,0)
	If updateMutex Then UnlockMutex(updateMutex)
	
	'Now lets do something else so that we can see the framerate difference!
	SetColor 255,255,255
	DrawOval(320+(Sin(MilliSecs()/10)*100.0),240,20,20)
	
	UpdateFrameCounter() 'FPS calculations....
	DisplayInfo()
	Flip
End While

'FPS Code
Global FramesPerSec:Int
Global FrameCounter:Int
Global fpsStart:Int
Function UpdateFrameCounter()
  FrameCounter:+1
  If MilliSecs() > fpsStart + 1000
    fpsStart = MilliSecs()
    FramesPerSec = FrameCounter
    FrameCounter = 0
  EndIf
End Function

Function DisplayInfo()
	Local isThreaded:String
	Local isDebug:String
	
	If THREADED
		isThreaded="TRUE"
	Else
		isThreaded="FALSE"
	EndIf
	
	isDebug="FALSE"
	?debug
	isDebug="TRUE"
	?
	SetColor 0,0,0
	DrawRect 0,0,150,70
	SetColor 255,255,255
	DrawText "FPS: " + FramesPerSec,10,10
	DrawText "Threaded: " + isThreaded,10,30
	DrawText "Debug: " + isDebug,10,50
	 
End Function



Muttley(Posted 2010) [#3]
Test machine: Windows 7 Ultimate (64bit) on a Lenovo Z61p, Intel Core2 2GHz CPU, 3GB RAM

A) THREADED=False debug disabled - 5 FPS
B) THREADED=False debug enabled - 2 FPS
C) THREADED=True debug disabled - 38 FPS
D) THREADED=True debug enabled - 2 FPS


jondecker76(Posted 2010) [#4]
wow, 38FPS down to 2FPS just by enabling debug on a threaded application! This surely has to be some kind of bug! (I'm still scratching my head as to why on my real world application, its actually faster in dubug, and slows to a crawl when debug is disabled...)


Dreamora(Posted 2010) [#5]
not really a bug at least not from this example that primarily is a good collection of DONT and NEVER DO in relation to threading.


jondecker76(Posted 2010) [#6]
I still think its a bug, debug build simply should not incur a 15x performance loss. On top of that, my sample above does show that threaded applications behave quite differently depending on whether debug build is set or not (I mean how can we do a debug build to test our development if it drags it down to 2FPS???). Even on top of that, in an application I am currently writing, Debug build runs 2x faster than non-debug on a threaded application (though I haven't been able to do a simple example that reproduces this exact behavior) - this again shows that something weird is happening with threaded apps depending on whether debug build is set or not.

Again, I'd like to have a mod move this to Bug Reports (as I'd rather not cross-post)

thanks


Nate the Great(Posted 2010) [#7]
@ dreamora what is wrong with his code? in other words, what did he do that goes under dont and never do?


Brucey(Posted 2010) [#8]
Debug will always run more slowly than Release... That is the nature of having all the extra debug program flow. If you have LOTS of inefficient loops, debug will appear to run much more slowly than release.


jondecker76(Posted 2010) [#9]
yes, but not 15x slower (30+ FPS down to 2 FPS) on a SIMPLE example. Also, the example I posted isn't filled with inefficient loops. This is common sense that something isn't right, seriously.

And, once again I will say that I also have a program I am working on where the opposite is true (dubug runs 2x faster with another threaded app). The point is that it is inconsistent, and behaves differently depending on whether Debug build is set or not.


Brucey(Posted 2010) [#10]
Also, the example I posted isn't filled with inefficient loops.

I beg to differ...

Your writepixel loop is very inefficient... in fact, it takes much longer than 100 millisecs to run on my PC... In debug mode, you will end up with something along the lines of :
 Basic execution     : 307,200 times
 debugs to determine pixmap height + width : 614,400 times
 Debug per statement : at least 1 call each (it's much more in most cases), of 5 statements... 1,536,000

 Minimal total per call : 2,457,600


Also, Rnd() is not the most efficient function in the world.


A slightly modified example of the function :
	Local w:Int = PixmapWidth(pixmap)
	Local h:Int = PixmapHeight(pixmap)

	Repeat
		Local mil:Int = MilliSecs()
		If updateMutex Then LockMutex(updateMutex)
		
		For x = 0 Until w
			For y = 0 Until h
				'r=Rnd(255)
				'g=Rnd(255)
				'b=Rnd(255)
				'rgb=r*$10000 + g*$100 + b
				'WritePixel(pixmap,x,y,rgb)
			Next
		Next
		If updateMutex Then UnlockMutex(updateMutex)
	
		Local dif:Int = MilliSecs() - mil
	
		If dif <= 100 Then
		DebugLog "delay..."
			Delay(dif)
		End If

jumps from 2 fps to 9, (in debug with thread off) with the regular output of "delay" on the console to indicate that in fact the loop took less than 100 milliseconds.

Debug binaries are for debugging - therefore there is a lot more going on behind the scenes. Release binaries are pure, "your code only" binaries, without all the extra variable/stack/etc tracking.


Brucey(Posted 2010) [#11]
Anyway, I rewrote it because your design is poor... As dreamora alludes to, you aren't doing yourself any favours using mutexes like that...

and I get these results :

A) THREADED=False debug disabled - 7 FPS
B) THREADED=False debug enabled - 2 FPS
C) THREADED=True debug disabled - solid 60 FPS :-)
D) THREADED=True debug enabled - 4-5 FPS variable


Here's the revised code :


Note that we aren't using a shared pixmap now (oh, the very thought of it!)
Instead I've implemented a basic buffering-type. Even with the bit of overhead for creating/freeing of pixmaps, it runs significantly faster.

Of course it can be done much better than my basic example, but it shows how an improved design can result in improved performance...


jondecker76(Posted 2010) [#12]
I think you're missing the point. The write pixel loop is SUPPOSED to be inefficient in this example (to get it to run at a slow rate) - hence why I even added a delay at the end of the loop - to force it to be even slower. This way I am able to show that sometimes your program simply has to deal with the fact that some other code may be inefficient - for example, in my Photobooth application, camera frames are returned at about 15FPS - beyond my control. I deal with this "Inefficiency" by threading that portion of the code that reads from the camera so that my entire application doesn't run at 15 FPS. And it works great with DEBUG ON - when I go to release mode, it stops working so great and I'm back to 15FPS. This demonstration was supposed to show that, but didn't - but I still found it odd that a 15x performance loss was experienced.

I'm still looking for a reason for my original problem of debug running 2x faster than release on my photobooth application. (As a note, my photobooth application in no way resembles the example - the only loop in the photobooth application happens in c code where it converts from YUV to RGB24 before it is even touched in BMX, so there are no loops at all in my real-world program on the BlitzMax side of things)


EDIT:
Also, I dont' want to sound thankless - thank you very much for taking the time to respond. Your suggestions on using a separate pixmap (buffered or otherwise) and only locking the Mutex when assigning it to the main pixmap's handle is very good and indeed much more efficient. I will be sure to double check my Photobooth code for these types of efficiencies as I get to the point where I'm optimizing things!


Zeke(Posted 2010) [#13]
release your app with debug on...


Brucey(Posted 2010) [#14]
I'm still looking for a reason for my original problem of debug running 2x faster than release on my photobooth application

Well, my first assumption would be that there are bugs in your code.


Zeke(Posted 2010) [#15]
I'm building on a Linux box (Ubuntu 10.04)

can you build on windows or mac??


jondecker76(Posted 2010) [#16]
ok guys, I think I found my problem. For reasons unknown to me, I ported function which converts YUV to RGB23 from c to BMX and that fixed the oddity of debug running faster, though I'm still not sure why...
Also, by doing so I went from 15 FPS to 60 FPS with debug off, and 36 FPS to 40 FPS with debug on. Strange!

Thanks for the help, it really is appreciated


Brucey(Posted 2010) [#17]
Bugs, eh? :-)


jondecker76(Posted 2010) [#18]
could very well be, but it was a 100% direct port of the code. The only difference is that before, I was grabbing the raw pixel information like:

lockmutex(thisMutex)
self.pixmap.pixels=grab_Frame()
unlockmutex(thisMutex)


where grab_frame was in a c file, and its self called a c function YUV2RGB24() on the raw data.

Now, I do:
tempPixmap:TPixmap=createpixmap(dev_Width,dev_height,PF_RGB888)
tempPixmap.pixels=grab_frame() '(the c function grab_frame no longer does the YUV2RGB24 conversion)
tempPixmap=YUV2RGB24(tempPixmap)
lockmutex(thisMutex)
self.pixmap=tempPixmap
unlockmutex(thisMutex)


so it turns out to be that the Mutex is locked probably too long while the YUV_>RGB24 conversion was done in C within the grab_frame function, slowing everything up. The speedup wasn't due to the port of the conversion to BMX per se, it was due more to the fact that i'm locking the mutex for a far shorter period of time. It doesn't really explain the weird results I was getting between debug/release modes, but has cured the problem none the less. Either way, your suggestions were most helpful, as you can see I incorporated your advice that it would be faster to incur the overhead of creating a new pixmap and locking only while copying the pixmap over, than to keep the mutex locked during the entire conversion.


Brucey(Posted 2010) [#19]
the Mutex is locked probably too long ... , slowing everything up.

yes, I figured as much...

When I first ran your code - without paying the source much attention - I was puzzled why it would be slower to run with threading enabled...
My changes weren't really about optimizing things, but rather fixing the basic design.