Same code, different speeds.

sswift(Posted 2009) [#1]
SuperStrict

Local T0%, T1%, T2%, T3%
Local SrcPtr:Byte Ptr, SrcPtr2:Float Ptr
Local Loop%, Loops% = 10000
Local Offset%
Local F#

SrcPtr = MemAlloc(40000*4) 


T0 = MilliSecs()

	For Loop = 1 To Loops

		For Offset = 0 To 160000-1
			F# = Float Ptr(SrcPtr)[Offset]
		Next
		
	Next
	
	
T1 = MilliSecs()
	
	For Loop = 1 To Loops

		For Offset = 0 To 160000-1
			F# = Float Ptr(SrcPtr)[Offset]
		Next
		
	Next

	
T2 = MilliSecs()

Print "Test1 = " + (T1-T0)
Print "Test2 = " + (T2-T1)


First loop, 2 seconds. Second loop, 1.3 seconds.

I have a guess as to why it's occurring (the code of one loop is badly aligned), but it shouldn't be.


sswift(Posted 2009) [#2]
Happens in this example too, twice:

SuperStrict

Local T0%, T1%, T2%, T3%, T4%
Local SrcPtr:Byte Ptr[3], SrcPtr2:Float Ptr
Local Loop%, Loops% = 10000
Local Offset%
Local F#

SrcPtr[0] = MemAlloc(40000*4) 

T0 = MilliSecs()

	For Loop = 1 To Loops

		For Offset = 0 To 160000-1
			F# = Float Ptr(SrcPtr[0])[Offset]	' pointer cast repeated inside the inner loop
		Next
		
	Next

	
T1 = MilliSecs()
		
	For Loop = 1 To Loops
		
		For Offset = 0 To 160000-1
			F# = Float Ptr(SrcPtr[0])[Offset]
		Next
		
	Next

	
T2 = MilliSecs()

	For Loop = 1 To Loops

		SrcPtr2 = Float Ptr(SrcPtr[0])	' pointer cast hoisted out of the inner loop
		
		For Offset = 0 To 160000-1
			F# = SrcPtr2[Offset]	
		Next
		
	Next
	
T3 = MilliSecs()

	For Loop = 1 To Loops

		SrcPtr2 = Float Ptr(SrcPtr[0])	
		
		For Offset = 0 To 160000-1
			F# = SrcPtr2[Offset]	
		Next
		
	Next
	
T4 = MilliSecs()

Print "Test1 = " + (T1-T0)
Print "Test2 = " + (T2-T1)
Print "Test3 = " + (T3-T2)
Print "Test4 = " + (T4-T3)



Brucey(Posted 2009) [#3]
Well, if it was me, I'd be a tad worried about accessing memory I hadn't allocated...

But that's only me.


matibee(Posted 2009) [#4]
Your code doesn't even run for me. You malloc enough memory for 40000 floats, then try to index through 160,000 of them.
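The fix is one or the other (just a sketch of the two obvious options):

' Option A: grow the buffer to cover the 160,000 floats being indexed
SrcPtr = MemAlloc(160000 * 4)

' Option B: keep the buffer and shrink the loop to the 40,000 floats actually allocated
For Offset = 0 To 40000 - 1
	F = Float Ptr(SrcPtr)[Offset]
Next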


Brucey(Posted 2009) [#5]
Indeed.


matibee(Posted 2009) [#6]
Well, I fixed and ran sample 1: 10k iterations through 40k floats. Debug runs at around 16.1 and 16.2 seconds.

Release runs at around 220ms for both tests, +/- 5ms either way, but I'd like to dig into the asm, because I very much doubt it's doing the iterations at all when F is never referenced outside the loops.

So, I changed "F =" in the inner loop to "F :+", then printed the results outside the loop. Now there's no way the compiler could optimise these loops away.

[--EDIT--] I now know it is doing the iterations. [--EDIT--]

I made some other minor changes too, like turning off GC and resetting F outside of the timed code.
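Roughly, each timed section ends up like this (a sketch of the changes rather than the exact code; GCSuspend/GCResume is one way of keeping the collector out of it):

GCSuspend()	' keep the garbage collector out of the timed region

F = 0	' reset the accumulator outside the timed code
T0 = MilliSecs()

For Loop = 1 To Loops
	For Offset = 0 To 40000 - 1
		F :+ Float Ptr(SrcPtr)[Offset]	' accumulate instead of assign
	Next
Next

T1 = MilliSecs()
GCResume()

Print "Test1 = " + (T1 - T0)
Print "F = " + F	' printing F outside the loop means the compiler can't discard it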

Both tests now run at around 1950ms. Test 1 is always a few ms slower, but the app's cache state 2 seconds after starting (when test 2 runs) isn't the same as it is at the very beginning.




sswift(Posted 2009) [#7]
Well, if it was me, I'd be a tad worried about accessing memory I hadn't allocated...


Whoops. I converted code that used byte pointers and I made a boo boo. :-)
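In other words, a units mix-up: a Byte Ptr offset counts bytes, a Float Ptr offset counts 4-byte floats, so the old 160,000 bound is 4x too big once the pointer type changes. For illustration (names made up):

Local BP:Byte Ptr = MemAlloc(40000 * 4)	' 40,000 floats = 160,000 bytes
Local LastByte:Byte = BP[160000 - 1]	' byte offsets run 0..159999
Local LastFloat:Float = Float Ptr(BP)[40000 - 1]	' float offsets run 0..39999
MemFree(BP)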


sswift(Posted 2009) [#8]
I made some other minor changes too, like turning off GC and resetting F outside of the timed code.

Both tests now run at around 1950ms. Test 1 is always a few ms slower, but the app's cache state 2 seconds after starting (when test 2 runs) isn't the same as it is at the very beginning.



Those changes you made may well have changed the offset of the compiled code, which would invalidate the test.

However, I did try simply adding a Print F# to the very end of the code I posted. In theory that should not affect the offsets of the code above, yet it affected the output. The two loops run at roughly the same speed with that one change. Presumably some kind of optimization was going on. I tried adding another loop for the heck of it but that one runs at the same speed too. So I dunno.
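i.e. the only addition was at the very end, after the timing prints, roughly:

Print "Test1 = " + (T1-T0)
Print "Test2 = " + (T2-T1)
Print F	' the added line: F now has a use outside the loops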


Kurator(Posted 2009) [#9]
Small guess for your first example: the CPU's L2 cache already holds the data when the second loop runs.
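
An easy way to check that would be to touch the whole buffer once before T0, so both timed loops start with a warm cache (a sketch against the corrected first example):

' Warm-up pass: read every float once before any timing starts
For Offset = 0 To 40000 - 1
	F = Float Ptr(SrcPtr)[Offset]
Next

T0 = MilliSecs()
' ... timed loops as before ...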


Mahan(Posted 2009) [#10]


However, I did try simply adding a Print F# to the very end of the code I posted. In theory that should not affect the offsets of the code above, yet it affected the output. The two loops run at roughly the same speed with that one change. Presumably some kind of optimization was going on. I tried adding another loop for the heck of it but that one runs at the same speed too. So I dunno.



(bolded by me)

Just for clarity: Did you fix that initial alloc error too? So it was two changes in total? Or did you run it with the too-small alloc but with only "Print F#" added?


Mahan(Posted 2009) [#11]
I ran your initial code with only the alloc line changed like this:

SrcPtr = MemAlloc(160000 * SizeOf(F))

(release compile inside BLIde)

Testrun#1:

Test1 = 653
Test2 = 617


Testrun#2:

Test1 = 630
Test2 = 611


Testrun#3:

Test1 = 628
Test2 = 613


sswift(Posted 2009) [#12]
Just for clarity: Did you fix that initial alloc error too? So it was two changes in total? Or did you run it with the too-small alloc but with only "Print F#" added?


I changed the loops to only loop 40,000 times too. But that change alone didn't make them run at the same speed.


ImaginaryHuman(Posted 2009) [#13]
My guess also is that the data is at least partly in the cache and loads faster.


Mahan(Posted 2009) [#14]

My guess also is that the data is at least partly in the cache and loads faster.



I find this explanation very plausible. Look at my results when I run with 3 loops:

Run#1:
Test1 = 637
Test2 = 626
Test3 = 628

Run#2:
Test1 = 628
Test2 = 623
Test3 = 619

Run#3:
Test1 = 638
Test2 = 621
Test3 = 620