Same code, different speeds.
BlitzMax Forums/BlitzMax Programming/Same code, different speeds.
| ||
SuperStrict Local T0%, T1%, T2%, T3% Local SrcPtr:Byte Ptr, SrcPtr2:Float Ptr Local Loop%, Loops% = 10000 Local Offset% Local F# SrcPtr = MemAlloc(40000*4) T0 = MilliSecs() For Loop = 1 To Loops For Offset = 0 To 160000-1 F# = Float Ptr(SrcPtr)[Offset] Next Next T1 = MilliSecs() For Loop = 1 To Loops For Offset = 0 To 160000-1 F# = Float Ptr(SrcPtr)[Offset] Next Next T2 = MilliSecs() Print "Test1 = " + (T1-T0) Print "Test2 = " + (T2-T1) First loop, 2 seconds. Second loop, 1.3 seconds. I have a guess as to why it's occuring, (the code of one is badly aligned) but it shouldn't be. |
| ||
Happens in this example too, twice:SuperStrict Local T0%, T1%, T2%, T3%, T4% Local SrcPtr:Byte Ptr[3], SrcPtr2:Float Ptr Local Loop%, Loops% = 10000 Local Offset% Local F# SrcPtr[0] = MemAlloc(40000*4) T0 = MilliSecs() For Loop = 1 To Loops For Offset = 0 To 160000-1 F# = Float Ptr(SrcPtr[0])[Offset] Next Next T1 = MilliSecs() For Loop = 1 To Loops For Offset = 0 To 160000-1 F# = Float Ptr(SrcPtr[0])[Offset] Next Next T2 = MilliSecs() For Loop = 1 To Loops SrcPtr2 = Float Ptr(SrcPtr[0]) For Offset = 0 To 160000-1 F# = SrcPtr2[Offset] Next Next T3 = MilliSecs() For Loop = 1 To Loops SrcPtr2 = Float Ptr(SrcPtr[0]) For Offset = 0 To 160000-1 F# = SrcPtr2[Offset] Next Next T4 = MilliSecs() Print "Test1 = " + (T1-T0) Print "Test2 = " + (T2-T1) Print "Test3 = " + (T3-T2) Print "Test4 = " + (T4-T3) |
| ||
Well, if it was me, I'd be a tad worried about accessing memory I hadn't allocated... But that's only me. |
| ||
Your code doesn't even run for me. You malloc enough memory for 40000 floats, then try to index through 160,000 of them. |
| ||
Indeed. |
| ||
Well I fixed and ran sample 1. 10k iterations through 40k floats. Debug runs around 16.1 and 16.2 seconds. Release runs at around 220ms for both tests +/- 5ms either way, but I'd like to dig into the asm because I very much doubt it's doing the iterations at all because F is never referenced outside the loops. So, I changed "F =" in the inner loop to "F :+" , then printed the results outside the loop. Now there's no way the compiler could optimise these loops away. [--EDIT--] I now know it is doing the iterations. [--EDIT--] I made some other minor changes too, like turning off GC and resetting F outside of the timed code. Both tests now run around 1950ms. Test 1 is always few ms slower, but app cache is not in the same state 2 seconds after starting (when test 2 runs) as it is at the very beginning. |
| ||
Well, if it was me, I'd be a tad worried about accessing memory I hadn't allocated... Whoops. I converted code that used byte pointers and I made a boo boo. :-) |
| ||
I made some other minor changes too, like turning off GC and resetting F outside of the timed code. Both tests now run around 1950ms. Test 1 is always few ms slower, but app cache is not in the same state 2 seconds after starting (when test 2 runs) as it is at the very beginning. Those changes you made may well have changed the offset of the compiled code, which would invalidate the test. However, I did try simply adding a Print F# to the very end of the code I posted. In theory that should not affect the offsets of the code above, yet it affected the output. The two loops run at roughly the same speed with that one change. Presumably some kind of optimization was going on. I tried adding another loop for the heck of it but that one runs at the same speed too. So I dunno. |
| ||
small guess for your first example: cpu lvl2 cache caches the content for the second loop |
| ||
However, I did try simply adding a Print F# to the very end of the code I posted. In theory that should not affect the offsets of the code above, yet it affected the output. The two loops run at roughly the same speed with that one change. Presumably some kind of optimization was going on. I tried adding another loop for the heck of it but that one runs at the same speed too. So I dunno. (bolded by me) Just for clarity: Did you fix that initial alloc-error too? So it was two changes in total? Or did you run it with the to small alloc but with only "Print F#" added? |
| ||
I ran your initial code with only the alloc line changed like this: SrcPtr = MemAlloc(160000 * SizeOf(F)) (release compile inside blide) Testrun#1: Test1 = 653 Test2 = 617 Testrun#2: Test1 = 630 Test2 = 611 Testrun#3: Test1 = 628 Test2 = 613 |
| ||
Just for clarity: Did you fix that initial alloc-error too? So it was two changes in total? Or did you run it with the to small alloc but with only "Print F#" added I changed the loops to only loop 40,000 times too. But that change alone didn't make them run at the same speed. |
| ||
My guess also is that the data is at least partly in the cache and loads faster. |
| ||
My guess also is that the data is at least partly in the cache and loads faster. I find this explanation very plausible. Look at my results when i run with 3 loops: Run#1: Test1 = 637 Test2 = 626 Test3 = 628 Run#2: Test1 = 628 Test2 = 623 Test3 = 619 Run#3: Test1 = 638 Test2 = 621 Test3 = 620 |