Unhandled memory exception error.


sswift(Posted 2009) [#1]

Global Pixels:Byte[10000*4]
Global P:Byte Ptr
Global P2:Int Ptr
Global R%, G%, B%, RGB%
Global T1%, T2%, T3%

T1 = MilliSecs()

P = Pixels

Repeat
	
	R = P[0]
	P = P + 1
	B = P[0] 
	P = P + 1
	G = P[0]
	P = P + 2
	
Until Int(P) = 10000*4

T2 = MilliSecs()

P2 = Int Ptr(Byte Ptr(Pixels))

Repeat
	
	RGB = P2[0]

	R = (RGB Shr 16) & 255 
	G = (RGB Shr 8 ) & 255
	B = (RGB       ) & 255

	P2 = P2 + 1		
	
Until Int(P2) = 10000

T3 = MilliSecs()


Print T2-T1
Print T3-T2


I'm trying to do a speed test to see if it's faster to read RGB values from a memory location using a byte pointer, or whether it's better to read an int and then use bit masks to get the values.

But I'm having some trouble figuring out how to get an int pointer to an array's data. The above code compiles, but gives me an error, so I think the conversion may be wrong. But perhaps I'm just accessing beyond the end of the array by mistake. I'm not sure.

Any ideas?


sswift(Posted 2009) [#2]
Hm, with debug mode enabled, it would appear I've screwed things up right from the get-go:

P = Pixels

Repeat
	
	R = P[0]


That's where the error occurs. So for some reason P isn't the pointer to my array data, or I don't understand how to dereference a pointer correctly.


sswift(Posted 2009) [#3]
Argh, nothing I try works!

P = Varptr Pixels[0]

Repeat
	
	R = P[0]


This doesn't work either!


Brucey(Posted 2009) [#4]
P is never going to equal 10000*4.


sswift(Posted 2009) [#5]
You're right. That was foolish of me.

But right now, the problem isn't with that part of the code, but rather simply accessing the data in that array.

In the debugger, it tells me Global Pixels:Byte[]=$something, but P = $somethingelse. One would assume these two values should be the same.

Yet when I print P[0] I get the value 255, which I have placed in Pixels[0]. So it seems like it's reading the right location.

But when I try to set R = P[0] I get an unhandled memory exception error, and I can't figure out why.


sswift(Posted 2009) [#6]


Ugh this code is becoming more and more of a mess. I wish pointer usage in BlitzMax was documented better.


sswift(Posted 2009) [#7]
Ah, I see why it was giving me the error now. I assumed it was failing on the first pass through the loop, since it always errored on that first line.

But in reality, it was looping a whole bunch of times before I got the error. The error must have been occurring the moment I tried to read past the end of the array.

Dur.


sswift(Posted 2009) [#8]
Well, assuming I did this right:



It would seem that reading ints and doing all that math is around 10% faster than incrementing a byte pointer, and both methods would take around 3/4 of a second to process an image of around 3000x3000.

Hm... gotta be a way to speed this up more.


sswift(Posted 2009) [#9]


Hm.. I had the debugger on before, and I changed the way I exit the loops.

Now I get speeds of 124, 64, and 134, meaning the int pointer version is fully twice as fast, and a 3000x3000 image would actually take less than 1/10th of a second to process.


ImaginaryHuman(Posted 2009) [#10]
Hey, looks like you figured this out mostly yourself ;-D ... I would agree that reading Ints is pretty much the fastest. For me, writing Doubles is also faster than writing 2 Ints, but Ints are by far the fastest to read, and it's definitely beneficial to read a whole Int and break it up with a few commands, especially when your code is small enough that it'll be in a cache.

One thing I thought you were doing wrong is that when you go Byte Ptr=Array, I don't know if that's really valid. An array is actually an object, a `type`, which has within it a Field where the memory address of the array data is kept. So a pointer to the array itself is a pointer to an array object, not to the contents of the field within the array type.

I would more typically do:

MyBytePtr:Byte Ptr=VarPtr(MyArray[0])

Which will ensure that you are pointing to the location in memory of the first element of the array.

Maybe you can actually just do BytePtr=Array but to me I would think that would start writing to memory based on the location of the type, not the array's actual storage space.

To speed it up, unroll your loops a bit. Do at least 4 pixels at a time in a loop - this will reduce your loop overhead. Also consider pipelining multiple memory accesses. If you read from memory and then your very next instruction uses the value which was read, this creates what's called a pipeline stall - the second command can't run until the memory has been read. I'd recommend getting ahead of yourself on the reads, like start by reading two memory locations, then start processing the first location, then when you're ready to process the second one read the third one first. This might help the cpu to be able to do some operations while it's waiting for memory accesses to finish.

I also see that you are using lots of Globals, which forces variables to be stored in memory and not CPU registers. You should use as many Local variables as possible, and you should make sure they are declared outside of any loops so that the memory management won't have to keep re-allocating them.

Also when you're &'ing 255 with stuff, I think that forces it to store the 255 as immediate values in the instruction opcodes, or in memory, so I would recommend you put your `mask` into a local variable, like Local ByteMask:Int=$FF, once outside all loops, then use that in your and'ing.

Since you're reading RGB values I presume you're actually using ARGB format and are skipping the A? Because that's what your code is doing.

You might also try putting the values 8 and 16 into Locals and then shift with them - not sure if that works out better or not, but my guess is that each time you define a constant value it has to store it somewhere separately each time and access that location each time, so if you put it into a single local it presumably won't have to read as much memory, or could be in a single register.

If you do several unrolled loops, you could also do like =P[0] =P[1] =P[2] =P[3] etc rather than extra adds to P. Since you're giving an offset anyway, you might as well use that to increment the position then just do P:+4 at the end, for example.

Okay, so you made me do it...
SuperStrict

Local Bank:TBank = CreateBank(10000000*4) 'Must be divisible by 16 for unrolled loop
Local Pixels:Byte[10000000*4]	'Must be divisible by 16 for unrolled loop
Local P:Byte Ptr, EP:Byte Ptr
Local P2:Int Ptr, EP2:Int Ptr
Local R%, G%, B%, RGB%, RGB2%
Local T1%, T2%, T3%, T4%


T1 = MilliSecs()

P = Pixels
EP = P + 10000000*4

Repeat
	
	R = P[0]	
	P = P + 1
	B = P[0] 
	P = P + 1
	G = P[0]
	P = P + 2

Until P = EP

T2 = MilliSecs()

P2  = Int Ptr(Byte Ptr(Pixels))
EP2 = P2 + 10000000

Local ByteMask:Int=$000000FF:Int
Local Sixteen:Int=16
Local Eight:Int=8
While P2<EP2
	
	RGB = P2[0]
	RGB2 = P2[1]
	
	R = (RGB Shr Sixteen) & ByteMask
	G = (RGB Shr Eight) & ByteMask
	B = RGB & ByteMask
	
	RGB = P2[2]
	
	R = (RGB2 Shr Sixteen) & ByteMask
	G = (RGB2 Shr Eight) & ByteMask
	B = RGB2 & ByteMask
	
	RGB2 = P2[3]
	
	R = (RGB Shr Sixteen) & ByteMask
	G = (RGB Shr Eight) & ByteMask
	B = RGB & ByteMask

	R = (RGB2 Shr Sixteen) & ByteMask
	G = (RGB2 Shr Eight) & ByteMask
	B = RGB2 & ByteMask

	P2 = P2 + 4		
	
Wend

T3 = MilliSecs()

P  = LockBank(Bank)
EP = P + 10000000*4

Repeat
	
	R = P[0]	
	P = P + 1
	B = P[0] 
	P = P + 1
	G = P[0]
	P = P + 2

Until P = EP

T4 = MilliSecs()

UnlockBank(Bank)

Print T2-T1
Print T3-T2
Print T4-T3

Notice that purely by switching to Locals, coupled with unrolling the loop 4 times, this routine is now over 50 times faster! The byte routines, unmodified, are 7 times faster.

Originally I got measurements of 97/56/122; now I get 13/1/13! Hopefully I didn't do something wrong, but if that's correct it's massively faster.


sswift(Posted 2009) [#11]
Thanks for all that. I used to do stuff like unrolling loops years ago, but with all the cache modern PCs have and all the instruction speedups that have been made, I didn't think it really had much of an effect any more.

Also the pipelining thing is new to me.

What's up with this though?


Also when you're &'ing 255 with stuff, I think that forces it to store the 255 as immediate values in the instruction opcodes, or in memory, so I would recommend you put your `mask` into a local variable, like Local ByteMask:Int=$FF, once outside all loops, then use that in your and'ing.



It's faster to access a value in a local variable, presumably stored in a register here, than to access a value stored in an instruction?


sswift(Posted 2009) [#12]
Okay I just did a new test:




It would seem that simply switching to local values resulted in the speed of the int ptr array loop going from 64ms to 8ms.

Your unrolling and pipeline stuff further increases the speed to 2ms.

So just using locals resulted in an 8x speed increase, and unrolling the loop and avoiding pipeline collisions sped that up by another 4x.

In theory this code could now process a 3000x3000 image in 2ms, or 500 images per second. Though the pipelining and unrolling comes at the cost of requiring all image data to be divisible by 4, which adds a bit more complexity.

I wonder how much putting those masks in variables affects the speed of the unrolled loop...


sswift(Posted 2009) [#13]
Putting the byte masks in variables appears to have no effect on the speed.


sswift(Posted 2009) [#14]
I've also found that the pipeline optimization has no effect here. Simply unrolling the array int ptr loop at the bottom four times makes it the same speed as the loop with the pipeline optimization and variable masks.




sswift(Posted 2009) [#15]
But anyway, that doesn't matter. What matters is the unrolling and the use of local variables provides a huge speed boost. :-)


sswift(Posted 2009) [#16]
Just for the heck of it, I tried unrolling the loop more:




So it turns out that unrolling the loop 4x results in a 4x speedup, but unrolling it 16x resulted in a 24x speedup, possibly due to some instruction alignment occurring.

And for the heck of it, I tried unrolling it 32x. That resulted in a 48x speedup.

The only way any of this makes sense I guess is if the instructions in the loop are insanely fast, and the jump instructions are insanely slow. Because it seems to me that unrolling even 64 or 128 times could possibly be worth it.

Doesn't make for very nice code though.

BlitzMax doesn't have some way to inline or define a block of code which would make this any cleaner, does it?

As for how to handle images whose size isn't divisible by whatever the unroll factor is, I guess you can just put a normal loop at the beginning or end to handle the extra. Since that will only be a few bytes it won't impact the speed.


ImaginaryHuman(Posted 2009) [#17]
I don't think it has a way built in, no. Removing more of the loop overhead also removes more `branch prediction` misses in the cpu, perhaps.

I don't know if putting constants into locals is any faster; it might not be, it might even be slower. You'll have to speed-test that. Seems you did that for the masks, anyway.

It's such a tight loop, called thousands/millions of times for images, that I really think it's worth doing the unrolling thing. One thing I suggest is to make a large loop which does, say, 32, 64, or even 256 iterations unrolled. Do that as many times as you can within the memory limits, then finish off the last few bytes that aren't aligned to those numbers in a separate regular loop. The `leftovers` will have a negligible effect on speed overall.

Interesting about the pipelining having no apparent effect. I've seen it have an effect on my iMac on a number of occasions, but perhaps that's usually when you're doing a bit more math/manipulation in between memory reads. Don't forget you are only reading/writing data right now; when you add in any kind of `processing` of the pixel values, that'll decrease the speed.


sswift(Posted 2009) [#18]
The thing about unrolling the loop is this isn't the only loop I'll have in my app where I have to process an image. There's gonna be at least 15 different functions that have to process the images. Unrolling them makes them more difficult to edit, so if I do that then it's going to be hard to go back and make changes later.


sswift(Posted 2009) [#19]
Oh ho!

x1.bmx:
x = x +1



includetest.bmx:
Global X%

Function Test()

Include "x1.bmx"
Include "x1.bmx"
Include "x1.bmx"

End Function

Test()

Print X



Building includetest
Compiling:includetest.bmx
flat assembler version 1.66
3 passes, 2019 bytes.
Linking:includetest.exe
Executing:includetest.exe
3

Process complete


sswift(Posted 2009) [#20]
Hm... including like that works, but without the ability to increment the pointer array index, it doesn't solve my problem.

It's kinda strange too, that doing:

Pointer[0]
Pointer[1]
Pointer[2]

Pointer + 3

IS slower than:

Pointer + 1
Pointer + 1
Pointer + 1

What I mean is, Pointer[2] adds two to the pointer. Internally, that should be an add operation, and Pointer[0] doesn't add anything. So incrementing the pointer after each operation should be just as fast as indexing with constants and then incrementing the pointer once at the end.

But instead it's like 5x slower.

Is there some other way to dereference a pointer besides accessing it like an array?


ImaginaryHuman(Posted 2009) [#21]
Interesting ... I wonder if at a lower level (i.e. asm) the pointer[0]'s all convert to a straight memory access, whereas pointer[1] or above requires an indirect offset to be added, i.e. a different addressing mode.

The only way to dereference is like an array with square brackets. I have no idea if this'll work but what happens if you do pointer[] ?


sswift(Posted 2009) [#22]
Doing that gives an error.