Using a large look up array

BlitzMax Forums/BlitzMax Programming/Using a large look up array

Jur(Posted 2012) [#1]
I am trying to optimize pixmap operations and I have found out that a big bottleneck in such operations is conversion from floats to integers. Calculations usually needs some divisions so floats are utilized, but at the end you always need to write integer (or byte) into a color value.

To get rid of divisions I made a big array (16MB) with all possible variants for divisions. With this method only integer operations are performed but also a lot of the table look ups. Result: it works now 200-300% faster! That is a huge improvement! But... i have read that such method can results in cache thrashing and that, I read, is not a good thing. Can anybody recommend me what to do? I have a great improvement in performance so i would like to keep that method, but I dont want any weird problems later.

Here is the code - functions are for alpha blending of two RGBA pixmaps of same size with additional opacity for both. There are two versions - an ordinary (with divisions) and the one which read from look up table.




Yasha(Posted 2012) [#2]
Calculations usually needs some divisions so floats are utilized


Why? This line:

Local A:Float =  255-(255-Atop)*(255-Abottom)/255


...will put a whole number in A, because the operations were all integer operations anyway.

The next three lines will generate fractional values, because now they're using all-float operations, but since the values are cast straight back to integer when the results get put in bottomPtr[n], that fractional part goes unused.

In other words, perhaps I'm just astonishingly bad at arithmetic, but is there a reason why you can't just declare A:Int instead of A:Float and be done with it?


Jesse(Posted 2012) [#3]
why don't you use a byte ptr to store the divide table instead of arrays, Arrays are objects and object carry a bit of overhead which makes them quite a bit slower than byte pointers. You can allocate bytes of memory with "MemAlloc" which would be more memory efficient and a bit faster to process. Even if you have to make the two dimension calculations, it will still be quite a bit faster.

Last edited 2012


Jur(Posted 2012) [#4]
@Yasha
I have a float so I dont need to check for divide by zero. But calculations are not critical either with floats or integers as long as there is no conversion (or casting) from float to integer. I have made a test and with a pixmap of 1000*1000 pixels all calculation are performed within 1 millisecs. It was the last step in the loop - casting a float value into a byte pointer which took 220 ms.

@Jesse
Nice idea. I made another function which now use "memory table" and it is indeed faster. Here are all variants (+ added a function which use ReadPixel/WritePixel just to compare):



The results in millisecond (release mode, Athlon X2 3600+):

WritePixel/ReadPixel time :1366
Normal time :844
ArrayTable time :340
MemoryTable time :313

Last edited 2012


Jur(Posted 2012) [#5]
But the original dilemma was about the cash thrashing. Actually I dont know what that really is, but I assume something like that : a value is read from memory and written in cash for quicker access. If there is too much random values written into cash constantly, then the cash need so much updating that the entire system slows down.
The question is can such a thing occur by heavy use of look up tables in a way I use them?


JBR(Posted 2012) [#6]
WritePixel/ReadPixel time :7841
Normal time :7740
ArrayTable time :58
MemoryTable time :59

This is on an i7 2700K & 4.8GHz :-)


ImaginaryHuman(Posted 2012) [#7]
Can you try using fixed point instead of float lookup tables? Then you can use a very fast Shr or Shl to convert between fixed point and regular integers.

e.g. If you only need values 0..255 in your image data, then perhaps use 16:16 fixed point.

Reading 120 from the pixmap becomes (120 Shl 16), writing 120 to the pixmap becomes (120 Shr 16). Then you can do division, multiplication and other math also just by using integers, and bypass the need for a lookup table which is surely slower than working purely with integer locals.


Jur(Posted 2012) [#8]
Hm, I know nothing about fixed point but I will check it out. Having only local integer variables would be the best solution. Actually that would be so good that I doubt it is possible (without some serious drawbacks).


ImaginaryHuman(Posted 2012) [#9]
?

It depends only on what values you work with. If all your numbers are in the 0..255 range, or even in a 0..32767 range, you can use 16:16 fixed point format. I bet it's faster than what you're doing with memory accesses.


Yasha(Posted 2012) [#10]
There is no difference (in implementation) between fixed point and integers. A fixed-point number is an integer + scale factor.

In this case, it just means an integer Shl 16. The important thing is that it would still be vulnerable to hardware exceptions.

In my not-remotely-humble opinion, if division by zero is going to be an issue, I would suggest making sure the divisor is never zero (e.g. maybe make it one instead?). Otherwise, you're stuck.


BinaryBurst(Posted 2012) [#11]
Try this version. Got it the straight way. It is faster than any of the above versions, but I'll post the assembly version which should make things a lot faster. Oh, by the way, the bottleneck is memory latency because the pixmap is too big to fit in the cache.


Global DivideTable:Byte[][]
DivideTable = DivideTable[..256*256]
For Local i:Int=0 Until DivideTable.length
	DivideTable[i] = DivideTable[i][..256]
Next

For Local value:Int=0 Until DivideTable.length
	For Local faktor:Int=0 Until DivideTable[value].length
		If faktor>0 Then
			Local div:Int = value/faktor
			div = Min(div,255)
			div = Max(div,0)
			DivideTable[value][faktor] = Byte(div)
		EndIf
	Next
Next


Global DivideMemory:Byte Ptr = MemAlloc(256*256*256)
Local offset:Int
For Local j:Int=0 Until 256
	For i:Int=0 Until 65536
		If j>0 Then 
			div:Int = i/j
			div = Min(div,255)
			div = Max(div,0)
			DivideMemory[offset] = Byte(div)
		Else
			DivideMemory[offset] = 0	
		EndIf
		offset:+1
	Next
Next


Local Pixmap:TPixmap=CreatePixmap(2000,2000, PF_RGBA8888)
Local PixmapA:TPixmap = CopyPixmap(Pixmap)
Local PixmapB:TPixmap = CopyPixmap(Pixmap)
Local PixmapC:TPixmap = CopyPixmap(Pixmap)
Local PixmapD:TPixmap = CopyPixmap(Pixmap)
Local PixmapE:TPixmap = CopyPixmap(Pixmap)

Local Pixmap2:TPixmap=CreatePixmap(2000,2000, PF_RGBA8888)


Local time:Int=MilliSecs()
PastePixmap_ReadWritePixel(PixmapA,0.6,Pixmap2,0.2)
Print "WritePixel/ReadPixel time :"+(MilliSecs()-time)

time:Int=MilliSecs()
PastePixmap_Normal(PixmapB,0.6,Pixmap2,0.2)
Print "Normal time :"+(MilliSecs()-time)

time:Int=MilliSecs()
PastePixmap_ArrayTable(PixmapC,0.6,Pixmap2,0.2)
Print "ArrayTable time :"+(MilliSecs()-time)

time:Int=MilliSecs()
PastePixmap_MemoryTable(PixmapD,0.6,Pixmap2,0.2)
Print "MemoryTable time :"+(MilliSecs()-time)

time:Int=MilliSecs()
p_argb8888(PixmapE,0.6,Pixmap2,0.2)
Print "Straight way time :"+(MilliSecs()-time)

Function PastePixmap_ReadWritePixel(PixmapBottom:TPixmap, _opacityBottom:Float, PixmapTop:TPixmap, _opacityTop:Float)
	
	For Local y:Int=0 Until PixmapBottom.height
		For Local x:Int=0 Until PixmapBottom.width
			
			Local BGRAtop:Int = ReadPixel(PixmapTop,x,y)
			Local Btop:Int = BGRAtop & $FF
			Local Gtop:Int = (BGRAtop Shr 8) & $FF
			Local Rtop:Int = (BGRAtop Shr 16) & $FF
			Local Atop:Int = ((BGRAtop Shr 24) & $FF) * _opacityTop
			
			Local BGRAbottom:Int = ReadPixel(PixmapBottom,x,y)
			Local Bbottom:Int = BGRAbottom & $FF
			Local Gbottom:Int = (BGRAbottom Shr 8) & $FF
			Local Rbottom:Int = (BGRAbottom Shr 16) & $FF
			Local Abottom:Int = ((BGRAbottom Shr 24) & $FF) * _opacityBottom

			Local A:Float =  255-(255-Atop)*(255-Abottom)/255
			Local R:Int = Atop*Rtop/A + (Abottom*Rbottom*((255-Atop)/A))/255
			Local G:Int = Atop*Gtop/A + (Abottom*Gbottom*((255-Atop)/A))/255
			Local B:Int = Atop*Btop/A + (Abottom*Bbottom*((255-Atop)/A))/255
			
			WritePixel(PixmapBottom,x,y, B | (G Shl 8) | (R Shl 16) | (Int(A) Shl 24))
		Next
	Next
EndFunction




Function PastePixmap_Normal(PixmapBottom:TPixmap, _opacityBottom:Float, PixmapTop:TPixmap, _opacityTop:Float)
	
	Local bottomPtr:Byte Ptr = PixmapPixelPtr(PixmapBottom)
	Local topPtr:Byte Ptr = PixmapPixelPtr(PixmapTop)
	
	
	For Local y:Int=0 Until PixmapBottom.height
		For Local x:Int=0 Until PixmapBottom.width
			
			Local Atop:Int = topPtr[3]*_opacityTop
			Local Abottom:Int = bottomPtr[3]*_opacityBottom
			Local A:Float =  255-(255-Atop)*(255-Abottom)/255
			bottomPtr[0] = Atop*topPtr[0]/A + (Abottom*bottomPtr[0]*((255-Atop)/A))/255
			bottomPtr[1] = Atop*topPtr[1]/A + (Abottom*bottomPtr[1]*((255-Atop)/A))/255
			bottomPtr[2] = Atop*topPtr[2]/A + (Abottom*bottomPtr[2]*((255-Atop)/A))/255
			bottomPtr[3] = Int(A)
			topPtr:+4
			bottomPtr:+4
		Next
	Next
EndFunction	
	
	

Function PastePixmap_ArrayTable(PixmapBottom:TPixmap, _opacityBottom:Float, PixmapTop:TPixmap, _opacityTop:Float)
	
	Local bottomPtr:Byte Ptr = PixmapPixelPtr(PixmapBottom)
	Local topPtr:Byte Ptr = PixmapPixelPtr(PixmapTop)

	Local opacityBottom:Int = _opacityBottom*255
	Local opacityTop:Int = _opacityTop*255
	
	For Local y:Int=0 Until PixmapBottom.height
		For Local x:Int=0 Until PixmapBottom.width
			
			Local Atop:Int = DivideTable[topPtr[3]*opacityTop][255]
			Local Abottom:Int = DivideTable[bottomPtr[3]*opacityBottom][255]
			Local A:Int =  255-DivideTable[(255-Atop)*(255- Abottom)][255]
			Local div:Int = DivideTable[Abottom*(255-Atop)][A]
			
			bottomPtr[0] = DivideTable[Atop*topPtr[0]][A] + DivideTable[bottomPtr[0]*div][255]
			bottomPtr[1] = DivideTable[Atop*topPtr[1]][A] + DivideTable[bottomPtr[1]*div][255]
			bottomPtr[2] = DivideTable[Atop*topPtr[2]][A] + DivideTable[bottomPtr[2]*div][255]
			bottomPtr[3] = A

			topPtr:+4
			bottomPtr:+4
		Next
	Next
EndFunction


Function PastePixmap_MemoryTable(PixmapBottom:TPixmap, _opacityBottom:Float, PixmapTop:TPixmap, _opacityTop:Float)
	
	Local bottomPtr:Byte Ptr = PixmapPixelPtr(PixmapBottom)
	Local topPtr:Byte Ptr = PixmapPixelPtr(PixmapTop)
	

	Local opacityBottom:Int = _opacityBottom*255
	Local opacityTop:Int = _opacityTop*255
	
	For Local y:Int=0 Until PixmapBottom.height
		For Local x:Int=0 Until PixmapBottom.width
			
			Local Atop:Int = DivideMemory[topPtr[3]*opacityTop + 255*65536]
			Local Abottom:Int = DivideMemory[bottomPtr[3]*opacityBottom + 255*65536]
			Local A:Int =  255-DivideMemory[(255-Atop)*(255- Abottom) + 255*65536]
			Local div:Int = DivideMemory[Abottom*(255-Atop) + A*65536]
			
			bottomPtr[0] = DivideMemory[Atop*topPtr[0]+ A*65536] + DivideMemory[bottomPtr[0]*div + 255*65536]
			bottomPtr[1] = DivideMemory[Atop*topPtr[1]+ A*65536] + DivideMemory[bottomPtr[1]*div + 255*65536]
			bottomPtr[2] = DivideMemory[Atop*topPtr[2]+ A*65536] + DivideMemory[bottomPtr[2]*div + 255*65536]
			bottomPtr[3] = A

			topPtr:+4
			bottomPtr:+4
		Next
	Next
EndFunction

Function p_argb8888(pix1:TPixmap,af1:Float,pix2:TPixmap,af2:Float)
	Local p1:Byte Ptr = pix1.pixels
	Local p2:Byte Ptr = pix2.pixels
	
	a1=af1*255 ; a2=af2*255
	a=255-(255-a1)*(255-a2)/255
	
	k=256*a1/a
	p=256*a2*(255-a1)/255/a
	z=pix1.pitch/pix1.width
	
	i=0
	While i<=pix1.capacity
		p1[i+0] = a
		p1[i+1] = (p1[i+1]*k) Shr 8 + (p2[i+1]*p) Shr 8 
		p1[i+2] = (p1[i+2]*k) Shr 8 + (p2[i+2]*p) Shr 8 
		p1[i+3] = (p1[i+3]*k) Shr 8 + (p2[i+3]*p) Shr 8 
		
		i:+z
	Wend
EndFunction

Function p_bgr888(pix1:TPixmap,af1:Float,pix2:TPixmap,af2:Float)
	Local p1:Byte Ptr = pix1.pixels
	Local p2:Byte Ptr = pix2.pixels
	
	a1=af1*255 ; a2=af2*255
	a=255-(255-a1)*(255-a2)/255
	
	k=256*a1/a
	p=256*a2*(255-a1)/255/a
	z=pix1.pitch/pix1.width

	i=0
	While i<=pix1.capacity
		p1[i+1] = (p1[i+1]*k) Shr 8 + (p2[i+1]*p) Shr 8 
		p1[i+2] = (p1[i+2]*k) Shr 8 + (p2[i+2]*p) Shr 8 
		p1[i+0] = (p1[i+0]*k) Shr 8 + (p2[i+0]*p) Shr 8 
		
		i:+z
	Wend
EndFunction


1.6 Ghz, DDR2 266 Mhz
WritePixel/ReadPixel time :14093
Normal time :13762
ArrayTable time :198
MemoryTable time :188
Straight way time :76


Last edited 2012


Jur(Posted 2012) [#12]
Hey, that one is really fast, but it doesnt support alpha channels. Alphas make combining two images more complicated.
This is the common function for merging two images with alpha:
C = Cbottom*(1-Atop/A) + (Atop/A)*(Ctop*(1-Abottom) + Abottom* BLEND(Ctop,Cbottom))
For alpha blending BLEND(Ctop,Cbottom) = Ctop
Now good luck with converting that function to use only bit operations :)


Jur(Posted 2012) [#13]
I did some more optimization work. I looked at fixed point math and then saw that it is not really needed because I dont need fractional part. I always divide color values like value1*value2/value3 and never value1/value2.
I found one great formula which allows more optimization:
C/255 = (C+1 + (C Shr 8)) Shr 8
This formula is valid for integers from 0 to 65025 which is enough when multiplying two color with byte range. As dividing by 255 appears quite often, this formula allows to work mostly with integer math.




Results:

WritePixel/ReadPixel time:1462
Normal time:964
ArrayTable time:528
MemoryTable time:485
Normal Optimized:445
MemoryTable Optimized:440

It can be seen that this new optimization give a nice speed boost.

By the way, the results (merged images) of all these functions were checked in Photoshop and they provide correct alpha blending.

Last edited 2012


BinaryBurst(Posted 2012) [#14]
Try this one:
Function p_argb8888(pix1:TPixmap,af1:Float,pix2:TPixmap,af2:Float)
	Local p1:Byte Ptr = pix1.pixels
	Local p2:Byte Ptr = pix2.pixels
	
	z=pix1.pitch/pix1.width
	
	i=0
	While i<=pix1.capacity
	
		a = (p1[i+0]*p2[i+0]*a1*a2) Shr 24 + 1
		k = p1[i+0] Shl 8 / a
		p = (p1[i+0]*p2[i+0]) / a
		
		p1[i+1] = ( p1[i+1] * k + p2[i+1] * p ) Shr 8 
		p1[i+2] = ( p1[i+3] * k + p2[i+2] * p ) Shr 8 
		p1[i+3] = ( p1[i+2] * k + p2[i+3] * p ) Shr 8 
		p1[i+0] = a
		
		i:+z
	Wend
EndFunction

WritePixel/ReadPixel time:1710
Normal time:1053
ArrayTable time:416
MemoryTable time:431
Normal Optimized:224
MemoryTable Optimized:225
p_argb8888 time: 198


Last edited 2012


Jur(Posted 2012) [#15]
I changed "a1" to "af2" and "a2" to "af2" to make it work, but it doesnt give correct results on real images. In fact that equation for alpha (a) is weird. With shr 24 you divide with 2^24 and the result is zero. ??
I dont think much more speed gain is posssible in BlitzMax, so I may try to convert some functions to c++.

Last edited 2012


Yasha(Posted 2012) [#16]
I may be being stupid, but if you mainly want to avoid integer-divide-by-zero exceptions, can't you just do this:

Function PastePixmap_Fast(PixmapBottom:TPixmap, _opacityBottom:Float, PixmapTop:TPixmap, _opacityTop:Float)
	
	Local bottomPtr:Byte Ptr = PixmapPixelPtr(PixmapBottom)
	Local topPtr:Byte Ptr = PixmapPixelPtr(PixmapTop)
	
	
	For Local y:Int=0 Until PixmapBottom.height
		For Local x:Int=0 Until PixmapBottom.width
			
			Local Atop:Int = topPtr[3]*_opacityTop
			Local Abottom:Int = bottomPtr[3]*_opacityBottom
			Local A:Int =  255-(255-Atop)*(255-Abottom)/255
			If A
				bottomPtr[0] = Atop*topPtr[0]/A + (Abottom*bottomPtr[0]*((255-Atop)/A))/255
				bottomPtr[1] = Atop*topPtr[1]/A + (Abottom*bottomPtr[1]*((255-Atop)/A))/255
				bottomPtr[2] = Atop*topPtr[2]/A + (Abottom*bottomPtr[2]*((255-Atop)/A))/255
				bottomPtr[3] = Int(A)
			Else
				bottomPtr[0] = 0
				bottomPtr[1] = 0
				bottomPtr[2] = 0
				bottomPtr[3] = 0
			EndIf
			topPtr:+4
			bottomPtr:+4
		Next
	Next
EndFunction


Float division by zero still results in invalidated operations, which in turn become 0 when cast back to int (since you're making no attempt to actually deal with the infinity-value), so if the divisor is 0 just don't perform the operation at all, for the same result.


col(Posted 2012) [#17]
I dont think much more speed gain is posssible in BlitzMax, so I may try to convert some functions to c++.


Hiya,

My 2 cents worth...

If you've not done it already, look here for compiler optimisations

Of course SIMD is the fastest way to go with image manipulations, and it can all be done within BlitzMax too using either imported assembler, or 'c with inline assembler' files. The compiler optimisation setting will already produce SIMD code, but I'd double check the output as it may be possible to improve it, or at least reduce the instruction count.

edit:- where mentioned SIMD, I mean SSE which is SIMD extensions.

Last edited 2012


Jur(Posted 2012) [#18]
@Yasha
I want to make code as fast as possible and nothing else. For divide with zero you have two options: keep variable as float and divide (the result is not a valid number, but it doesnt matter what is written instead, as that zero is alpha, which means transparent pixel), or use an integer variable and prevent division with an if statement. Which is faster? In my test the second way was a little faster. A little.
The key for a bigger speed boost is to use integers AND divide as little as possible! Well I guess it makes sense that using simple and faster instructions will give faster code.

@col
Sounds cool, I will try different parameters and see if there will be difference. As for the assembler stuff, well I know it is fast, but that is the only thing I know about it :).


Yasha(Posted 2012) [#19]
EDIT: Stupid suggestion I didn't check properly before posting, ignore this

Last edited 2012