mmx image blending benchmark

zzz(Posted 2012) [#1]
Hi, I'm curious to see how this code performs on various machines. It is related to a small project I'm considering starting.

Windows only because of the assembly code involved. I'm not sure how things work on the Mac and Linux side, so I can't do much about that.

The files need to be in the same directory. The program saves the resulting blended images as PNGs, so take a look and check that they all look the same (they should).

Please post your results :)

EDIT: threading involved, so make sure to enable that!

something.bmx


blend.s


Last edited 2012


zzz(Posted 2012) [#2]
My own results on a Phenom II X6 at 2.8 GHz:




BlitzSupport(Posted 2012) [#3]
Here are my results. The speeds seem quite variable -- the first test varies by a whole second! (There's nothing heavy running in the background.)






Derron(Posted 2012) [#4]
For Linux, the ".s" file has to be configured with:
format ELF
instead of:
format MS COFF

But afterwards you get errors complaining about missing functions: the three defined in the asm file.

bye
Ron


ImaginaryHuman(Posted 2012) [#5]
Wow, I haven't seen people writing assembly code for ages. It doesn't surprise me too much that writing stuff in assembler is *several* times faster than code converted from a higher level language. If assembly weren't so... long-winded, it would be so much more appealing. Unfortunately, I can't imagine writing anything of significant size in it.

Brings me back to the Amiga days. Ahhhh 68k.


Yasha(Posted 2012) [#6]
"It doesn't surprise me too much that writing stuff in assembler is *several* times faster than code converted from a higher level language."


Only if a) you're really, really good at writing assembler, and b) your HLL compiler isn't optimising very well.

You're flat-out never going to write better assembly by hand than the GCC backend will.


col(Posted 2012) [#7]
Results from Sony Vaio VGN-FW31M.
Well done, the speed increase is well worth it.

MT/MMX
1024x1024 - 8.25x
2048x2048 - 10.25x
4096x4096 - 12.30x

The larger the image, the greater the gains. Have you considered SSE2/3 to take advantage of the 128-bit instructions?




zzz(Posted 2012) [#8]
@col
I just assumed SSE was all about floating point stuff. I'll write a version using SSE2 when I have some time left over to see how it performs. The multithreaded MMX part already hits max bandwidth on my system though, so I wouldn't expect much of a speed improvement. Might be some headroom left on other systems. :)
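
For reference, SSE2 does add integer operations on the 128-bit registers, so a byte-wise blend can process four 32-bit pixels per instruction. Below is a minimal sketch in C intrinsics, assuming a plain 50/50 average blend; the actual blend formula in blend.s is not shown in this thread, and the function name blend_average is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = rounded average of a[i] and b[i], per byte.
   Assumption: a 50/50 blend, which may differ from what blend.s does. */
void blend_average(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    assert(dst && a && b);
#if defined(__SSE2__)
    /* 16 bytes (four 32-bit pixels) per iteration, using unaligned access. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PAVGB: per-byte (x + y + 1) >> 1 */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    /* Scalar tail (and the whole buffer on targets without SSE2). */
    for (; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

Calling blend_average(dst, a, b, width * height * 4) would blend two 32-bit images of the same size. The unaligned loads and stores are used so that the sketch makes no assumption about buffer alignment.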

Last edited 2012


ImaginaryHuman(Posted 2012) [#9]
Oh, no way, there is no way an automated optimizer can do a better job than hand-written assembler by a skilled programmer.


Yasha(Posted 2012) [#10]
Yeah, if said "skilled programmer" is a professor at MIT who's spent the last twenty years directing optimisation research and writing LLVM components, they could probably do a better job. Because they know everything that went into the automated system. Thousands upon thousands of rewrite rules, machine-specific tweaks, etc.

The average engineer programmer? Absolutely, definitely not. They will, however, almost certainly think they can do better (this is very common: the vast majority of people have no idea just how complicated hardcore optimisation is), and then accuse the compiler of cheating somehow when they inevitably fail.

The main reason this is true is that optimisation isn't about skipping a few mul instructions here, or tweaking a vtable there, any more. Nowadays optimisation involves heavy rewriting from algorithmic form, and is usually completely unrecognisable at the assembly level. It's too complex to keep in one's head (if it isn't, either the program is trivial or the developer isn't thinking at the algorithmic level, in which case they'll produce very fast minicomponents that don't work well together because it's impossible to design a whole program at that granularity).


...please, don't be insulted (too many people get in a huff). Claiming a human can write better assembly code than a compiler is like saying a human has faster reaction times than an aimbot. It's not only wrong, it's conceptually ludicrous. We just happen to exist at the turning-point of the art where there are still a very small, ever-shrinking number of programmer savants who can work at that level. In five years that breed of coder will be extinct (in twenty years they will start to admit it).

Last edited 2012


ImaginaryHuman(Posted 2012) [#11]
Really? Check out any `demo` from the demo scene where clever use of assembly language has provided massively faster performance, to such a degree that you have to ask yourself how it can be possible.


zzz(Posted 2012) [#12]
Here's an SSE2 version along with some small issues corrected. It's not the nicest assembly code, but I'm not sure if we can trust pixmap surfaces to be 16-byte aligned?
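
The alignment question can be checked at run time rather than trusted. A minimal sketch in C, with made-up helper names: one function tests whether a pointer sits on a 16-byte boundary, the other reports how many leading bytes would need scalar handling before aligned 128-bit access becomes possible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: bytes to skip from p to reach the next
   16-byte boundary (0 if p is already aligned). */
size_t bytes_until_aligned16(const void *p)
{
    return (size_t)(-(uintptr_t)p & 15u);
}
```

If is_aligned16 fails for either pixmap, the code can fall back to unaligned loads and stores instead of faulting on an aligned instruction.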

something.bmx


blend.s



zzz(Posted 2012) [#13]
Bandwidth limited just as I guessed :) I get almost the same result with only three threads.

Phenom II X6 @ 2.8 GHz




BlitzSupport(Posted 2012) [#14]



col(Posted 2012) [#15]
Marginally faster, and yes, TPixmap memory uses MemAlloc, which is aligned on 16-byte boundaries.




zzz(Posted 2012) [#16]
That is nice to know. Here's some new assembly. Less branching, more parallelization and a fancy new memory write instruction I found in the references. I'm all out of ideas now though, so I guess I'll clean up the code. This did give way better performance than I expected from it :)
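
The post does not name the write instruction. One SSE2 store that fits the description is the non-temporal store MOVNTDQ, which writes around the cache; that is a guess, not something this thread confirms. A minimal sketch of what such a store looks like in C intrinsics, with a made-up function name and a plain copy standing in for the blend:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copy n bytes into a 16-byte-aligned dst using
   streaming (non-temporal) stores where SSE2 is available. */
void copy_streaming(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* _mm_stream_si128 requires a 16-byte-aligned destination. */
    assert(((uintptr_t)dst & 15u) == 0);
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ */
    }
    _mm_sfence();  /* order the streamed writes before later stores */
#endif
    memcpy(dst + i, src + i, n - i);  /* tail, or everything without SSE2 */
}
```

Streaming stores tend to help when the destination is large and not read again soon, which matches a one-pass blend into a big output image.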

blend.s


My results:



col(Posted 2012) [#17]
Yep... Faster still :)

Multithreaded MMX/SSE2 blend: 208 ms. (17.45x) memory: 9.169/4.584 GB/s
Saving result to blend_mtmmx.png
Done


EDIT:-
Although using 2048x2048 I'm getting...

Singlethreaded MMX/SSE2 blend: 64 ms. (14.28x) memory: 7.812/3.906 GB/s
Saving result to blend_stmmx.png
Done

Multithreaded MMX/SSE2 blend: 71 ms. (12.87x) memory: 7.042/3.521 GB/s
Saving result to blend_mtmmx.png
Done
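
The second figure in each "memory:" pair of the 2048x2048 runs is consistent with the arithmetic below, assuming 4 bytes per pixel, 16 passes over the image, and units of GiB (2^30 bytes). None of these is stated in the thread; they are inferred from the numbers. The first figure is twice the second, which would match two source images read per image written. The helper name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: throughput in GiB/s for `bytes` moved in `ms`
   milliseconds. */
double gib_per_second(uint64_t bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / (ms / 1000.0)) / (1024.0 * 1024.0 * 1024.0);
}
```

With bytes = 2048 * 2048 * 4 * 16 = 268435456, this gives about 3.906 for 64 ms and about 3.521 for 71 ms, matching the write figures reported above.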

Last edited 2012


ImaginaryHuman(Posted 2012) [#18]
Try interleaving two threads running on the same block of memory in parallel, i.e. each one accesses every other integer. It's good for the cache, or at least that's what I found.


xlsior(Posted 2012) [#19]
With the original code in your 1st posting, I'm seeing:




xlsior(Posted 2012) [#20]
And the last one:



(Using a 6-core Intel i7-3930K processor, btw)


zzz(Posted 2012) [#21]
@ImaginaryHuman
I'd have to do interleaved writes too then, which wouldn't be very good at all.
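
The alternative to interleaving is giving each thread one contiguous band of rows, so that no two threads write into the same cache lines. A minimal sketch in C of that split; how something.bmx actually divides the work is not shown in this thread, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: thread t of nthreads handles rows [*first, *last).
   Bands are contiguous, cover every row once, and differ by at most one row. */
void thread_rows(size_t height, size_t nthreads, size_t t,
                 size_t *first, size_t *last)
{
    assert(nthreads > 0 && t < nthreads);
    *first = height * t / nthreads;
    *last  = height * (t + 1) / nthreads;
}
```

For a 4096-row image and 6 threads, thread 0 gets rows 0 to 681 and thread 5 gets rows 3413 to 4095.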

Since this seems to be memory bandwidth bound now I'm going to let it be for a while. Think I'll try and make a faster memcopy function and see if that will result in anything that would be useful for this.

And xlsior, that is one fast CPU :D What are the clock speed and the number of active memory channels?


xlsior(Posted 2012) [#22]
CPU: stock speed 3.2 GHz, 3.8 GHz burst, 6 cores + Hyper-Threading.
RAM: 64 GB quad-channel 1600 MHz DDR3