32 bit math?


marksibly(Posted 2016) [#1]
Hi,

I'm just cleaning up the math stuff and wondering if there's any point adding 32 bit float versions of sin, cos, floor, etc, or should these all just use doubles?

In terms of passing/returning function params, floats/doubles will use the same FP registers (AFAIK) so I can't see there's any extra overhead there.

32 bit floats are IMO still a good idea for storage (vars, fields etc) as they're smaller (ie: less memory bandwidth) but when computing intermediate values is there any point anymore?

C++ still has float/double versions of sin, cos etc, but that could just be a legacy thing.

[edit]
Doubles significantly faster?
[/edit]
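
A rough sketch of the kind of benchmark being discussed (the thread's actual test code isn't preserved here; this assumes C++11 <chrono> and a test<T> template like the one quoted in later posts - the real version also had a 'floatf' variant calling sinf):

#include <chrono>
#include <cmath>
#include <cstdio>

template<class T> double test(){
	auto start=std::chrono::high_resolution_clock::now();
	T sum=0;
	for( T t=0;t<360;t+=.0001 ) sum+=std::sin( (T)t );
	std::chrono::duration<double> elapsed=std::chrono::high_resolution_clock::now()-start;
	printf( "sum=%f\n",double( sum ) );	// print sum so the compiler can't remove the sin calls
	return elapsed.count();
}

int main(){
	printf( "float:%f\n",test<float>() );
	printf( "double:%f\n",test<double>() );
}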




nullterm(Posted 2016) [#2]
Think it's platform dependent - some may run faster with the 32 bit version (which is what I'm used to using 99.99% of the time), while on others the 64 bit double might be faster.

For 32bit, can you use the sinf/cosf functions and test if there's any benefit?

Curious what std::sin/cos does under the hood - whether it uses the double version and converts, or whether std::sin( float ) uses the raw 32 bit one?
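
For reference, the calls in question look like this (an illustrative snippet, not code from the thread - std::sin is overloaded on the argument type rather than templated):

#include <cmath>

int main(){
	float a=std::sin( 1.0f );	// resolves to the float overload of std::sin
	double b=std::sin( 1.0 );	// resolves to the double overload
	float c=sinf( 1.0f );		// C99-style float routine, typically the same code as the float overload
	return ( a+b+c>0 ) ? 0 : 1;	// use the results so they aren't optimized away
}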


impixi(Posted 2016) [#3]
Interesting. On my iMac (late 2015 edition, 6th Gen Intel i5 CPU - I think) I see over 3X difference. Also see over 3X difference on my notebook computer (Windows 10, 5th Gen Intel i7 CPU).

EDIT: Deleted my error... But never forgotten...


marksibly(Posted 2016) [#4]
> I see over 3X difference

In favour of double I assume?

I'm pretty surprised - even with 'sum' declared as a float, the double version is still faster. Seems like C++ is doing something weird here.


impixi(Posted 2016) [#5]
Yes, in favour of double.


marksibly(Posted 2016) [#6]
> For 32bit, can you use the sinf/cosf function and test if there's any benefit?

Good point, but it doesn't make any noticeable difference - sinf and std::sin( float ) seem to do the same thing, which I'd always assumed (but never bothered checking!). Test code above updated.


FelipeA(Posted 2016) [#7]
Tested it on raspberry pi 2 and double was 2x faster. So yeah!


marksibly(Posted 2016) [#8]
> Tested it on raspberry pi 2 and double was 2x faster

Wow! That would have been the target I'd be most concerned about!


GW_(Posted 2016) [#9]
I'm not sure I understand all the ramifications of eliminating the 32 bit float versions. I do a lot of statistical stuff and machine learning with Bmax and frequently need to connect with other code in C or DLLs.
I'd like to port some of my stuff over when the language stabilizes. Will your proposal affect that? Is Monkey2 designed to build both 32 and 64 bit apps?


marksibly(Posted 2016) [#10]
There shouldn't be any ramifications beyond float math functions being faster and, I guess, more accurate!

In fact, I just tested 32/64 bit versions on Windows: in 32 bit builds doubles are 3X faster, and in 64 bit builds 5X! Built with:

g++ -std=c++11 -O3 -m32 t.cpp
g++ -std=c++11 -O3 -m64 t.cpp

A bit worried I'm 'missing something' here though...

> BTW: <chrono> is not implemented under Windows, IIRC.

Seems to work if you use -std=c++11.


marksibly(Posted 2016) [#11]
Ok, my bad, the compiler was removing the 'sin' calls since sum is never used!

Fixed now and the results are closer to what I'd expect - on Windows, doubles are still a little bit faster, while on macOS they're a teeny bit slower.

But in neither case are the calls to sin inlined, so it's probably possible to make them even faster with the right compiler options...

Can you test on raspberry pi again, ilovepixel?

Also updated test code above.
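
A common way to guard against that kind of dead code elimination (an illustrative aside, not the thread's actual fix) is to give the result an observable side effect, e.g. a volatile sink:

#include <cmath>

volatile double sink;	// writes to a volatile can't be optimized away

int main(){
	double sum=0;
	for( double t=0;t<360;t+=.0001 ) sum+=std::sin( t );
	sink=sum;	// forces the compiler to actually compute the loop
}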


FelipeA(Posted 2016) [#12]
Hi Mark,
In the templated test, did you mean to cast to T or to double? Because if you run test<float>() it won't run sin/cos/abs for double, but for float.


marksibly(Posted 2016) [#13]
Cast. This tests the std::sin overloads for both float and double.


impixi(Posted 2016) [#14]
>> BTW: <chrono> is not implemented under Windows, IIRC.
> Seems to work if you use -std=c++11.

OMG. I can't believe I forgot to use that switch.


> Fixed now and the results are closer to what I'd expect - on Windows, doubles are still a little bit faster, while on macOS they're a teeny bit slower.



I see those results here too...


FelipeA(Posted 2016) [#15]
> Cast. This tests the std::sin overloads for both float and double.


I was asking because the cast seemed redundant.

These are the overall results on raspberry pi 2:
float:41.645984
floatf:41.605996
double:15.027483


It's pretty good!

Edit:

It looks like if I printf the _tmp value at the end I get almost the same performance for double and float, so I think the compiler was optimizing the variable away. A thing I noticed while analyzing the assembly is that for some reason it is converting from float to double when doing the 32 bit operation.

test<float>
		for (T t = 0; t<360; t += .0001)
00007FF6CF8011CB  cvtps2pd    xmm1,xmm6  
		{
			sum += std::sin((T)t);
00007FF6CF8011CE  addss       xmm7,xmm0  
00007FF6CF8011D2  addsd       xmm1,xmm8  
00007FF6CF8011D7  cvtpd2ps    xmm6,xmm1  
00007FF6CF8011DB  comiss      xmm9,xmm6  
00007FF6CF8011DF  ja          test<float>+0C3h (07FF6CF8011C3h)  


As you can see, if it didn't use the "cvtps2pd" and "cvtpd2ps" instructions in the float test it would be almost the same as the double one. These instructions cost about 8 cycles. It's not much, but over a lot of iterations it can be noticeable.

test<double>
			sum += std::sin((T)t);
00007FF659B51373  movaps      xmm0,xmm6  
00007FF659B51376  call        sin (07FF659B52444h)  
00007FF659B5137B  addsd       xmm6,xmm8  
00007FF659B51380  addsd       xmm7,xmm0  
00007FF659B51384  comisd      xmm9,xmm6  
00007FF659B51389  ja          test<double>+0C3h (07FF659B51373h)  
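

One plausible cause of those conversions (an editorial guess, not from the thread): .0001 is a double literal, so with T = float the increment t += .0001 widens t to double, adds, and narrows back every iteration - which matches the cvtps2pd / addsd / cvtpd2ps sequence around the loop counter above. A T-typed constant would keep the arithmetic in float:

	for( T t=0;t<360;t+=T(.0001) ){	// hypothetical tweak: float-typed increment, no double round trip
		sum+=std::sin( (T)t );
	}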


I also think the code benefits from auto vectorization, which is usually compiler dependent. For example, when I tried the same code with mingw64-gcc it gave almost the same performance for floats and doubles. Also, if at some point you add a branch in the middle you may mess up auto vectorization.

I haven't looked in detail at the output of raspberry pi yet. I'll look at it during the weekend.


marksibly(Posted 2016) [#16]
> It looks like if I printf the _tmp value at the end I get almost a similar performance between double and float. I think maybe the compiler is optimizing the variable.

Yep, looks like it. The pi compiler (linux g++?) apparently does a better job than mingw or llvm!

> A thing a noticed while analyzing the assembly is that for some reason is converting from float to double when doing the 32 bit operation.

Yeah, I noticed this too. It seems unnecessary but I assume they know what they're doing...?

Another thing I noticed is that the compiler never inlined 'sin' - it was always a function call (missing from your first loop due to _tmp being optimized out?). I assume the FPU has sin though and it could just be a case of finding the right compiler switches. I tried a few but didn't get far.

In fact, if the compiler is generating a function call then that would explain the double/float conversion as the parameter needs to be in the correct format.

But I'm not even sure what FPUs are common these days! MMX? SSE? Which is better etc? Anyone know more about this?
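
For what it's worth (an illustrative aside, not from the thread): with GCC, relaxing strict IEEE semantics and targeting the host CPU sometimes lets the compiler use faster code paths for the math calls, at the cost of bit-exact results:

g++ -std=c++11 -O3 -ffast-math -march=native t.cpp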


FelipeA(Posted 2016) [#17]
At least msvc is using the xmm vector registers (xmm0 - xmm7, plus xmm8 - xmm15 on 64 bit) for floating point math; mingw-gcc is not. The funny thing is that while debugging I noticed it's not really vectorizing the code, just using a 128 bit register to do operations on a single 32/64 bit value, which seems like a total waste. I am pretty sure the code could go up to 2x faster if properly vectorized. But of course it's a very specific case, so it wouldn't work as a general "compiler optimization".


dawlane(Posted 2016) [#18]
The output I'm getting on a 1st Gen i7 with GCC 4.9.3 (Ubuntu 4.9.3-8ubuntu2~14.04) is
32bit
float:0.575295
floatf:0.531015
double:1.626234

64bit
float:0.514282
floatf:0.513274
double:1.375198

You can find out the defaults that GCC uses with:
g++ -Q --help=target
g++ -Q --help=optimizers

You would need to see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for what is invoked with the -O3 switch.
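
For example, to see which SIMD extensions the compiler would enable for the local machine (an illustrative command along the same lines, not from the thread):

g++ -march=native -Q --help=target | grep -E "m(sse|avx)"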

> But I'm not even sure what FPUs are common these days! MMX? SSE? Which is better etc? Anyone know more about this?
I would think it's more a case of what instructions a CPU supports than what is common. I would think the Core 2 era SSE instruction set will be the most common for a few more years.