Sqr

BlitzMax Forums/BlitzMax Programming/Sqr

JBR	(Posted 2013) [#1]

Hi, I'm using a lot of these.

They take and return a double but I only need single precision.

Any way to speed it up?

(I dropped Cos/Sin for arrays and got a decent speed upgrade)

The range of values is from 0.0 to 1500

Any ideas?

Jim

Yasha

(Posted 2013) [#2]

Standard optimisation questions: Is this function actually taking up an unreasonable amount of time? Is it the highest priority target for optimisation? Can you make an algorithmic change instead (e.g. move the call out of a loop body)? Have you actually profiled the code? (There is no point making optimisations based on guesses about efficiency, guesses are always wrong.)

Having checked off all of the above... you'll hit the standard answer of "it depends what you want to do". How are you using the call? I doubt you'll find a solution that preserves the accuracy of answers and the generality of use, or the function would already use that method, so it depends on which tradeoffs you can afford to make.

EDIT: sorry, my above comment is stupid, I didn't bother to actually answer the question! - to get a single-precision version of Sqr, just grab the other one from the C math library:

Extern
Function SqrF:Float( x:Float )="sqrtf"
End Extern

Print SqrF(36)

The library itself is being imported anyway to provide the normal version of Sqr.

Floyd

(Posted 2013) [#3]

Be sure you actually need Sqr. The most common use is to calculate distance or length. If you really need these values the Sqr is necessary. But it is often the case that you are only comparing such values, such as to find out which object is closest by comparing distances. Then you can skip Sqr and just compare the squares of the distances.

This reminds me of the early days of Blitz3D. It uses single precision, as does Direct3D. The FPU was thus in single precision mode, doing less precise calculations and taking a little less time. That speeds up arithmetic, i.e. addition, subtraction, multiplication and division. It also speeds up square roots. But all the other mathematical functions, such as Sin ATan and Log, are unaffected. Those are calculated in their full 80-bit glory in any case.

As I recall the single precision speedup was modest, something like 15% for the affected operations. Of course it was 0% for everything else.

John G

(Posted 2013) [#4]

If I use Sin often and don't need ultimate precision, I pre-calculate an array with Integer Degrees (0 to 359) as index and Single Float output.

JBR	(Posted 2013) [#5]

Thanks folks, I definitely cannot avoid the sqr as much as I am doing.

I'm in the process of 'making the game go faster' so it will play at a decent frame rate on 'older' pcs.

I'll try the SqrF method and see ...

Thanks
Jim

JBR	(Posted 2013) [#6]

ok, no difference with SqrF

the line of code is

(cos_start_ang# + (Sqr(vx#*vx# + vy#*vy#)/divisor#) * cos_length# )

I don't think I can simplify

Jim

Floyd

(Posted 2013) [#7]

ok, no difference with SqrF

As far as I know there is literally no difference. On a PC the sqrtf function is really just a call to the usual sqrt. This may not be true on other platforms.

It is possible to set the FPU to single precision mode and maybe squeeze out a little extra speed.

Multiplication is faster than division. If you are doing many divisions with divisor# not changing then you are better off storing the reciprocal with rdivisor# = 1.0/divisor#. Then replace /divisor# with *rdivisor.

JBR	(Posted 2013) [#8]

Thanks
Jim

zzz	(Posted 2013) [#9]

Care to give us more code or context to work with?

GW	(Posted 2013) [#10]

This is what I use for a fast square root. It's good enough for distance checks, but if you use it for drawing you might see little bit of warping.

inline float fastSqrt(const float x)  
{
  union
  {
    int i;
    float x;
  } u;
  u.x = x;
  u.i = (1<<29) + (u.i >> 1) - (1<<22); 
  return u.x;
}

JBR	(Posted 2013) [#11]

GW, thanks, but how do I use this in BMax?

Jim

GW	(Posted 2013) [#12]

You need to have Mingw installed to compile the C code. Bmax will link the function when it compiles.

Look at the docs under 'Language'->'Advanced topics'->'Interfacing with C'

JBR	(Posted 2013) [#13]

Thanks GW, gained a little bit of speed.

Looking at it I don't understand how it works?

Jim

matibee

(Posted 2013) [#14]

The full explanation is here Jim:

http://en.wikipedia.org/wiki/Fast_inverse_square_root

Often called the Carmack square root because of its use in Quake, but it's actually older than that. I love the commenting of the original code - very, very apt :)

Floyd

(Posted 2013) [#15]

It exploits the way floating point numbers are represented with an exponent and mantissa. Roughly speaking, the exponent is divided by two by shifting one bit to the right, as is the fractional part of the mantissa.

The result is exact when x is an even power of two: 2^0, 2^2, 2^4... = 1, 4, 16...
It is most inexact (error about 6%) for odd powers of two: 2^1, 2^3, 2^5... = 2, 8, 32...

Yasha

(Posted 2013) [#16]

Interestingly that article also says (I am content to take their word for it) that this Thing of Beauty is still slower than using SSE's square root instruction.

What are the odds of your code being run on something lower-end than a Pentium III? If it's a game made in this day and age it almost certainly needs better than that anyway, so perhaps an SSE version is the best option? (One wonders why libc wouldn't do this already..?)

Floyd

(Posted 2013) [#17]

I'm mildly surprised that a BlitzMax version of fastSqrt runs about 90% as fast as the C version.

Function fastSqrt:Float( x:Float )

	Local fp:Float Ptr = Varptr x
	Local ip:Int Ptr = Int Ptr fp
	
	ip[0] = ( ip[0] Sar 1 ) + $1FC00000		' magic number is (1 Shl 29) - (1 Shl 22)
	Return fp[0]

End Function

zzz	(Posted 2013) [#18]

Relating to another topic on here, pasting that code directly into where its used instead of having a separate function should be considerably faster.

EDIT: I did a quick test, and a function relying on the SSE square instruction is about as fast as inlining the above code. AVX was about ~50% faster still, but then you'd narrow it down a bit too much :)

You would have to redesign your code though, since youll need to pass 4 values at once. There are usually better ways to speed up this kind of code, which is why i asked for a better explanation on what its doing earlier.