# Table of Contents

- 1. Forword & Introduction
- 2. Motivation
- 3. Basics
- 4. Creating fixed-point numbers
- 5. Basic binary operations
- 5.1 Addition
- 5.2 Subtraction
- 5.3 Multiplication
- 5.4 Division

- 6. Basic unary operations
- 7. Trigonometric functions
- 8. Square root, exponentiation and logarithm
- 8.1 Square root
- 8.2 Logarithm
- 8.2.1 Logarithm using Arctanh
- 8.2.2 Logarithm using Taylor series

- 8.3 Exponentiation

- 9. Conclusion & References

# 1. Foreword & Introduction

This article presents my own approach of implementing a fixed point number class. It's not the best nor most efficient implementation, but it hopefully helps in understanding fixed point numbers. And of course, no guarantee for correctness... ;-)

After summarizing shortly the theoretical basics, the fixed point number format along with the most common mathematical operations are developed and discussed step by step. This tutorial is based on 16.16 fixed point numbers, but the explanations should be general enough to easily extended them to any fixed point number format.

# 2. Motivation

*Why another article on this topic you might ask?*

Well, the goal is to have a very practical tutorial with examples for each operation to quickly grasp the topic. It shall not be over-complicated, still it should be not wrong at all and should provide the basics in order to have a nice starting point to dive into the fixed point number subject.

*What's the use of fixed point numbers?*

Some older or low-power microcontrollers do not have a floating point number unit, thus cannot calculate decimal numbers in hardware. So the handling of decimal numbers must be performed in software. As an example the first versions of J2ME (Java Platform, Micro Edition) used on mobile devices had no native support for decimal numbers.
Also for some special use cases like in DSP, fixed point numbers could turn out more practical and efficient. And by the way, it's an interesting topic.

# 3. Basics

Fixed points numbers are represented using integer values. As the name implies, the position of the point between the integer and fractional part is fixed - meaning the number of bits representing the integer part as well as the number of bits representing the fractional part is always the same. That's a huge difference to floating point formats like IEEE 754 where the number of bits used for the integer part and fractional parts differs from number to number, gaining a much better accuracy and wider range of presentable numbers.

In the following, the values NUM_INT_BITS and NUM_FRAC_BITS define the number of bits used for representing the integer respectively the fractional part of a decimal number.

The examples and implementation uses all a 16.16 representation, meaning:

- NUM_INT_BITS = 16
- NUM_FRAC_BITS = 16
- NUM_TOTAL_BITS = NUM_INT_BITS + NUM_FRAC_BITS = 32

Thus a single fixed point number is represented by a 32 bit value as follows:

Index | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 | -11 | -12 | -13 | -14 | -15 | -16 |

Weight | 2^{15} | 2^{14} | 2^{13} | 2^{12} |
2^{11} | 2^{10} | 2^{9} | 2^{8} |
2^{7} | 2^{6} | 2^{5} | 2^{4} |
2^{3} | 2^{2} | 2^{1} | 2^{0} |
2^{-1} | 2^{-2} | 2^{-3} | 2^{-4} |
2^{-5} | 2^{-6} | 2^{-7} | 2^{-8} |
2^{-9} | 2^{-10} | 2^{-11} | 2^{-12} |
2^{-13} | 2^{-14} | 2^{-15} | 2^{-16} |

Bitvalue | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |

Here the weight of each bit that contributes to the actual numerical value is explicitely noted. To have a shorter respresentation, all subsequent binary respresentations of fixed point numbers are shown without indices and weights:

x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |

The bit value at position i is denoted as b_{i}, where i is the index in range of {15, ... 1, 0, -1, ..., -16}.

## 3.1 Ranges and respresentations

INFO

This fixed point number approach targets signed numbers, so all subsequent explanations and definitions addresses only signed numbers. However, handling only unsigned numbers should be quite easy after reading this article.

The **integer** part can contain 2^{NUM_INT_BITS} values. Because we want to represent also negative values, the two-complement form is used.

This means the integer as itself can take values of range { -2^{NUM_INT_BITS}, 2^{NUM_INT_BITS} - 1 }.

More general, because we use a NUM_TOTAL_BITS = 32 bit, the range is

{ 2^{NUM_TOTAL_BITS} / 2^{NUM_FRAC_BITS} - 1, -2^{NUM_TOTAL_BITS} / 2^{NUM_FRAC_BITS} } = { 2^{32} / 2^{16} - 1, -2^{32} / 2^{16} } = { 2^{16} -1, -2^{16} } = { 32767, -32767 }.

The **fractional** part consists of NUM_FRAC_BITS, with the weight ranging from 2^{-1} to 2^{-16}.

So the biggest fractional smaller than the next integer value is 2^{-1} + 2^{-2} + ... + 2^{-NUM_FRAC_BITS}, with NUM_FRAC_BITS = 16 this result in ≈ 0,999985. This is by the way the same as 2 - 2^{-16} / (2^{-16} - 1) ≈ 0,999985.

The smallest possible difference between two numbers is given by 2^{-16} ≈ 0,00001526.

**In general**, the value of a fixed-point decimal is given by the equation:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

The value x of this fixed-point number is calculated as:

$x={2}^{3}+{2}^{2}+\frac{1}{{2}^{1}}+\frac{1}{{2}^{2}}=8+4+0.5+0.25=12.75$1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

The value x of this fixed-point number is calculated as:

$x={-2}^{\mathrm{15}}+{2}^{8}+\frac{1}{{2}^{3}}=-32768+256+0.125=-32511.875$Let's have a look at some interesting and boundary values:

- The number zero is trivial - all bits are zero.

- The biggest number that can be represented is as follows:
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ^{14}+ ... + 2^{1}+ 2^{0}+ 2^{-1}+ ... + + 2^{-16}≈ 32767,999985.

- The smallest number that can be represented is as follows:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ^{15}= -32768.

# 4. Creating fixed-point numbers

To work with fixed-point numbers, the first implementation step is to create fixed-point numbers with the desired value.

## 4.1 From integer numbers

This is the easiest case. Having an integer value, just shift it the number of fractional bits to the left to skip the fractional part. This also works for negative values as integer values are stored the same way (in two-complement from) as the integer part of our fixed-point numbers.

The result of 12 << 16 is:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Hex value: 0x00C00000 |

## 4.2 From floating-point numbers

At first, a constant FP_ONE is introduced that represents the value 1 in fixed-point format. Using the previous chapter, this is just 1 << NUM_FRAC_BITS, thus 0x00010000. This constant is used to build and create a fractional value between 0 and 1 in fixed-point number - just multiply it with FP_ONE!

Calculate 0.25 * FP_ONE = 0.25 * 0x00010000 = 0.25 * 65536 = 16384 = 0x00004000.

So the bit at position -2 is only set, so the value is -2^{2} = 0.25!

Summarized, creating a fixed-point number from a floating-point number can be done in two steps. First convert the integer part, then the fractional part and finally add both values.

int intPart = (int)x;

// get fractional part of passed value

float fracPart = x - intPart;

// convert the integer part to fixed point

fpValue = intPart << NUM_FRAC_BITS;

// convert the fractional part to fixed point and add it to converted integer part

fpValue += (int)(SUNFP_ONE * fracPart);

This can be shortened to following one-liner!

## 4.3 From strings

Basic idea is to find the decimal point: the part in front of the decimal point is the integer part and this can be directly converted from string to int. Then the integer conversion from chapter 4.1.1 can be used.

Same applies to the case if no decimal point can be found in the string - then there is no fractional part at all, only an integer part.

The handling of the fractional part is a bit more complicated. Take the fractional part as itself and convert it to an integer. This is multiplied by FP_ONE. This results in a fixed point number with the value of the fractional part - but that is too large by a power of ten! So count the number of characters in the fractional part and divide the intermediate FP by 10^{<number of characters>}. This gives the correct fractional part that is added to the integer part.

The "3" before the decimal point is converted to integer and then to fixed-point number, this gives 0x0003000.

The fractional part "75" after the decimal point is also converted to integer, this gives 75. This is multiplied by FP_ONE, so 75 * 0x00010000 = 0x4B0000.

The fractional part "75" has two digits, so divide 0x4B0000 by 10^{2} = 0x4B0000 / 100 = 0xC000.

Finally both parts are added: 0x00030000 + 0xC000 = 0x0003C000. That is our fixed point number.

Recheck the represented value: 0x0003C000 = b0000 0000 0000 0011 1100 0000 0000 0000 -> 2^{1} + 2^{0} + 2^{-1} + 2^{-2} = 3.75!

However, what about negative numbers? Consider the number "-4.324". In short, the calculation would be rougly as -4 + 324/1000 = -3,676!

To fix it, the integer part is subtracted by 1 and then "1 - fractional part" is added. In the case of "4.32" this would be -4 - 1 + ( 1 - 324/1000) = -5 + 676/100 = -4,324. Yes!

So here a possible implementation:

{

s = s.Trim();

if (s.Contains('.'))

{

string[] parts = s.Split(new char[] {'.'}, 2);

// get integer part

fpValue = GetIntFromStr(parts[0]) << NUM_FRAC_BITS;

// get fractional part

if (parts.Length > 1)

{

int numFracDigits = parts[1].Length;

int fracInt = GetIntFromStr(parts[1]);

fracInt = fracInt * SUNFP_ONE / (int)Math.Pow(10, numFracDigits);

if (fpValue < 0)

{

fpValue -= SUNFP_ONE;

fracInt = SUNFP_ONE - fracInt;

}

fpValue |= (fracInt & MASK_FRAC_BITS);

}

}

else

{

/* no fractional part, try to parse it as int */

fpValue = GetIntFromStr(s) << NUM_FRAC_BITS;

}

}

# 5. Basic binary operations

In this chapter, following basic operations are discussed: addition, subtraction, multiplication and division.

## 5.1 Addition

Addition is actually ridiculous easy - just adding two fixed-point numbers result in the correct value. This works as we use the same 16.16 format for all numbers. Of course, if the decimal point is not at the same position for both numbers, special care has to be taken. As normal integer addition is used, there is also the problem of overflowing, e.g 32767 + 1 = -32768.

{

SunFp fpValueResult = fpValue1 + fpValue2;

return fpValueResult;

}

## 5.2 Subtraction

The same as for addition also applies for subtraction - just use integer subtraction, nothing interesting here.

{

SunFp fpValueResult = fpValue1 - fpValue2;

return fpValueResult;

}

## 5.3 Multiplication

Let's see if two fixed point numbers can just be multiplied to get the correct value, similar to addition and subtraction:

The numbers 4 and 5 have the values in fixed point format 0x00040000 and 0x00050000. Calculating 0x00040000 * 0x00050000 results in 0x1400000000!

Two problems are obvious here:

- At first, the result is far too big. The expected result of 4 * 5 is 20 (0x14), that in fixed point format 0x00140000.

So the calculated value 0x1400000000 is too large by factor 0x10000 = 2^{16}. - Secondly, this value does also not fit into our 16.16 format.

So first address the result issue itself: Because the result is too large by factor 2^{16}, divide it by this value which is actually just shifting it by 16 (more specific: by NUM_FRAC_BITS) to the right.

NOTE

Shifting one of the two operands by 16 before the multiplication won't work because that will lose the fractional part.

Example: Let's multiply 1.5 by 1.5, that would be 0x00011000 * 0x00011000. If one operand is shifted first by sixteen, it would result in 0x00011000 * 0x00000001 = 0x00011000. This is again 1.5 because one operand was shorted by the shift from 1.5 to 1, so actually 1.5 * 1 was calculated.

Shifting the result by 16 leads to following possible implementation:

{

SunFp fpValueResult = (int)((long)(fpValue1 * fpValue2) >> 16);

return fpValueResult;

}

However, there is still the small issue that the intermediate result does not fit into our 16.16 fixed point format, therefore this intermediate result has to be cased to *long* before shifting.

A possible solution is to treat the integer and fractional parts of both operands separately as following examples demonstrates:

The result of 3.5 * 2.4 is 8.4. This can also be calculated as follow:

3.5 * 2.4

= 3 * 2.4 + 0.5 * 2.4

= 3 * 2 + 3 * 0.4 + 0.5 * 2 + 0.5 * 0.4

= 6 + 1.2 + 1 + 0.2 = 8.4.

This can be implemented as follows:

{

int intPart1 = fpValue1 >> NUM_FRAC_BITS;

int intPart2 = fpValue2 >> NUM_FRAC_BITS;

int fracPart1 = fpValue1 & MASK_FRAC_BITS;

int fracPart2 = fpValue2 & MASK_FRAC_BITS;

int fpValueResult = 0;

fpValueResult += (intPart1 * intPart2) << NUM_FRAC_BITS;

fpValueResult += (intPart1 * fracPart2);

fpValueResult += (fracPart1 * intPart2);

fpValueResult += ((fracPart1 * fracPart2) >> NUM_FRAC_BITS) & MASK_FRAC_BITS;

}

## 5.4 Division

After having discussed multiplication, it's not suprising that similar issues arise with division. Instead of dividing the result by 2^{16}, it has to be multiplied by 2^{16}, thus shifted to the left by 16. However, the shift must now occur before the actual division, otherwise precision is lost.

Let's divide 6.25 by 2.5 (= 2.5). In fixed point format, that is 0x00064000 divided by 0x00028000. However, 0x00064000 / 0x00028000 = 0x00000002 = 2, and not 2.5 as expected!

So the dividend has to be shifted to the left by 16 *before* division:

0x00064000 << 16) / 0x00028000 = 0x0000000640000000 / 0x00028000 = 0x00028000 which is 2.5 in fixed point format.

This means again that we need a *long* data type for implementation:

{

int fpValueResult = (int)((long)(fpValue1 << 16) / fpValue2);

}

The only way that came into my mind to avoid the *long* data type is follows but unfortunately it introduces a small loss of precision. (If you know a better solution, I would be glad to be informed about it - feel free to contact me!):

Basic idea is to replace the division by multiplication with the reciprocal, i.e. instead of a / b, it's calculated a * (1 / b).

Well, nothing gained, there is still a division left. And the reciprocal 1 / a of value a is in fixed point format (1 << 32) / a.

Again a *long* is needed to hold the value 1 << 32...

Here comes the step that avoids the *long* however at which also precision is lost: It is shifted only by 31 bits during reciprocal calculation, then the multiplication is performed, and afterwards the last shift by one to the left is performed.

Let's compare both values with example.

a) Using 32bit left shift:

First calculate 1 / 2.5. That's ((1 << 32) / 0x00028000) = 0x00006666 (≈ 0.39999).

Then final calculation: 20 * 0.39999 , so in FP format 0x00140000 * 0x00006666 = 0x7FFF80000. Finally right-shift by 16 to right gives 0x7FFF8, that actually ≈ 7.9998.

b) Using 31bit left shift:

That's ((1 << 31) / 0x00028000) = 0x80000000 / 0x00028000 = 0x3333.

Now do the multiplication and shifting: (0x00140000 * 0x000003333) >> 16 = 0x3FFFC0000 >> 16 = 0x0003FFFC.

Left-shift this result to the left by one: 0x00003FFC << 1 = 0x0007FFF8, that's again ≈ 7.9998.

Compare it to the variant using long:

Let's divide 20 by 2.5 (= 8).

(0x00140000 << 16) / 0x00028000 = 0x001400000000 / 0x00028000 = 80000 which is 8 in fixed point format.

Let's see the implementation without using *long* data type:

{

uint v2reciprocal = 1;

v2reciprocal <<= 31;

v2reciprocal = (uint)(v2reciprocal / fpValue2);

int fpValueResult = fpValue1 * v2reciprocal;

fpValueResult <<= 1;

}

# 6. Basic unary operations

In this chapter, following basic unary operations are discussed: floor, ceil, round and absolute.

## 6.1 Floor

The floor function applied to a decimal number x gives the greatest integer number that is equal or less than x, e.g. floor(1.2) = 1 and floor (-2.4) = 3.

Due to the fact how this fixed point number is represented, the implementation is very easy - just mask out the fractional part. This also works for negative numbers as following examples make clear:

a) Floor(1.25):

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |

Mask out the fractional part (lowest 16 bits) results in:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00010000) --> Decimal value: 1 |

b) Floor(-11.625):

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF46000) --> Decimal value: -11.625 |

Mask out the fractional part (lowest 16 bits) results in:

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF40000) --> Decimal value: -12 |

Decimal numbers that are actually integer values are the trivial case, because in fixed-point format all fractional bits are zero, thus masking them out does not change the actual value, so e.g. floor(-12) = -12.

Here's the implementation:

{

SunFp fpValueResult = (int)(fpValue & 0xFFFF);

return fpValueResult;

}

## 6.2 Ceil

The ceil function applied to a decimal number x gives the greatest integer number that is equal or greater than x, e.g. ceil(1.2) = 2 and floor (-2.4) = -2.

Ceil and floor are tightly coupled, so an approach is to implement the ceil function using the floor function: Ceil(x) = Floor(x+1). However this works not for plain integer numbers, as the result is then off by one, so this case has to be treated specially:

- If x has a decimal part (no plain integer), then Ceil(x) = Floor(x+1).
- If x has no decimal part (plain integer), then Ceil(x) = x.

a) Ceil(1.25) - no plain integer:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |

Add one to the value (FP fpormat 0x00010000), then mask out the fractional part (lowest 16 bits) results in:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00020000) --> Decimal value: 2 |

b) Ceil(-11.625) - no plain integer:

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF46000) --> Decimal value: -11.625 |

Add one to the value (FP fpormat 0x00010000), then mask out the fractional part (lowest 16 bits) results in:

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF50000) --> Decimal value: -12 |

c) Ceil(-4) - plain integer:

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFFFC000) --> Decimal value: -4 |

-4 is a plain integer (fractional part is zero), thus nothing has to be done.

Here's a possible implementation:

{

SunFp fpValueResult = 0;

if ((fpValue & 0x0000FFFF) == 0)

{

fpValueResult = fpValue;

}

else

{

fpValueResult = ((fpValue + SUNFP_ONE) & 0x0000FFFF);

}

return fpValueResult ;

}

## 6.3 Round

The round function applied to a decimal number x gives the nearest integer number to x, e.g. round(1.2) = 1, round(1.7) = 2 and round(-2.4) = -2.

Again the floor function can be used to implement the round function as follows:

- If x >= 0, then round(x) = floor(x + 0.5).
- If x < 0, then round(x) = floor(x - 0.5).

Due to the way negative numbers are respresented in this fixed-point format, always the first implementation case is required, thus round(x) = floor(x + 0.5).

a) Round(1.25):

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |

Add 0.5 to the value (FP format 0x00008000) gives 1.75 (FP format 0x0001C000), then mask out the fractional part (lowest 16 bits) results in:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00010000) --> Decimal value: 1 |

a) Round(1.75):

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x0001C000) --> Decimal value: 1.75 |

Add 0.5 to the value (FP format 0x00008000) gives 2.25 (FP format 0x00024000), then mask out the fractional part (lowest 16 bits) results in:

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00020000) --> Decimal value: 2 |

Note that the rounding of value .5 is ambiguous, e.g. the distance from 2.5 to 2 and from 2.5 to 3 is the same, thus the rounding of values with decimal part .5 has to be defined. This implementation returns the larger value in this case.

Here's the implementation:

{

SunFp fpValueResult = 0;

fpValueResult = fpValue + SUNFP_HALF;

fpValueResult = SunFp.Floor(fpValueResult);

return fpRet;

}

## 6.4 Absolute

The abs function applied to a decimal number x gives the positive value of x independent of the sign of x, e.g. abs(4.5) = 4.5 and abs(-4.5) = 4.5.

Actually this can be easily implemented by handling two distinct cases:

- If x >= 0, just return x.
- If x < 0, then subtract x from zero, thus abs(x) = 0 - x.

Abs(-0.5): The value -0.5 is represented in FP format as 0xffff8000. 0x00000000 - 0xffff8000 gives 0x00008000 (note the truncation of the left-most bit), that's 0.5.

# 7. Trigonometric functions

In this chapter, following trigonometric functions are discussed: Sine, cosine and tangent.

## 7.1 Sine

To calculate the sine of decimal value x, an approximation using Taylor series is used:

$\mathrm{sin}\left(x\right)=x-\frac{{x}^{3}}{3!}+\frac{{x}^{5}}{5!}-\frac{{x}^{7}}{7!}+\cdots $The more terms are used, the better the accurary of the result will be. Take shall be taken of overflow: Already 9! = 362880 does not fit into the 16bit integer part. However x^{9}/9! can also be written as x^{9}/(7! * 72) and 7! still fits into a 16bit number.

Note that the input value must be in range [-PI, PI] - the transformation of any value to that range is easy, but not part of following implementation:

{

SunFp fpValueResult = fpValue;

SunFp tmp;

SunFp fac7 = Factorial(7);

if (fpValue < -SUNFP_PI || fpValue > SUNFP_PI)

{

throw new ArgumentException("Sin: Value out of range [-PI, PI]");

}

tmp = Power(fpValue, 3) / Factorial(3);

fpValueResult = fpValueResult - tmp;

tmp = Power(fpValue, 5) / Factorial(5);

fpValueResult = fpValueResult + tmp;

tmp = Power(fpValue, 7) / fac7;

fpValueResult = fpValueResult - tmp;

/* 9! will overflow the 16bit integer part, so calc step by step */

tmp = Power(fpValue, 9) / fac7;

tmp = tmp / (9 * 8);

fpValueResult = fpValueResult + tmp;

return fpValueResult ;

}

## 7.2 Cosine

To calculate the cosine of decimal value x, again an approximation using Taylor series is used:

$\mathrm{cos}\left(x\right)=1-\frac{{x}^{2}}{2!}+\frac{{x}^{4}}{4!}-\frac{{x}^{6}}{6!}+\cdots $Also here the input value must be in range [-PI, PI] . As the implementation is very similar to sine, the listing is left out.

## 7.3 Tangent

The easiest calculation of tangent is using the equation tan(x) = sin(x) / cos(x) because the functions for sine and cosine already exists:

{

if (fpValue <= -SUNFP_PIHALF || fpValue >= SUNFP_PIHALF)

{

throw new ArgumentException("Sin: Value out of range (-PI/2, PI/2)");

}

return Sin(fpValue) / Cos(fpValue);

}

# 8. Square root, exponentiation and logarithm

## 8.1 Square root

To calculate the square root, Newton's method for approximating the root of any function is used. That is for any function f we want to find x so that f(x) = 0 (see also [4]).

This method starts with an initial guess and calculates iteratively a new guess that is closer to the actual value than the previous value. If we have an approximation of the result value x_{n}, then a better approximation x_{n+1} is calculated as:

We want to find the result of sqrt(x). That is the same as to find the root of the function f(x) = x^{2} - a = 0 because the result is sqrt(a).

The derivation of f(x) is f'(x) = 2*x. So let's insert f(x) and f'(x) into Newton's approximation formula:

x_{n+1} = x - (x^{2} - a) / 2x

= ((x * 2x) / 2x) - ( (x^{2} - a) / 2x)

= ((2x^{2}) / 2x) - ( (x^{2} - a) / 2x)

= (x^{2} + a) / 2x

= 0.5x + 0.5(a + x)

= 0.5 (x + a/x)

Two small things are missing:

- First we need an initial 'guess value' for x
_{0}. For simplicity we just use 1. However the better the initial start value is, the faster the approximation approaches the target value. - Secondly, a termination condition is required to abort the iterative approximation. It's unknown how fast the approximation works, thus two conditions are chosen in my sample implementation: First a maximum of ten iterations are allowed, then the algorithm terminates at the latest. Secondly, the algorithm terminates earlier if a good approximation is found. At each step there is a guess x
_{n}that shall be the result of sqrt(x) - so if x_{n}^{2}is very near to x, the algorithm can be stopped as the required accuracy is reached. Here's a sample implementation:

{

SunFp fpValueResult = 1;

SunFp error = 0.001f;

int iteration = 0;

while ((iteration++ < 10) && (SunFp.Abs((fpValueResult * fpValueResult ) - fpValue) >= error) )

{

fpValueResult = 0.5f * (fpValueResult + (fpValue / fpValueResult ));

}

return fpValueResult ;

}

## 8.2 Logarithm

Two different approaches are implemented to calculate the natural logarithm (to base *e*) - note that there are many more possibilities to approximate the logarithm. The logarithm to any other base than e can be compited by base changing, see e.g. [3].

### 8.2.1 Logarithm using Arctanh

The logarithm can be calculated using the inverse hyperbolic tangent according to following equation:

$\mathrm{log}\left(x\right)=2\times \mathrm{artanh}\left(\frac{x-1}{x+1}\right)$where x must be in range [0, +inf). The inverse hyperbolic tangent itself can be approximated using following Taylor series:

$\mathrm{artanh}\left(x\right)=x+\frac{{x}^{3}}{3}+\frac{{x}^{5}}{5}+\frac{{x}^{7}}{7}+\cdots $The listing of the implementation is left out, have a look inside the source code.

### 8.2.2 Logarithm using Taylor series

The logarithm can be approximated using following Taylor series:

$\mathrm{log}\left(x+1\right)=x-\frac{{x}^{2}}{2}+\frac{{x}^{3}}{3}-\frac{{x}^{4}}{4}+\cdots $This is only valid for input range (-1, 1). Therefore following steps are performed to handle also the general case:

- The logarithm of x, log(x), shall be calculated instead of log(x+1). Thus the input value is subtracted by 1 - this gives us the input range (0, 2).
- To handle all input values, in range [0, +inf), scaling is used. So the input value is scaled in a way that one part falls in range (0, 2) which can be calculated using above Taylor series, the other part is pre-calculated. The fact log(a * b) = log(a) + log(b) is used.
log(25) = log(25/16 * 16) = log(25/16) + log(16).

log(16) is precalculated, while log(25/16) = log(1.5625) is calculated using the Taylor series.^{x}are pre-calculated and stored in a look-up table. Because the largest integer fixed point value is 2^{15}= 32768, the lookup table has entries for x in range [0, 15].

The appropriate divider from the 2^{x} lookup table is searched, then the input value is subtracted by one, then the logarithm is actually computed. Here's the implementation:

{

new SunFp(0), // index: 0 -> LogN(1)

new SunFp(0.6931f), // index: 1 -> LogN(2)

new SunFp(1.3863f), // index: 2 -> LogN(4)

new SunFp(2.0794f), // index: 3 -> LogN(8)

new SunFp(2.7726f), // index: 4 -> LogN(16)

new SunFp(3.4657f), // index: 5 -> LogN(32)

new SunFp(4.1589f), // index: 6 -> LogN(64)

new SunFp(4.8520f), // index: 7 -> LogN(128)

new SunFp(5.5452f), // index: 8 -> LogN(256)

new SunFp(6.2383f), // index: 9 -> LogN(512)

new SunFp(6.9315f), // index: 10 -> LogN(1024)

new SunFp(7.6246f), // index: 11 -> LogN(2048)

new SunFp(8.3178f), // index: 12 -> LogN(4096)

new SunFp(9.0109f), // index: 13 -> LogN(8192)

new SunFp(9.7041f), // index: 14 -> LogN(16384)

new SunFp(10.3972f), // index: 15 -> LogN(32768)

};

public static SunFp LogN_Taylor(SunFp fpValue)

{

int dividerLogLoopUpTableIdx;

if (fpValue < 0)

{

throw new ArgumentException("LogN_Taylor: Value out of range [0, +inf]");

}

SunFp divider = 1;

dividerLogLoopUpTableIdx = 0;

/* find divider so that fpValue >= divider and fpValue < divider + 1 */

SunFp lowerCmp = 2;

for (int exp = 1; exp < LogNDividerLookTable.Length; exp++)

{

SunFp upperCmp = lowerCmp * 2;

if (upperCmp == new SunFp(-32768))

{

/* 2^16 is 32768 due to overflow, subtract one to get 32767 */

upperCmp -= 1;

}

if (fpValue >= lowerCmp && fpValue <= upperCmp)

{

divider = lowerCmp;

dividerLogLoopUpTableIdx = exp;

break;

}

lowerCmp = upperCmp;

}

fpValue = fpValue / divider;

fpValue = fpValue - 1;

// taylor series

SunFp res = fpValue;

res -= (Power(fpValue, 2) / 2);

res += (Power(fpValue, 3) / 3);

res -= (Power(fpValue, 4) / 4);

res += (Power(fpValue, 5) / 5);

res -= (Power(fpValue, 6) / 6);

res += (Power(fpValue, 7) / 7);

// add precalculated 2^x logarithm

res += LogNDividerLookTable[dividerLogLoopUpTableIdx];

return res;

}

## 8.3 Exponentiation

Exponentiation to base *e* is calculated with following equation:

Here's the implementation:

{

SunFp fpRet;

SunFp tmp = 1;

fpRet = 1;

for (int iteration = 1; iteration <= 14; iteration++)

{

tmp = 1;

for (int i = 1; i <= iteration; i++)

{

tmp = tmp * (v / i);

}

fpRet = fpRet + tmp;

}

return fpRet;

}

## 9. Conclusion & References

That's it! Those were the notes of my own try to implement a fixed-point class. Hope it was interesting for you and have fun!

Sunshine, April 2017

### References

[1] Fixed-point arithmetic @ Wikipedia[2] Randy Yates - Fixed-Point Arithmetic: An Introduction

[3] Logarithm @ Wikipedia

[4] Newton iterative sqrt method

#### History

- 2017-04-18: First version
- 2021-03-13: Corrected the Taylor series for cosine from cos = x - (...) to cos = 1 - (...).