Table of Contents
- 1. Foreword & Introduction
- 2. Motivation
- 3. Basics
- 4. Creating fixed-point numbers
- 5. Basic binary operations
- 5.1 Addition
- 5.2 Subtraction
- 5.3 Multiplication
- 5.4 Division
- 6. Basic unary operations
- 7. Trigonometric functions
- 8. Square root, exponentiation and logarithm
- 8.1 Square root
- 8.2 Logarithm
- 8.2.1 Logarithm using Arctanh
- 8.2.2 Logarithm using Taylor series
- 8.3 Exponentiation
- 9. Conclusion & References
1. Foreword & Introduction
This article presents my own approach of implementing a fixed point number class. It's not the best nor most efficient implementation, but it hopefully helps in understanding fixed point numbers. And of course, no guarantee for correctness... ;-)
After summarizing shortly the theoretical basics, the fixed point number format along with the most common mathematical operations are developed and discussed step by step. This tutorial is based on 16.16 fixed point numbers, but the explanations should be general enough to easily extended them to any fixed point number format.
2. Motivation
Why another article on this topic you might ask?
Well, the goal is to have a very practical tutorial with examples for each operation to quickly grasp the topic. It shall not be over-complicated, still it should be not wrong at all and should provide the basics in order to have a nice starting point to dive into the fixed point number subject.
What's the use of fixed point numbers?
Some older or low-power microcontrollers do not have a floating point number unit, thus cannot calculate decimal numbers in hardware. So the handling of decimal numbers must be performed in software. As an example the first versions of J2ME (Java Platform, Micro Edition) used on mobile devices had no native support for decimal numbers.
Also for some special use cases like in DSP, fixed point numbers could turn out more practical and efficient. And by the way, it's an interesting topic.
3. Basics
Fixed points numbers are represented using integer values. As the name implies, the position of the point between the integer and fractional part is fixed - meaning the number of bits representing the integer part as well as the number of bits representing the fractional part is always the same. That's a huge difference to floating point formats like IEEE 754 where the number of bits used for the integer part and fractional parts differs from number to number, gaining a much better accuracy and wider range of presentable numbers.
In the following, the values NUM_INT_BITS and NUM_FRAC_BITS define the number of bits used for representing the integer respectively the fractional part of a decimal number.
The examples and implementation uses all a 16.16 representation, meaning:
- NUM_INT_BITS = 16
- NUM_FRAC_BITS = 16
- NUM_TOTAL_BITS = NUM_INT_BITS + NUM_FRAC_BITS = 32
Thus a single fixed point number is represented by a 32 bit value as follows:
Index | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 | -11 | -12 | -13 | -14 | -15 | -16 |
Weight | 215 | 214 | 213 | 212 | 211 | 210 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 2-1 | 2-2 | 2-3 | 2-4 | 2-5 | 2-6 | 2-7 | 2-8 | 2-9 | 2-10 | 2-11 | 2-12 | 2-13 | 2-14 | 2-15 | 2-16 |
Bitvalue | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
Here the weight of each bit that contributes to the actual numerical value is explicitly noted. To have a shorter representation, all subsequent binary representations of fixed point numbers are shown without indices and weights:
x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
The bit value at position i is denoted as bi, where i is the index in range of {15, ... 1, 0, -1, ..., -16}.
3.1 Ranges and representations
INFO
This fixed point number approach targets signed numbers, so all subsequent explanations and definitions addresses only signed numbers. However, handling only unsigned numbers should be quite easy after reading this article.
The integer part can contain 2NUM_INT_BITS values. Because we want to represent also negative values, the two-complement form is used.
This means the integer as itself can take values of range { -2NUM_INT_BITS, 2NUM_INT_BITS - 1 }.
More general, because we use a NUM_TOTAL_BITS = 32 bit, the range is
{ 2NUM_TOTAL_BITS / 2NUM_FRAC_BITS - 1, -2NUM_TOTAL_BITS / 2NUM_FRAC_BITS } = { 232 / 216 - 1, -232 / 216 } = { 216 -1, -216 } = { 32767, -32767 }.
The fractional part consists of NUM_FRAC_BITS, with the weight ranging from 2-1 to 2-16.
So the biggest fractional smaller than the next integer value is 2-1 + 2-2 + ... + 2-NUM_FRAC_BITS, with NUM_FRAC_BITS = 16 this result in ≈ 0,999985. This is by the way the same as 2 - 2-16 / (2-16 - 1) ≈ 0,999985.
The smallest possible difference between two numbers is given by 2-16 ≈ 0,00001526.
In general, the value of a fixed-point decimal is given by the equation:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The value x of this fixed-point number is calculated as:
1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The value x of this fixed-point number is calculated as:
Let's have a look at some interesting and boundary values:
- The number zero is trivial - all bits are zero.
- The biggest number that can be represented is as follows:
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
- The smallest number that can be represented is as follows:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4. Creating fixed-point numbers
To work with fixed-point numbers, the first implementation step is to create fixed-point numbers with the desired value.
4.1 From integer numbers
This is the easiest case. Having an integer value, just shift it the number of fractional bits to the left to skip the fractional part. This also works for negative values as integer values are stored the same way (in two-complement from) as the integer part of our fixed-point numbers.
The result of 12 << 16 is:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Hex value: 0x00C00000 |
4.2 From floating-point numbers
At first, a constant SUNFP_ONE is introduced that represents the value 1 in fixed-point format. Using the previous chapter, this is just 1 << NUM_FRAC_BITS, thus 0x00010000. This constant is used to build and create a fractional value between 0 and 1 in fixed-point number - just multiply it with SUNFP_ONE!
Calculate 0.25 * SUNFP_ONE = 0.25 * 0x00010000 = 0.25 * 65536 = 16384 = 0x00004000.
So the bit at position -2 is only set, so the value is -22 = 0.25!
Summarized, creating a fixed-point number from a floating-point number can be done in two steps. First convert the integer part, then the fractional part and finally add both values.
int intPart = (int)x;
// get fractional part of passed value
float fracPart = x - intPart;
// convert the integer part to fixed point
fpValue = intPart << NUM_FRAC_BITS;
// convert the fractional part to fixed point and add it to converted integer part
fpValue += (int)(SUNFP_ONE * fracPart);
This can be shortened to following one-liner!
4.3 From strings
Basic idea is to find the decimal point: the part in front of the decimal point is the integer part and this can be directly converted from string to int. Then the integer conversion from chapter 4.1.1 can be used.
Same applies to the case if no decimal point can be found in the string - then there is no fractional part at all, only an integer part.
The handling of the fractional part is a bit more complicated. Take the fractional part as itself and convert it to an integer. This is multiplied by SUNFP_ONE. This results in a fixed point number with the value of the fractional part - but that is too large by a power of ten! So count the number of characters in the fractional part and divide the intermediate FP by 10<number of characters>. This gives the correct fractional part that is added to the integer part.
The "3" before the decimal point is converted to integer and then to fixed-point number, this gives 0x0003000.
The fractional part "75" after the decimal point is also converted to integer, this gives 75. This is multiplied by SUNFP_ONE, so 75 * 0x00010000 = 0x4B0000.
The fractional part "75" has two digits, so divide 0x4B0000 by 102 = 0x4B0000 / 100 = 0xC000.
Finally both parts are added: 0x00030000 + 0xC000 = 0x0003C000. That is our fixed point number.
Recheck the represented value: 0x0003C000 = b0000 0000 0000 0011 1100 0000 0000 0000 -> 21 + 20 + 2-1 + 2-2 = 3.75!
However, what about negative numbers? Consider the number "-4.324". In short, the calculation would be roughly as -4 + 324/1000 = -3,676!
To fix it, the integer part is subtracted by 1 and then "1 - fractional part" is added. In the case of "4.32" this would be -4 - 1 + ( 1 - 324/1000) = -5 + 676/100 = -4,324. Yes!
So here a possible implementation:
{
s = s.Trim();
if (s.Contains('.'))
{
string[] parts = s.Split(new char[] {'.'}, 2);
// get integer part
fpValue = GetIntFromStr(parts[0]) << NUM_FRAC_BITS;
// get fractional part
if (parts.Length > 1)
{
int numFracDigits = parts[1].Length;
int fracInt = GetIntFromStr(parts[1]);
fracInt = fracInt * SUNFP_ONE / (int)Math.Pow(10, numFracDigits);
if (fpValue < 0)
{
fpValue -= SUNFP_ONE;
fracInt = SUNFP_ONE - fracInt;
}
fpValue |= (fracInt & MASK_FRAC_BITS);
}
}
else
{
/* no fractional part, try to parse it as int */
fpValue = GetIntFromStr(s) << NUM_FRAC_BITS;
}
}
5. Basic binary operations
In this chapter, following basic operations are discussed: addition, subtraction, multiplication and division.
5.1 Addition
Addition is actually ridiculous easy - just adding two fixed-point numbers result in the correct value. This works as we use the same 16.16 format for all numbers. Of course, if the decimal point is not at the same position for both numbers, special care has to be taken. As normal integer addition is used, there is also the problem of overflowing, e.g 32767 + 1 = -32768.
{
SunFp fpValueResult = fpValue1 + fpValue2;
return fpValueResult;
}
5.2 Subtraction
The same as for addition also applies for subtraction - just use integer subtraction, nothing interesting here.
{
SunFp fpValueResult = fpValue1 - fpValue2;
return fpValueResult;
}
5.3 Multiplication
Let's see if two fixed point numbers can just be multiplied to get the correct value, similar to addition and subtraction:
The numbers 4 and 5 have the values in fixed point format 0x00040000 and 0x00050000. Calculating 0x00040000 * 0x00050000 results in 0x1400000000!
Two problems are obvious here:
- At first, the result is far too big. The expected result of 4 * 5 is 20 (0x14), that in fixed point format 0x00140000.
So the calculated value 0x1400000000 is too large by factor 0x10000 = 216. - Secondly, this value does also not fit into our 16.16 format.
So first address the result issue itself: Because the result is too large by factor 216, divide it by this value which is actually just shifting it by 16 (more specific: by NUM_FRAC_BITS) to the right.
NOTE
Shifting one of the two operands by 16 before the multiplication won't work because that will lose the fractional part.
Example: Let's multiply 1.5 by 1.5, that would be 0x00011000 * 0x00011000. If one operand is shifted first by sixteen, it would result in 0x00011000 * 0x00000001 = 0x00011000. This is again 1.5 because one operand was shorted by the shift from 1.5 to 1, so actually 1.5 * 1 was calculated.
Shifting the result by 16 leads to following possible implementation:
{
SunFp fpValueResult = (int)((long)(fpValue1 * fpValue2) >> 16);
return fpValueResult;
}
However, there is still the small issue that the intermediate result does not fit into our 16.16 fixed point format, therefore this intermediate result has to be cased to long before shifting.
A possible solution is to treat the integer and fractional parts of both operands separately as following examples demonstrates:
The result of 3.5 * 2.4 is 8.4. This can also be calculated as follow:
3.5 * 2.4
= 3 * 2.4 + 0.5 * 2.4
= 3 * 2 + 3 * 0.4 + 0.5 * 2 + 0.5 * 0.4
= 6 + 1.2 + 1 + 0.2 = 8.4.
This can be implemented as follows:
{
int intPart1 = fpValue1 >> NUM_FRAC_BITS;
int intPart2 = fpValue2 >> NUM_FRAC_BITS;
int fracPart1 = fpValue1 & MASK_FRAC_BITS;
int fracPart2 = fpValue2 & MASK_FRAC_BITS;
int fpValueResult = 0;
fpValueResult += (intPart1 * intPart2) << NUM_FRAC_BITS;
fpValueResult += (intPart1 * fracPart2);
fpValueResult += (fracPart1 * intPart2);
fpValueResult += ((fracPart1 * fracPart2) >> NUM_FRAC_BITS) & MASK_FRAC_BITS;
}
5.4 Division
After having discussed multiplication, it's not surprising that similar issues arise with division. Instead of dividing the result by 216, it has to be multiplied by 216, thus shifted to the left by 16. However, the shift must now occur before the actual division, otherwise precision is lost.
Let's divide 6.25 by 2.5 (= 2.5). In fixed point format, that is 0x00064000 divided by 0x00028000. However, 0x00064000 / 0x00028000 = 0x00000002 = 2, and not 2.5 as expected!
So the dividend has to be shifted to the left by 16 before division:
0x00064000 << 16) / 0x00028000 = 0x0000000640000000 / 0x00028000 = 0x00028000 which is 2.5 in fixed point format.
This means again that we need a long data type for implementation:
{
int fpValueResult = (int)((long)(fpValue1 << 16) / fpValue2);
}
The only way that came into my mind to avoid the long data type is follows but unfortunately it introduces a small loss of precision. (If you know a better solution, I would be glad to be informed about it - feel free to contact me!):
Basic idea is to replace the division by multiplication with the reciprocal, i.e. instead of a / b, it's calculated a * (1 / b).
Well, nothing gained, there is still a division left. And the reciprocal 1 / a of value a is in fixed point format (1 << 32) / a.
Again a long is needed to hold the value 1 << 32...
Here comes the step that avoids the long however at which also precision is lost: It is shifted only by 31 bits during reciprocal calculation, then the multiplication is performed, and afterwards the last shift by one to the left is performed.
Let's compare both values with example.
a) Using 32bit left shift:
First calculate 1 / 2.5. That's ((1 << 32) / 0x00028000) = 0x00006666 (≈ 0.39999).
Then final calculation: 20 * 0.39999 , so in FP format 0x00140000 * 0x00006666 = 0x7FFF80000. Finally right-shift by 16 to right gives 0x7FFF8, that actually ≈ 7.9998.
b) Using 31bit left shift:
That's ((1 << 31) / 0x00028000) = 0x80000000 / 0x00028000 = 0x3333.
Now do the multiplication and shifting: (0x00140000 * 0x000003333) >> 16 = 0x3FFFC0000 >> 16 = 0x0003FFFC.
Left-shift this result to the left by one: 0x00003FFC << 1 = 0x0007FFF8, that's again ≈ 7.9998.
Compare it to the variant using long:
Let's divide 20 by 2.5 (= 8).
(0x00140000 << 16) / 0x00028000 = 0x001400000000 / 0x00028000 = 80000 which is 8 in fixed point format.
Let's see the implementation without using long data type:
{
uint v2reciprocal = 1;
v2reciprocal <<= 31;
v2reciprocal = (uint)(v2reciprocal / fpValue2);
int fpValueResult = fpValue1 * v2reciprocal;
fpValueResult <<= 1;
}
6. Basic unary operations
In this chapter, following basic unary operations are discussed: floor, ceil, round and absolute.
6.1 Floor
The floor function applied to a decimal number x gives the greatest integer number that is equal or less than x, e.g. floor(1.2) = 1 and floor (-2.4) = 3.
Due to the fact how this fixed point number is represented, the implementation is very easy - just mask out the fractional part. This also works for negative numbers as following examples make clear:
a) Floor(1.25):
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |
Mask out the fractional part (lowest 16 bits) results in:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00010000) --> Decimal value: 1 |
b) Floor(-11.625):
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF46000) --> Decimal value: -11.625 |
Mask out the fractional part (lowest 16 bits) results in:
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF40000) --> Decimal value: -12 |
Decimal numbers that are actually integer values are the trivial case, because in fixed-point format all fractional bits are zero, thus masking them out does not change the actual value, so e.g. floor(-12) = -12.
Here's the implementation:
{
SunFp fpValueResult = (int)(fpValue & 0xFFFF);
return fpValueResult;
}
6.2 Ceil
The ceil function applied to a decimal number x gives the greatest integer number that is equal or greater than x, e.g. ceil(1.2) = 2 and floor (-2.4) = -2.
Ceil and floor are tightly coupled, so an approach is to implement the ceil function using the floor function: Ceil(x) = Floor(x+1). However this works not for plain integer numbers, as the result is then off by one, so this case has to be treated specially:
- If x has a decimal part (no plain integer), then Ceil(x) = Floor(x+1).
- If x has no decimal part (plain integer), then Ceil(x) = x.
a) Ceil(1.25) - no plain integer:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |
Add one to the value (FP fpormat 0x00010000), then mask out the fractional part (lowest 16 bits) results in:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00020000) --> Decimal value: 2 |
b) Ceil(-11.625) - no plain integer:
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF46000) --> Decimal value: -11.625 |
Add one to the value (FP fpormat 0x00010000), then mask out the fractional part (lowest 16 bits) results in:
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFF50000) --> Decimal value: -12 |
c) Ceil(-4) - plain integer:
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0xFFFFC000) --> Decimal value: -4 |
-4 is a plain integer (fractional part is zero), thus nothing has to be done.
Here's a possible implementation:
{
SunFp fpValueResult = 0;
if ((fpValue & 0x0000FFFF) == 0)
{
fpValueResult = fpValue;
}
else
{
fpValueResult = ((fpValue + SUNFP_ONE) & 0x0000FFFF);
}
return fpValueResult ;
}
6.3 Round
The round function applied to a decimal number x gives the nearest integer number to x, e.g. round(1.2) = 1, round(1.7) = 2 and round(-2.4) = -2.
Again the floor function can be used to implement the round function as follows:
- If x >= 0, then round(x) = floor(x + 0.5).
- If x < 0, then round(x) = floor(x - 0.5).
Due to the way negative numbers are represented in this fixed-point format, always the first implementation case is required, thus round(x) = floor(x + 0.5).
a) Round(1.25):
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00012000) --> Decimal value: 1.25 |
Add 0.5 to the value (FP format 0x00008000) gives 1.75 (FP format 0x0001C000), then mask out the fractional part (lowest 16 bits) results in:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00010000) --> Decimal value: 1 |
a) Round(1.75):
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x0001C000) --> Decimal value: 1.75 |
Add 0.5 to the value (FP format 0x00008000) gives 2.25 (FP format 0x00024000), then mask out the fractional part (lowest 16 bits) results in:
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Hex 0x00020000) --> Decimal value: 2 |
Note that the rounding of value .5 is ambiguous, e.g. the distance from 2.5 to 2 and from 2.5 to 3 is the same, thus the rounding of values with decimal part .5 has to be defined. This implementation returns the larger value in this case.
Here's the implementation:
public static SunFp Round(SunFp fpValue)
{
SunFp fpValueResult = 0;
fpValueResult = fpValue + SUNFP_HALF;
fpValueResult = SunFp.Floor(fpValueResult);
return fpRet;
}
6.4 Absolute
The abs function applied to a decimal number x gives the positive value of x independent of the sign of x, e.g. abs(4.5) = 4.5 and abs(-4.5) = 4.5.
Actually this can be easily implemented by handling two distinct cases:
- If x >= 0, just return x.
- If x < 0, then subtract x from zero, thus abs(x) = 0 - x.
Abs(-0.5): The value -0.5 is represented in FP format as 0xffff8000. 0x00000000 - 0xffff8000 gives 0x00008000 (note the truncation of the left-most bit), that's 0.5.
7. Trigonometric functions
In this chapter, following trigonometric functions are discussed: Sine, cosine and tangent.
7.1 Sine
To calculate the sine of decimal value x, an approximation using Taylor series is used:
The more terms are used, the better the accurary of the result will be. Take shall be taken of overflow: Already 9! = 362880 does not fit into the 16bit integer part. However x9/9! can also be written as x9/(7! * 72) and 7! still fits into a 16bit number.
Note that the input value must be in range [-PI, PI] - the transformation of any value to that range is easy, but not part of following implementation:
{
SunFp fpValueResult = fpValue;
SunFp tmp;
SunFp fac7 = Factorial(7);
if (fpValue < -SUNFP_PI || fpValue > SUNFP_PI)
{
throw new ArgumentException("Sin: Value out of range [-PI, PI]");
}
tmp = Power(fpValue, 3) / Factorial(3);
fpValueResult = fpValueResult - tmp;
tmp = Power(fpValue, 5) / Factorial(5);
fpValueResult = fpValueResult + tmp;
tmp = Power(fpValue, 7) / fac7;
fpValueResult = fpValueResult - tmp;
/* 9! will overflow the 16bit integer part, so calc step by step */
tmp = Power(fpValue, 9) / fac7;
tmp = tmp / (9 * 8);
fpValueResult = fpValueResult + tmp;
return fpValueResult ;
}
7.2 Cosine
To calculate the cosine of decimal value x, again an approximation using Taylor series is used:
Also here the input value must be in range [-PI, PI] . As the implementation is very similar to sine, the listing is left out.
7.3 Tangent
The easiest calculation of tangent is using the equation tan(x) = sin(x) / cos(x) because the functions for sine and cosine already exist:
{
if (fpValue <= -SUNFP_PIHALF || fpValue >= SUNFP_PIHALF)
{
throw new ArgumentException("Sin: Value out of range (-PI/2, PI/2)");
}
return Sin(fpValue) / Cos(fpValue);
}
8. Square root, exponentiation and logarithm
8.1 Square root
To calculate the square root, Newton's method for approximating the root of any function is used. That is for any function f we want to find x so that f(x) = 0 (see also [4]).
This method starts with an initial guess and calculates iteratively a new guess that is closer to the actual value than the previous value. If we have an approximation of the result value xn, then a better approximation xn+1 is calculated as:
We want to find the result of sqrt(x). That is the same as to find the root of the function f(x) = x2 - a = 0 because the result is sqrt(a).
The derivation of f(x) is f'(x) = 2*x. So let's insert f(x) and f'(x) into Newton's approximation formula:
xn+1 = x - (x2 - a) / 2x
= ((x * 2x) / 2x) - ( (x2 - a) / 2x)
= ((2x2) / 2x) - ( (x2 - a) / 2x)
= (x2 + a) / 2x
= 0.5x + 0.5(a + x)
= 0.5 (x + a/x)
Two small things are missing:
- First we need an initial 'guess value' for x0. For simplicity we just use 1. However the better the initial start value is, the faster the approximation approaches the target value.
- Secondly, a termination condition is required to abort the iterative approximation. It's unknown how fast the approximation works, thus two conditions are chosen in my sample implementation: First a maximum of ten iterations are allowed, then the algorithm terminates at the latest. Secondly, the algorithm terminates earlier if a good approximation is found. At each step there is a guess xn that shall be the result of sqrt(x) - so if xn2 is very near to x, the algorithm can be stopped as the required accuracy is reached. Here's a sample implementation:
{
SunFp fpValueResult = 1;
SunFp error = 0.001f;
int iteration = 0;
while ((iteration++ < 10) && (SunFp.Abs((fpValueResult * fpValueResult ) - fpValue) >= error) )
{
fpValueResult = 0.5f * (fpValueResult + (fpValue / fpValueResult ));
}
return fpValueResult ;
}
8.2 Logarithm
Two different approaches are implemented to calculate the natural logarithm (to base e) - note that there are many more possibilities to approximate the logarithm. The logarithm to any other base than e can be computed by base changing, see e.g. [3].
8.2.1 Logarithm using Arctanh
The logarithm can be calculated using the inverse hyperbolic tangent according to following equation:
where x must be in range [0, +inf). The inverse hyperbolic tangent itself can be approximated using following Taylor series:
The listing of the implementation is left out, have a look inside the source code.
8.2.2 Logarithm using Taylor series
The logarithm can be approximated using following Taylor series:
This is only valid for input range (-1, 1). Therefore following steps are performed to handle also the general case:
- The logarithm of x, log(x), shall be calculated instead of log(x+1). Thus the input value is subtracted by 1 - this gives us the input range (0, 2).
- To handle all input values, in range [0, +inf), scaling is used. So the input value is scaled in a way that one part falls in range (0, 2) which can be calculated using above Taylor series, the other part is pre-calculated. The fact log(a * b) = log(a) + log(b) is used.
log(25) = log(25/16 * 16) = log(25/16) + log(16).
log(16) is precalculated, while log(25/16) = log(1.5625) is calculated using the Taylor series.
The appropriate divider from the 2x lookup table is searched, then the input value is subtracted by one, then the logarithm is actually computed. Here's the implementation:
{
new SunFp(0), // index: 0 -> LogN(1)
new SunFp(0.6931f), // index: 1 -> LogN(2)
new SunFp(1.3863f), // index: 2 -> LogN(4)
new SunFp(2.0794f), // index: 3 -> LogN(8)
new SunFp(2.7726f), // index: 4 -> LogN(16)
new SunFp(3.4657f), // index: 5 -> LogN(32)
new SunFp(4.1589f), // index: 6 -> LogN(64)
new SunFp(4.8520f), // index: 7 -> LogN(128)
new SunFp(5.5452f), // index: 8 -> LogN(256)
new SunFp(6.2383f), // index: 9 -> LogN(512)
new SunFp(6.9315f), // index: 10 -> LogN(1024)
new SunFp(7.6246f), // index: 11 -> LogN(2048)
new SunFp(8.3178f), // index: 12 -> LogN(4096)
new SunFp(9.0109f), // index: 13 -> LogN(8192)
new SunFp(9.7041f), // index: 14 -> LogN(16384)
new SunFp(10.3972f), // index: 15 -> LogN(32768)
};
public static SunFp LogN_Taylor(SunFp fpValue)
{
int dividerLogLoopUpTableIdx;
if (fpValue < 0)
{
throw new ArgumentException("LogN_Taylor: Value out of range [0, +inf]");
}
SunFp divider = 1;
dividerLogLoopUpTableIdx = 0;
/* find divider so that fpValue >= divider and fpValue < divider + 1 */
SunFp lowerCmp = 2;
for (int exp = 1; exp < LogNDividerLookTable.Length; exp++)
{
SunFp upperCmp = lowerCmp * 2;
if (upperCmp == new SunFp(-32768))
{
/* 2^16 is 32768 due to overflow, subtract one to get 32767 */
upperCmp -= 1;
}
if (fpValue >= lowerCmp && fpValue <= upperCmp)
{
divider = lowerCmp;
dividerLogLoopUpTableIdx = exp;
break;
}
lowerCmp = upperCmp;
}
fpValue = fpValue / divider;
fpValue = fpValue - 1;
// taylor series
SunFp res = fpValue;
res -= (Power(fpValue, 2) / 2);
res += (Power(fpValue, 3) / 3);
res -= (Power(fpValue, 4) / 4);
res += (Power(fpValue, 5) / 5);
res -= (Power(fpValue, 6) / 6);
res += (Power(fpValue, 7) / 7);
// add precalculated 2^x logarithm
res += LogNDividerLookTable[dividerLogLoopUpTableIdx];
return res;
}
8.3 Exponentiation
Exponentiation to base e is calculated with following equation:
Here's the implementation:
{
SunFp fpRet;
SunFp tmp = 1;
fpRet = 1;
for (int iteration = 1; iteration <= 14; iteration++)
{
tmp = 1;
for (int i = 1; i <= iteration; i++)
{
tmp = tmp * (v / i);
}
fpRet = fpRet + tmp;
}
return fpRet;
}
9. Conclusion & References
That's it! Those were the notes of my own try to implement a fixed-point class. Hope it was interesting for you and have fun!
Sunshine, April 2017
References
[1] Fixed-point arithmetic @ Wikipedia[2] Randy Yates - Fixed-Point Arithmetic: An Introduction
[3] Logarithm @ Wikipedia
[4] Newton iterative sqrt method
History
- 2017-04-18: First version
- 2021-03-13: Corrected the Taylor series for cosine from cos = x - (...) to cos = 1 - (...).
- 2022-07-03: Added definition of SUNFP_ONE, changed FP_ONE to SUNFP_ONE for consistency, fixed some typos.