Tags: math, language-agnostic, floating-point, floating-accuracy

Q: Is floating point math broken?

Consider the following code:

0.1 + 0.2 == 0.3  ->  false
0.1 + 0.2         ->  0.30000000000000004

Why do these inaccuracies happen?

A1:

Binary floating point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
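
To make that advice concrete, here is a minimal sketch in Python (the tolerance 1e-9 is an arbitrary illustrative choice, not a recommendation; pick one suited to your application as described above):

import math

x = 0.1 + 0.2
y = 0.3

print(x == y)                            # False: exact equality fails
print(abs(x - y) < 1e-9)                 # True: comparison with an absolute tolerance
print(math.isclose(x, y, rel_tol=1e-9))  # True: relative tolerance, often a better fit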

A2:

A Hardware Designer's Perspective

I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.

2. Standards

Most processors follow the IEEE-754 standard, but some use denormalized numbers or different standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754, which is the typical mode of operation.

In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.

3. Cause of Rounding Error in Division

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, i.e. the number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k >= 2. So for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2 + 2 = 4 bits (plus a few optional bits).

3.1 Division Rounding Error: Approximation of Reciprocal

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

4. Rounding Errors in Other Operations: Truncation

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

5. Repeated Operations

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that, in computations that require a bounded error, mathematicians use methods such as IEEE-754's round-to-nearest even digit in the last place (because, over time, the errors are more likely to cancel each other out) and Interval Arithmetic, combined with variations of the IEEE 754 rounding modes, to predict rounding errors and correct them. Because of its low relative error compared to the other rounding modes, round to nearest even digit (in the last place) is the default rounding mode of IEEE-754.

Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using truncation, round-up, or round-down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.

6. Summary

In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
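
A tiny illustration of that summary, sketched in Python (the effect itself is language-independent): each addition below is individually correct to within half a unit in the last place, but the rounding errors compound across operations:

total = 0.0
for _ in range(10):
    total += 0.1      # each individual addition rounds; the errors accumulate

print(total)          # 0.9999999999999999, not 1.0
print(total == 1.0)   # False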

A3:

When you convert .1 or 1/10 to base 2 (binary) you get a repeating pattern after the decimal point, just like trying to represent 1/3 in base 10. The value is not exact, and therefore you can't do exact math with it using normal floating point methods.
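
You can watch that repeating pattern appear by doing the base-2 long division yourself; a minimal Python sketch using exact rational arithmetic:

from fractions import Fraction

f = Fraction(1, 10)   # the exact value 1/10
bits = []
for _ in range(20):   # generate the first 20 binary digits
    f *= 2
    bits.append(int(f >= 1))
    f -= int(f)

print("0." + "".join(map(str, bits)))  # 0.00011001100110011001 - the 0011 block repeats forever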

A4:

Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.

Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.

That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.

Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)


Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.

For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.

(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)

In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.

Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.

In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
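
If you want to check these digits yourself, Python's decimal module can display the exact value a double holds (a small verification sketch):

from decimal import Decimal

print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.1 + 0.2))  # 0.3000000000000000444089209850062616169452667236328125
print(Decimal(0.3))        # 0.299999999999999988897769753748434595763683319091796875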


P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.

(Originally posted on Quora.)

A5:

Floating point rounding errors. 0.1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers don't have an infinite amount of memory.
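
The truncation of that infinite expansion is visible in the hexadecimal form of the stored double, for example in Python:

# 1/10 in binary is 0.000110011001100... repeating; normalized in hex that is
# 1.99999... x 2^-4, and the double cuts it off and rounds the last digit:
print((0.1).hex())  # 0x1.999999999999ap-4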

A6:

In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.

For example:

var result = 1.0 + 2.0;     // result === 3.0 returns true

... instead of:

var result = 0.1 + 0.2;     // result === 0.3 returns false

The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact, so decimal representation errors can be avoided by scaling.

As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
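
A minimal sketch of the cents approach, shown in Python (the variable names and the 10% tax are made up for illustration; the idea carries over directly to JavaScript):

price_cents = 2550                     # $25.50 held as an exact integer
tax_cents = 255                        # a hypothetical 10% tax, also in cents
total_cents = price_cents + tax_cents  # integer arithmetic: no rounding error
print(total_cents / 100)               # 28.05 - convert to dollars only for display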


¹ Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).

A7:

My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.

Preamble

An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s * (1.m51 m50 ... m2 m1 m0)_2 * 2^(e - 1023)

in 64 bits:

  • The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
  • The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
  • The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always² omitted, since the most significant bit of any binary value is 1.

1 - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².

2 - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is dmin ≤ |x| ≤ dmax, where dmin (the smallest representable nonzero number) is 2^(-1023 - 51) (≈ 4.94 * 10^-324) and dmax (the largest denormal number, for which the mantissa consists entirely of 1s) is 2^(-1023 + 1) - 2^(-1023 - 51) (≈ 2.225 * 10^-308).


Turning a double precision number to binary

Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:):

public static string BinaryRepresentation(double value)
{
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    string bits = Convert.ToString(valueInLongType, 2);
    string leadingZeros = new string('0', 64 - bits.Length);
    string binaryRepresentation = leadingZeros + bits;

    string sign = binaryRepresentation[0].ToString();
    string exponent = binaryRepresentation.Substring(1, 11);
    string mantissa = binaryRepresentation.Substring(12);

    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}

Getting to the point: the original question

(Skip to the bottom for the TL;DR version)

Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.

Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:

0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010

Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.

Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10^-3 * 1.23 == 10^-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2^p, where 'a' is an integer.

Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:

0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125

To add two numbers, the exponent needs to be the same, i.e.:

0.1 => 2^-3 *  0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 *  1.1001100110011001100110011001100110011001100110011010
sum =  2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397  = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794  = 0.200000000000000011102230246251565404236316680908203125
sum =  2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875

Since the sum is not of the form 2^n * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)
    = 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875

There are now 53 bits in the mantissa (the 53rd is in parentheses in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. if a number x falls halfway between two values a and b, the value where the least significant bit is zero is chosen.

a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
  = 2^-2  * 1.0011001100110011001100110011001100110011001100110011

x = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)

b = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
  = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
    = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

whereas the binary representation of 0.3 is:

0.3 => 2^-2  * 1.0011001100110011001100110011001100110011001100110011
    =  2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875

which only differs from the binary representation of the sum of 0.1 and 0.2 by 2^-54.

The binary representations of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representations, due to the default rounding mode, results in a value which differs only in the least significant bit.

TL;DR

Writing 0.1 + 0.2 in an IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):

0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3       => 0:01111111101:0011001100110011001100110011001100110011001100110[011]

Converted back to decimal, these values are:

0.1 + 0.2 => 0.300000000000000044408920985006...
0.3       => 0.299999999999999988897769753748...

The difference is exactly 2^-54, which is ~5.5511151231258 × 10^-17 - insignificant (for many applications) when compared to the original values.

Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.

Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
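
If you'd like to reproduce the TL;DR bit patterns yourself, here is a minimal Python sketch (using the standard struct module to get the raw big-endian IEEE 754 bytes):

import struct

print(struct.pack('>d', 0.1 + 0.2).hex())  # 3fd3333333333334
print(struct.pack('>d', 0.3).hex())        # 3fd3333333333333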

A8:

Floating point numbers stored in the computer consist of two parts: an integer, and an exponent that the base is raised to and then multiplied by the integer part.

If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 will obviously result in 0.3.

Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.

The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵; adding them together results in 10808639105689191 x 2⁻⁵⁵, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.
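
In Python you can see this integer-times-power-of-two form directly, which reproduces the numbers above:

print((0.1).as_integer_ratio())    # (3602879701896397, 36028797018963968)
print((0.2).as_integer_ratio())    # (7205759403792794, 36028797018963968)
print(36028797018963968 == 2**55)  # True: both denominators are 2^55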

A9:

Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.

A10:

My workaround:

function add(a, b, precision) {
    var x = Math.pow(10, precision || 2);
    return (Math.round(a * x) + Math.round(b * x)) / x;
}

precision refers to the number of digits you want to preserve after the decimal point during addition.

A11:

A lot of good answers have been posted, but I'd like to append one more.

Not all numbers can be represented via floats/doubles. For example, the number "0.2" will be represented as "0.200000003" in single precision under the IEEE 754 floating point standard.

Under the hood, the model used to store real numbers represents a float as a significand multiplied by the base raised to an exponent.

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, on a computer with an FPU that uses the "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".

So it is a bit hard to represent such numbers exactly, even if you specify the variable explicitly without any intermediate calculation.
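
You can observe the single-precision value quoted above by round-tripping 0.2 through a 32-bit float, for example in Python:

import struct

# pack as a 32-bit float, unpack back into a (64-bit) Python float:
single = struct.unpack('f', struct.pack('f', 0.2))[0]
print(single)  # 0.20000000298023224 - the "0.200000003" mentioned above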

A12:

Some statistics related to this famous double precision question.

When adding all pairs of values (a + b) using a step of 0.1 (from 0.1 to 100), we have a ~15% chance of a precision error. Note that the error could result in slightly bigger or smaller values. Here are some examples:

0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)

When subtracting all pairs of values (a - b, where a > b) using a step of 0.1 (from 100 to 0.1), we have a ~34% chance of a precision error. Here are some examples:

0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)

*15% and 34% are indeed huge, so always use BigDecimal when precision is of big importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
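
A rough sketch of how statistics like these can be gathered, in Python (brute force, so it takes a little while; the exact percentage depends on how "precision error" is defined, so treat the numbers as approximate):

from decimal import Decimal

values = [round(i * 0.1, 1) for i in range(1, 1001)]  # 0.1, 0.2, ..., 100.0
errors = total = 0
for a in values:
    for b in values:
        total += 1
        exact = Decimal(str(a)) + Decimal(str(b))  # the sum we intended, in decimal
        if a + b != float(exact):                  # double sum vs correctly rounded sum
            errors += 1

print(f"{100 * errors / total:.1f}% of the sums are off")  # roughly 15%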

A13:

No, not broken, but most decimal fractions must be approximated

Summary

Floating point arithmetic is exact; unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.

Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base 2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict, so we throw up our hands and say "FP is inexact", but that's not really true.

We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.

How did this happen?

When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form

a / (2^n x 5^m)

In binary, we only get the 2^n term, that is:

a / 2^n

So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction can also be written as a base 10 fraction. However, hardly anything we write as a base 10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers whose denominators contain only the prime factor 2.

In base 10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.

So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
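
The "only three numbers" claim above is easy to check mechanically; a minimal Python sketch:

from fractions import Fraction

exact = []
for k in range(1, 100):
    d = Fraction(k, 100).denominator   # reduced denominator of k/100
    if (d & (d - 1)) == 0:             # power of two => exactly representable in binary
        exact.append(k / 100)

print(exact)  # [0.25, 0.5, 0.75]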

Dealing with it

Developers are usually instructed to do < epsilon comparisons; better advice might be to round to integral values (in the C library: round() and roundf(), i.e., staying in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.

Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.

The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.

I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)

Conclusion

If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.

A14:

Did you try the duct tape solution?

Try to determine when errors occur and fix them with short if statements; it's not pretty, but for some problems it is the only solution, and this is one of them.

 if ((n * 0.1) < 100.0) { return n * 0.1 - 0.000000000000001; }
 else                   { return n * 0.1 + 0.000000000000001; }

I had the same problem in a scientific simulation project in C#, and I can tell you that if you ignore the butterfly effect, it's gonna turn into a big fat dragon and bite you in the a**.

A15:

In order to offer the best solution, I can say I discovered the following method:

parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3

Let me explain why it's the best solution. As others mentioned in the answers above, it's a good idea to use the ready-to-use JavaScript toFixed() function to solve the problem. But most likely you'll encounter some problems.

Imagine you are going to add up two float numbers like 0.2 and 0.7; here it is: 0.2 + 0.7 = 0.8999999999999999.

Your expected result was 0.9, which means you need a result with 1-digit precision in this case. So you should have used (0.2 + 0.7).toFixed(1), but you can't just give a fixed parameter to toFixed() since it depends on the given number, for instance

`0.22 + 0.7 = 0.9199999999999999`

In this example you need 2-digit precision, so it should be toFixed(2). So what should the parameter be to fit every given float number?

You might say let it be 10 in every situation then:

(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"

Damn! What are you going to do with those unwanted zeros after the 9? It's time to convert it to a float to make it as you desire:

parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9

Now that you found the solution, it's better to offer it as a function like this:

function floatify(number){
           return parseFloat((number).toFixed(10));
        }

Let's try it ourselves:

function floatify(number){
       return parseFloat((number).toFixed(10));
    }
 
function addUp(){
  var number1 = +$("#number1").val();
  var number2 = +$("#number2").val();
  var unexpectedResult = number1 + number2;
  var expectedResult = floatify(number1 + number2);
  $("#unexpectedResult").text(unexpectedResult);
  $("#expectedResult").text(expectedResult);
}
addUp();
input{
  width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>

You can use it this way:

var x = 0.2 + 0.7;
floatify(x);  => Result: 0.9

As W3SCHOOLS suggests there is another solution too, you can multiply and divide to solve the problem above:

var x = (0.2 * 10 + 0.1 * 10) / 10;       // x will be 0.3

Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all, although it seems the same! I prefer the first solution, since I can apply it as a function which converts the input float to an accurate output float.

A16:

Those weird numbers appear because computers use a binary (base 2) number system for calculation purposes, while we use decimal (base 10).

The majority of fractional numbers cannot be represented precisely either in binary or in decimal, or in both. The result is a rounded (but in itself precise) number.

A17:

Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java.

Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.

Applying it to the numbers in the question, treated as doubles:

0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,

0.2 converts to 0.200000000000000011102230246251565404236316680908203125,

0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and

0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.

Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator, shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.

If it were rounded down to the equivalent of 0.3, the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 gives the same rounding error of 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.

Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and therefore is the correct result.
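
For a language-based alternative to the online converter, Python's decimal module reproduces the same digits (the context precision is raised so the exact sum is not itself rounded):

from decimal import Decimal, getcontext

getcontext().prec = 60                 # enough digits that nothing below gets rounded
exact_sum = Decimal(0.1) + Decimal(0.2)
print(exact_sum)                       # 0.3000000000000000166533453693773481063544750213623046875
print(exact_sum - Decimal(0.3))                  # error if rounded down: 2.77555...E-17
print(Decimal(0.30000000000000004) - exact_sum)  # error if rounded up: the same 2.77555...E-17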

A18:

Given that nobody has mentioned this...

Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:

  • Python's decimal module and Java's BigDecimal class, that represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone, however they solve most common problems with binary floating point arithmetic.

    Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:

    >>> from decimal import Decimal
    >>> 0.1 + 0.2 == 0.3
    False
    >>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
    True
    

    Python's decimal module is based on IEEE standard 854-1987.

  • Python's fractions module and Apache Common's BigFraction class. Both represent rational numbers as (numerator, denominator) pairs and they may give more accurate results than decimal floating point arithmetic.

Neither of these solutions is perfect (especially if we look at performance, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
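
For completeness, a minimal sketch of the fractions approach in Python:

from fractions import Fraction

# Rational arithmetic is exact - no representation error, no rounding:
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
print(Fraction(1, 3) + Fraction(1, 3) + Fraction(1, 3))      # 1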

A19:

Can I just add: people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3 + 1/3 = 2/3) = true unless you have infinity to add 0.333... to 0.333..., so just as with the (1/10 + 2/10) !== 3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667, which would also be technically inaccurate.

Count in ternary, and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken...

A20:

The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)

This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.

If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.

But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.

A21:

Just for fun, I played with the representation of floats, following the definitions from the C99 standard, and I wrote the code below.

The code prints the binary representation of floats in 3 separate groups

SIGN EXPONENT FRACTION

and after that it prints a sum of fractions that, when evaluated with enough precision, shows the value that really exists in hardware.

So when you write float x = 999..., the compiler transforms that number into the bit representation printed by the function xx, such that the sum printed by the function yy equals the given number.

In reality, this sum is only an approximation. For the number 999,999,999 the compiler will insert, in the bit representation of the float, the number 1,000,000,000.

After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exist in hardware, as inserted there by the compiler.

#include <stdio.h>
#include <limits.h>
#include <string.h>
#include <stdint.h>

/* Print the raw bits of a float, labelling the three IEEE 754 groups. */
void
xx(float *x)
{
    uint32_t bits;
    memcpy(&bits, x, sizeof bits);  /* read exactly 4 bytes, no aliasing issues */
    unsigned char i = sizeof(*x)*CHAR_BIT-1;
    do {
        switch (i) {
        case 31:
             printf("sign:");
             break;
        case 30:
             printf("exponent:");
             break;
        case 23:
             printf("fraction:");
             break;

        }
        printf("%d ", (bits & ((uint32_t)1 << i)) != 0);
    } while (i--);
    printf("\n");
}

/* Print the float as a sum of fractions: (1 + sum of 1/2^j) * 2^exponent. */
void
yy(float a)
{
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    int sign = !(bits & ((uint32_t)1 << 31));
    int fraction = ((1 << 23) - 1) & bits;
    int exponent = (255 & (bits >> 23)) - 127;

    printf(sign ? "positive" " ( 1+" : "negative" " ( 1+");
    unsigned int i = 1 << 22;
    unsigned int j = 1;
    do {
        char b = (fraction & i) != 0;
        /* print 1/(2^j), followed by '+' or the closing ')' */
        b && (printf("1/(%d) %c", 1 << j, (fraction & (i - 1)) ? '+' : ')'), 0);
    } while (j++, i >>= 1);

    printf("*2^%d", exponent);
    printf("\n");
}

int
main(void)
{
    float x = -3.14;
    float y = 999999999;
    printf("%zu\n", sizeof(x));
    xx(&x);
    xx(&y);
    yy(x);
    yy(y);
    return 0;
}

Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms output by the main program. One can also paste that sum into a Python REPL or something similar.

-- .../terra1/stub
@ qemacs f.c
-- .../terra1/stub
@ gcc f.c
-- .../terra1/stub
@ ./a.out
sign:1 exponent:1 0 0 0 0 0 0 fraction:0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign:0 exponent:1 0 0 1 1 1 0 fraction:0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
@ bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872

That's it. With scale=15, bc evaluates the sum as

999999999.999999446351872

The small shortfall from a round number is bc's own truncation: each 1/2^k term was computed to only 15 decimal places. You can also check with bc that -3.14 is perturbed in the same way; do not forget to set a scale factor in bc.

The displayed sum is what is inside the hardware. The value you obtain by computing it depends on the scale you set; mathematically, with infinite precision, the sum is exactly 1,000,000,000, which is what the compiler actually stored for 999999999.
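
As a cross-check, here is a short Python sketch (my addition, not part of the original session) that evaluates the same sum with exact rational arithmetic instead of bc's truncated decimals:

from fractions import Fraction

# The sum printed by yy() for 999999999, evaluated exactly.
denominators = [2, 4, 16, 32, 64, 512, 1024, 4096, 16384, 32768, 262144, 1048576]
total = 1 + sum(Fraction(1, d) for d in denominators)
print(total * 2**29)  # prints 1000000000 exactly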

A22:

Another way to look at this: 64 bits are used to represent numbers. As a consequence, no more than 2**64 = 18,446,744,073,709,551,616 distinct numbers can be represented exactly.

However, math says there are infinitely many real numbers between 0 and 1 alone. IEEE 754 defines an encoding that uses these 64 bits efficiently for a much larger number space, plus NaN and +/- Infinity, so there are gaps between exactly representable numbers, filled with numbers that can only be approximated.

Unfortunately, 0.3 sits in such a gap.
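
A small Python sketch (assuming Python 3.9+ for math.nextafter; my own illustration) makes the gap visible by printing the representable doubles on either side of 0.3:

import math
from decimal import Decimal

below = math.nextafter(0.3, -math.inf)  # nearest representable double below
above = math.nextafter(0.3, math.inf)   # nearest representable double above
print(Decimal(below))  # 0.2999999999999999333866185224...
print(Decimal(0.3))    # 0.2999999999999999888977697537...
print(Decimal(above))  # 0.3000000000000000444089209850...
# The real number 0.3 falls strictly between two representable values.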

A23:

Since this thread branched off a bit into a general discussion of current floating point implementations, I'd add that there are projects working on fixing their issues.

Take a look at https://posithub.org/ for example, which showcases a number type called posit (and its predecessor unum) that promises to offer better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problems in the question. Quite an interesting project; the person behind it is the mathematician Dr. John Gustafson. The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).

A24:

Imagine working in base ten with, say, 8 digits of accuracy. You check whether

1/3 + 2/3 == 1

and learn that this returns false. Why? Well, as real numbers we have

1/3 = 0.333.... and 2/3 = 0.666....

Truncating at eight decimal places, we get

0.33333333 + 0.66666666 = 0.99999999

which is, of course, different from 1.00000000 by exactly 0.00000001.


The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have

1/10 = 0.0001100110011001100... (base 2)

and

1/5 = 0.0011001100110011001... (base 2)

If we truncated these to, say, seven bits, then we'd get

0.0001100 + 0.0011001 = 0.0100101

while on the other hand,

3/10 = 0.01001100110011... (base 2)

which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.


The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.
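
A quick Python sketch (my illustration) shows this normalized form directly: float.hex() prints the significand and the power-of-two exponent of the value actually stored:

import math

print((0.1).hex())      # 0x1.999999999999ap-4, i.e. roughly 1.6 * 2**-4
m, e = math.frexp(0.1)  # frexp returns m in [0.5, 1) with 0.1 == m * 2**e
print(m, e)             # 0.8 -3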

The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.
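
Here is a minimal sketch of such a comparison in Python (the helper is my own; it mirrors the semantics of math.isclose, which a later answer shows directly):

def approx_equal(a, b, rel_tol=1e-9, abs_tol=0.0):
    # Compare with a relative tolerance, falling back to an absolute one.
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

print(0.1 + 0.2 == 0.3)              # False
print(approx_equal(0.1 + 0.2, 0.3))  # True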

A25:

It's actually pretty simple. When you have a base 10 system (like ours), it can only cleanly express fractions whose denominators use only prime factors of the base. The prime factors of 10 are 2 and 5, so 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (base 2), the only prime factor is 2, so you can only cleanly express fractions whose denominators contain only 2 as a prime factor. In binary, 1/2, 1/4, and 1/8 would all be expressed cleanly, while 1/5 or 1/10 would be repeating. So 0.1 and 0.2 (1/10 and 1/5), while clean decimals in a base 10 system, are repeating fractions in the base 2 system the computer operates in. When you do math on these repeating fractions, you end up with leftovers which carry over when you convert the computer's base 2 (binary) number into a more human-readable base 10 number.

From https://0.30000000000000004.com/
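
You can see those leftovers directly in Python (a sketch of my own): Fraction(float) recovers the exact binary rational the machine actually stored:

from fractions import Fraction

print(Fraction(0.5))  # 1/2: the denominator is a power of two, stored exactly
print(Fraction(0.1))  # 3602879701896397/36028797018963968: denominator is 2**55
# 1/10 has the prime factor 5 in its denominator, so no power of two can match it.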

A26:

Since Python 3.5, you can use the math.isclose() function to test for approximate equality:

>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False

A27:

Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary-encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3, as can be seen more clearly here:

#include <stdio.h>

int main() {
    printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
    printf("0.1 is %.23f\n", 0.1);
    printf("0.2 is %.23f\n", 0.2);
    printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
    printf("0.3 is %.23f\n", 0.3);
    printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
    return 0;
}

Output:

0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17

For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C standard does not specify such types by default, but they are described as an extension in a Technical Report.

The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
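
Where those C types are unavailable, the same idea can be tried with Python's decimal module (my analogue, not from the original answer), which stores base-10 digits and so evaluates this particular arithmetic exactly:

from decimal import Decimal

print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
# Construct from strings: Decimal(0.1) would inherit the binary rounding error.
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625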

A28:

Math.sum (JavaScript) .... a kind of operator replacement

.1 + .0001 + -.1 --> 0.00010000000000000286
Math.sum(.1 , .0001, -.1) --> 0.0001

Object.defineProperties(Math, {
    sign: {
        value: function (x) {
            return x ? x < 0 ? -1 : 1 : 0;
            }
        },
    precision: {
        value: function (value, precision, type) {
            var v = parseFloat(value), 
                p = Math.max(precision, 0) || 0, 
                t = type || 'round';
            return (Math[t](v * Math.pow(10, p)) / Math.pow(10, p)).toFixed(p);
        }
    },
    scientific_to_num: {  // this is from https://gist.github.com/jiggzson
        value: function (num) {
            //if the number is in scientific notation remove it
            if (/e/i.test(num)) {
                var zero = '0',
                        parts = String(num).toLowerCase().split('e'), //split into coeff and exponent
                        e = parts.pop(), //store the exponential part
                        l = Math.abs(e), //get the number of zeros
                        sign = e / l,
                        coeff_array = parts[0].split('.');
                if (sign === -1) {
                    num = zero + '.' + new Array(l).join(zero) + coeff_array.join('');
                } else {
                    var dec = coeff_array[1];
                    if (dec)
                        l = l - dec.length;
                    num = coeff_array.join('') + new Array(l + 1).join(zero);
                }
            }
            return num;
         }
     },
    get_precision: {
        value: function (number) {
            var arr = Math.scientific_to_num((number + "")).split(".");
            return arr[1] ? arr[1].length : 0;
        }
    },
    diff:{
        value: function(A,B){
            var prec = this.max(this.get_precision(A),this.get_precision(B));
            return +this.precision(A-B,prec);
        }
    },
    sum: {
        value: function () {
            var prec = 0, sum = 0;
            for (var i = 0; i < arguments.length; i++) {
                prec = this.max(prec, this.get_precision(arguments[i]));
                sum += +arguments[i]; // force float to convert strings to number
            }
            return Math.precision(sum, prec);
        }
    }
});

the idea is to use Math methods instead of operators to avoid float errors

Math.diff(0.2, 0.11) == 0.09 // true
0.2 - 0.11 == 0.09 // false

also note that Math.diff and Math.sum auto-detect the precision to use

Math.sum accepts any number of arguments

A29:

I just saw this interesting issue around floating points:

Consider the following results:

>>> (2**53+1) - int(float(2**53+1))
1

We can clearly see a breakpoint at 2**53+1; everything works fine up to 2**53.

>>> (2**53) - int(float(2**53))
0

This happens because of the IEEE 754 double-precision binary floating-point format: binary64.

From the Wikipedia page for Double-precision floating-point format:

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision floating-point format, it lacks precision on integer numbers when compared with an integer format of the same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significand precision: 53 bits (52 explicitly stored)

The real value assumed by a given 64-bit double-precision datum with a given biased exponent $e$ and a 52-bit fraction is

$(-1)^{\text{sign}} \left(1 + \sum_{i=1}^{52} b_{52-i}\, 2^{-i}\right) \times 2^{e-1023}$

Thanks to @a_guest for pointing that out to me.
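
The gap size can be inspected directly in Python (assuming Python 3.9+ for math.ulp; my addition):

import math

print(math.ulp(1.0))      # 2.220446049250313e-16: spacing of doubles near 1.0
print(math.ulp(2.0**53))  # 2.0: above 2**53, consecutive doubles are 2 apart
print(float(2**53 + 1) == float(2**53))  # True: 2**53 + 1 rounds back down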

A30:

A different question has been marked as a duplicate of this one:

In C++, why is the result of cout << x different from the value that a debugger is showing for x?

The x in the question is a float variable.

One example would be

float x = 9.9F;

The debugger shows 9.89999962; the output of the cout operation is 9.9.

The answer turns out to be that cout's default precision is 6, so the value is rounded to 6 significant digits before being printed.

See here for reference
