|
|
xv | |
|
|
xix | |
Preface |
|
xxiii | |
|
Part I Introduction, Basic Definitions, and Standards |
|
|
1 | (94) |
|
|
3 | (12) |
|
|
4 | (3) |
|
|
7 | (1) |
|
1.3 Some Strange Behaviors |
|
|
8 | (7) |
|
|
8 | (1) |
|
|
9 | (6) |
|
2 Definitions and Basic Notions |
|
|
15 | (32) |
|
2.1 Floating-Point Numbers |
|
|
15 | (7) |
|
|
15 | (2) |
|
2.1.2 Normalized representations, normal and subnormal numbers |
|
|
17 | (2) |
|
2.1.3 A note on underflow |
|
|
19 | (2) |
|
2.1.4 Special floating-point data |
|
|
21 | (1) |
|
|
22 | (3) |
|
|
22 | (2) |
|
|
24 | (1) |
|
2.3 Tools for Manipulating Floating-Point Errors |
|
|
25 | (12) |
|
2.3.1 Relative error due to rounding |
|
|
25 | (4) |
|
|
29 | (5) |
|
2.3.3 Link between errors in ulps and relative errors |
|
|
34 | (1) |
|
2.3.4 An example: iterated products |
|
|
35 | (2) |
|
2.4 The Fused Multiply-Add (FMA) Instruction |
|
|
37 | (1) |
|
|
37 | (3) |
|
2.6 Lost and Preserved Properties of Real Arithmetic |
|
|
40 | (1) |
|
2.7 Note on the Choice of the Radix |
|
|
41 | (3) |
|
2.7.1 Representation errors |
|
|
41 | (2) |
|
2.7.2 A case for radix 10 |
|
|
43 | (1) |
|
|
44 | (3) |
|
3 Floating-Point Formats and Environment |
|
|
47 | (48) |
|
3.1 The IEEE 754-2008 Standard |
|
|
48 | (31) |
|
|
48 | (18) |
|
3.1.2 Attributes and rounding |
|
|
66 | (4) |
|
3.1.3 Operations specified by the standard |
|
|
70 | (2) |
|
|
72 | (1) |
|
3.1.5 Conversions to/from string representations |
|
|
73 | (1) |
|
3.1.6 Default exception handling |
|
|
74 | (3) |
|
|
77 | (2) |
|
3.1.8 Recommended functions |
|
|
79 | (1) |
|
3.2 On the Possible Hidden Use of a Higher Internal Precision |
|
|
79 | (3) |
|
3.3 Revision of the IEEE 754-2008 Standard |
|
|
82 | (1) |
|
3.4 Floating-Point Hardware in Current Processors |
|
|
83 | (6) |
|
3.4.1 The common hardware denominator |
|
|
83 | (1) |
|
|
84 | (1) |
|
3.4.3 Extended precision and 128-bit formats |
|
|
85 | (1) |
|
3.4.4 Rounding and precision control |
|
|
85 | (1) |
|
|
86 | (1) |
|
3.4.6 Binaryl6 (half-precision) support |
|
|
87 | (1) |
|
|
87 | (1) |
|
3.4.8 The legacy x87 processor |
|
|
88 | (1) |
|
3.5 Floating-Point Hardware in Recent Graphics Processing Units |
|
|
89 | (1) |
|
3.6 IEEE Support in Programming Languages |
|
|
90 | (1) |
|
3.7 Checking the Environment |
|
|
91 | (4) |
|
|
91 | (1) |
|
|
92 | (1) |
|
|
92 | (1) |
|
|
92 | (1) |
|
|
93 | (2) |
|
Part II Cleverly Using Floating-Point Arithmetic |
|
|
95 | (136) |
|
4 Basic Properties and Algorithms |
|
|
97 | (66) |
|
4.1 Testing the Computational Environment |
|
|
97 | (3) |
|
4.1.1 Computing the radix |
|
|
97 | (2) |
|
4.1.2 Computing the precision |
|
|
99 | (1) |
|
|
100 | (3) |
|
|
100 | (3) |
|
4.2.2 Exact multiplications and divisions |
|
|
103 | (1) |
|
4.3 Accurate Computation of the Sum of Two Numbers |
|
|
103 | (8) |
|
4.3.1 The Fast2Sum algorithm |
|
|
104 | (3) |
|
|
107 | (2) |
|
4.3.3 If we do not use rounding to nearest |
|
|
109 | (2) |
|
4.4 Accurate Computation of the Product of Two Numbers |
|
|
111 | (9) |
|
4.4.1 The 2MultFMA Algorithm |
|
|
112 | (1) |
|
4.4.2 If no FMA instruction is available: Veltkamp splitting and Dekker product |
|
|
113 | (7) |
|
4.5 Computation of Residuals of Division and Square Root with an FMA |
|
|
120 | (3) |
|
4.6 Another splitting technique: splitting around a power of 2 |
|
|
123 | (1) |
|
4.7 Newton-Raphson-Based Division with an FMA |
|
|
124 | (14) |
|
4.7.1 Variants of the Newton-Raphson iteration |
|
|
124 | (5) |
|
4.7.2 Using the Newton-Raphson iteration for correctly rounded division with an FMA |
|
|
129 | (7) |
|
4.7.3 Possible double roundings in division algorithms |
|
|
136 | (2) |
|
4.8 Newton-Raphson-Based Square Root with an FMA |
|
|
138 | (4) |
|
4.8.1 The basic iterations |
|
|
138 | (1) |
|
4.8.2 Using the Newton-Raphson iteration for correctly rounded square roots |
|
|
138 | (4) |
|
|
142 | (11) |
|
4.9.1 Conditions on the formats |
|
|
142 | (4) |
|
4.9.2 Conversion algorithms |
|
|
146 | (7) |
|
4.10 Conversion Between Integers and Floating-Point Numbers |
|
|
153 | (3) |
|
4.10.1 From 32-bit integers to floating-point numbers |
|
|
153 | (1) |
|
4.10.2 From 64-bit integers to floating-point numbers |
|
|
154 | (1) |
|
4.10.3 From floating-point numbers to integers |
|
|
155 | (1) |
|
4.11 Multiplication by an Arbitrary-Precision Constant with an FMA |
|
|
156 | (4) |
|
4.12 Evaluation of the Error of an FMA |
|
|
160 | (3) |
|
5 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values |
|
|
163 | (30) |
|
|
164 | (11) |
|
5.1.1 Floating-point arithmetic models |
|
|
165 | (1) |
|
5.1.2 Notation for error analysis and classical error estimates |
|
|
166 | (3) |
|
5.1.3 Some refined error estimates |
|
|
169 | (5) |
|
5.1.4 Properties for deriving validated running error bounds |
|
|
174 | (1) |
|
5.2 Computing Validated Running Error Bounds |
|
|
175 | (2) |
|
5.3 Computing Sums More Accurately |
|
|
177 | (12) |
|
5.3.1 Reordering the operands, and a bit more |
|
|
177 | (1) |
|
|
178 | (6) |
|
5.3.3 Summation algorithms that somehow imitate a fixed-point arithmetic |
|
|
184 | (3) |
|
5.3.4 On the sum of three floating-point numbers |
|
|
187 | (2) |
|
5.4 Compensated Dot Products |
|
|
189 | (1) |
|
5.5 Compensated Polynomial Evaluation |
|
|
190 | (3) |
|
6 Languages and Compilers |
|
|
193 | (38) |
|
6.1 A Play with Many Actors |
|
|
193 | (7) |
|
6.1.1 Floating-point evaluation in programming languages |
|
|
194 | (2) |
|
6.1.2 Processors, compilers, and operating systems |
|
|
196 | (1) |
|
6.1.3 Standardization processes |
|
|
197 | (2) |
|
6.1.4 In the hands of the programmer |
|
|
199 | (1) |
|
6.2 Floating Point in the C Language |
|
|
200 | (15) |
|
6.2.1 Standard C11 headers and IEEE 754-1985 support |
|
|
201 | (1) |
|
|
202 | (2) |
|
6.2.3 Expression evaluation |
|
|
204 | (5) |
|
6.2.4 Code transformations |
|
|
209 | (1) |
|
6.2.5 Enabling unsafe optimizations |
|
|
210 | (1) |
|
6.2.6 Summary: a few horror stories |
|
|
211 | (3) |
|
6.2.7 The CompCert C compiler |
|
|
214 | (1) |
|
6.3 Floating-Point Arithmetic in the C++ Language |
|
|
215 | (3) |
|
|
215 | (1) |
|
|
215 | (2) |
|
6.3.3 Overloaded functions |
|
|
217 | (1) |
|
6.4 FORTRAN Floating Point in a Nutshell |
|
|
218 | (4) |
|
|
218 | (2) |
|
6.4.2 IEEE 754 support in FORTRAN |
|
|
220 | (2) |
|
6.5 Java Floating Point in a Nutshell |
|
|
222 | (7) |
|
|
222 | (1) |
|
|
222 | (2) |
|
6.5.3 Infinities, NaNs, and signed zeros |
|
|
224 | (1) |
|
|
225 | (1) |
|
|
226 | (1) |
|
6.5.6 The BigDecimal package |
|
|
227 | (2) |
|
|
229 | (2) |
|
Part III Implementing Floating-Point Operators |
|
|
231 | (204) |
|
7 Algorithms for the Basic Operations |
|
|
233 | (34) |
|
7.1 Overview of Basic Operation Implementation |
|
|
233 | (2) |
|
7.2 Implementing IEEE 754-2008 Rounding |
|
|
235 | (5) |
|
7.2.1 Rounding a nonzero finite value with unbounded exponent range |
|
|
235 | (3) |
|
|
238 | (1) |
|
7.2.3 Underflow and subnormal results |
|
|
239 | (1) |
|
7.2.4 The inexact exception |
|
|
239 | (1) |
|
7.2.5 Rounding for actual operations |
|
|
240 | (1) |
|
7.3 Floating-Point Addition and Subtraction |
|
|
240 | (5) |
|
|
244 | (1) |
|
7.3.2 Decimal addition using binary encoding |
|
|
245 | (1) |
|
7.3.3 Subnormal inputs and outputs in binary addition |
|
|
245 | (1) |
|
7.4 Floating-Point Multiplication |
|
|
245 | (3) |
|
|
246 | (1) |
|
7.4.2 Handling subnormal numbers in binary multiplication |
|
|
247 | (1) |
|
|
248 | (1) |
|
7.5 Floating-Point Fused Multiply-Add |
|
|
248 | (8) |
|
7.5.1 Case analysis for normal inputs |
|
|
249 | (3) |
|
7.5.2 Handling subnormal inputs |
|
|
252 | (1) |
|
7.5.3 Handling decimal cohorts |
|
|
253 | (1) |
|
7.5.4 Overview of a binary FMA implementation |
|
|
254 | (2) |
|
7.6 Floating-Point Division |
|
|
256 | (3) |
|
7.6.1 Overview and special cases |
|
|
256 | (1) |
|
7.6.2 Computing the significand quotient |
|
|
257 | (1) |
|
7.6.3 Managing subnormal numbers |
|
|
258 | (1) |
|
7.6.4 The inexact exception |
|
|
259 | (1) |
|
|
259 | (1) |
|
7.7 Floating-Point Square Root |
|
|
259 | (2) |
|
7.7.1 Overview and special cases |
|
|
259 | (1) |
|
7.7.2 Computing the significand square root |
|
|
260 | (1) |
|
7.7.3 Managing subnormal numbers |
|
|
260 | (1) |
|
7.7.4 The inexact exception |
|
|
261 | (1) |
|
|
261 | (1) |
|
7.8 Nonhomogeneous Operators |
|
|
261 | (6) |
|
7.8.1 A software algorithm around double rounding |
|
|
262 | (2) |
|
7.8.2 The mixed-precision fused multiply-and-add |
|
|
264 | (1) |
|
|
265 | (1) |
|
7.8.4 Implementation issues |
|
|
265 | (2) |
|
8 Hardware Implementation of Floating-Point Arithmetic |
|
|
267 | (54) |
|
8.1 Introduction and Context |
|
|
267 | (5) |
|
8.1.1 Processor internal formats |
|
|
267 | (1) |
|
8.1.2 Hardware handling of subnormal numbers |
|
|
268 | (1) |
|
8.1.3 Full-custom VLSI versus reconfigurable circuits (FPGAs) |
|
|
269 | (1) |
|
8.1.4 Hardware decimal arithmetic |
|
|
270 | (1) |
|
|
271 | (1) |
|
8.2 The Primitives and Their Cost |
|
|
272 | (15) |
|
|
272 | (6) |
|
8.2.2 Digit-by-integer multiplication in hardware |
|
|
278 | (1) |
|
8.2.3 Using nonstandard representations of numbers |
|
|
278 | (2) |
|
8.2.4 Binary integer multiplication |
|
|
280 | (1) |
|
8.2.5 Decimal integer multiplication |
|
|
280 | (2) |
|
|
282 | (1) |
|
8.2.7 Leading-zero counters |
|
|
283 | (2) |
|
8.2.8 Tables and table-based methods for fixed-point function approximation |
|
|
285 | (2) |
|
8.3 Binary Floating-Point Addition |
|
|
287 | (8) |
|
|
287 | (1) |
|
8.3.2 A first dual-path architecture |
|
|
288 | (2) |
|
8.3.3 Leading-zero anticipation |
|
|
290 | (4) |
|
8.3.4 Probing further on floating-point adders |
|
|
294 | (1) |
|
8.4 Binary Floating-Point Multiplication |
|
|
295 | (6) |
|
|
295 | (1) |
|
8.4.2 FPGA implementation |
|
|
296 | (1) |
|
8.4.3 VLSI implementation optimized for delay |
|
|
297 | (3) |
|
8.4.4 Managing subnormals |
|
|
300 | (1) |
|
8.5 Binary Fused Multiply-Add |
|
|
301 | (3) |
|
8.5.1 Classic architecture |
|
|
301 | (1) |
|
|
302 | (2) |
|
8.6 Division and Square Root |
|
|
304 | (5) |
|
8.6.1 Digit-recurrence division |
|
|
305 | (3) |
|
|
308 | (1) |
|
8.7 Beyond the Classical Floating-Point Unit |
|
|
309 | (3) |
|
8.7.1 More fused operators |
|
|
309 | (1) |
|
8.7.2 Exact accumulation and dot product |
|
|
309 | (2) |
|
8.7.3 Hardware-accelerated compensated algorithms |
|
|
311 | (1) |
|
8.8 Floating-Point for FPGAs |
|
|
312 | (8) |
|
8.8.1 Optimization in context of standard operators |
|
|
312 | (2) |
|
8.8.2 Operations with a constant operand |
|
|
314 | (1) |
|
8.8.3 Computing large floating-point sums |
|
|
315 | (4) |
|
8.8.4 Block floating point |
|
|
319 | (1) |
|
8.8.5 Algebraic operators |
|
|
319 | (1) |
|
8.8.6 Elementary and compound functions |
|
|
320 | (1) |
|
|
320 | (1) |
|
9 Software Implementation of Floating-Point Arithmetic |
|
|
321 | (54) |
|
9.1 Implementation Context |
|
|
322 | (7) |
|
9.1.1 Standard encoding of binary floating-point data |
|
|
322 | (1) |
|
9.1.2 Available integer operators |
|
|
323 | (3) |
|
|
326 | (2) |
|
9.1.4 Design choices and optimizations |
|
|
328 | (1) |
|
9.2 Binary Floating-Point Addition |
|
|
329 | (12) |
|
9.2.1 Handling special values |
|
|
330 | (2) |
|
9.2.2 Computing the sign of the result |
|
|
332 | (1) |
|
9.2.3 Swapping the operands and computing the alignment shift |
|
|
333 | (2) |
|
9.2.4 Getting the correctly rounded result |
|
|
335 | (6) |
|
9.3 Binary Floating-Point Multiplication |
|
|
341 | (8) |
|
9.3.1 Handling special values |
|
|
341 | (2) |
|
9.3.2 Sign and exponent computation |
|
|
343 | (2) |
|
|
345 | (1) |
|
9.3.4 Getting the correctly rounded result |
|
|
346 | (3) |
|
9.4 Binary Floating-Point Division |
|
|
349 | (13) |
|
9.4.1 Handling special values |
|
|
350 | (1) |
|
9.4.2 Sign and exponent computation |
|
|
351 | (3) |
|
|
354 | (1) |
|
9.4.4 Getting the correctly rounded result |
|
|
355 | (7) |
|
9.5 Binary Floating-Point Square Root |
|
|
362 | (10) |
|
9.5.1 Handling special values |
|
|
362 | (1) |
|
9.5.2 Exponent computation |
|
|
363 | (2) |
|
9.5.3 Getting the correctly rounded result |
|
|
365 | (7) |
|
|
372 | (3) |
|
10 Evaluating Floating-Point Elementary Functions |
|
|
375 | (60) |
|
|
375 | (4) |
|
|
375 | (1) |
|
10.1.2 The various steps of function evaluation |
|
|
376 | (3) |
|
|
379 | (10) |
|
10.2.1 Basic range reduction algorithms |
|
|
379 | (3) |
|
10.2.2 Bounding the relative error of range reduction |
|
|
382 | (1) |
|
10.2.3 More sophisticated range reduction algorithms |
|
|
383 | (3) |
|
|
386 | (3) |
|
10.3 Polynomial Approximations |
|
|
389 | (7) |
|
|
390 | (1) |
|
10.3.2 L∞, or minimax, case |
|
|
391 | (3) |
|
10.3.3 "Truncated" approximations |
|
|
394 | (1) |
|
10.3.4 In practice: using the Sollya tool to compute constrained approximations and certified error bounds |
|
|
394 | (2) |
|
10.4 Evaluating Polynomials |
|
|
396 | (1) |
|
10.4.1 Evaluation strategies |
|
|
396 | (1) |
|
|
397 | (1) |
|
10.5 The Table Maker's Dilemma |
|
|
397 | (30) |
|
10.5.1 When there is no need to solve the TMD |
|
|
400 | (1) |
|
|
400 | (4) |
|
10.5.3 Finding the hardest-to-round points |
|
|
404 | (23) |
|
10.6 Some Implementation Tricks Used in the CRlibm Library |
|
|
427 | (8) |
|
|
428 | (1) |
|
10.6.2 Accurate second step |
|
|
429 | (1) |
|
10.6.3 Error analysis and the accuracy/performance tradeoff |
|
|
429 | (1) |
|
10.6.4 The point with efficient code |
|
|
430 | (5) |
|
|
435 | (118) |
|
|
437 | (16) |
|
|
437 | (2) |
|
11.2 Componentwise and Normwise Errors |
|
|
439 | (1) |
|
11.3 Computing ad ± be with an FMA |
|
|
440 | (2) |
|
11.4 Complex Multiplication |
|
|
442 | (1) |
|
11.4.1 Complex multiplication without an FMA instruction |
|
|
442 | (1) |
|
11.4.2 Complex multiplication with an FMA instruction |
|
|
442 | (1) |
|
|
443 | (4) |
|
11.5.1 Error bounds for complex division |
|
|
443 | (1) |
|
11.5.2 Scaling methods for avoiding over-/underflow in complex division |
|
|
444 | (3) |
|
11.6 Complex Absolute Value |
|
|
447 | (2) |
|
11.6.1 Error bounds for complex absolute value |
|
|
447 | (1) |
|
11.6.2 Scaling for the computation of complex absolute value |
|
|
447 | (2) |
|
|
449 | (2) |
|
11.7.1 Error bounds for complex square root |
|
|
449 | (1) |
|
11.7.2 Scaling techniques for complex square root |
|
|
450 | (1) |
|
11.8 An Alternative Solution: Exception Handling |
|
|
451 | (2) |
|
|
453 | (26) |
|
12.1 Introduction to Interval Arithmetic |
|
|
454 | (3) |
|
12.1.1 Definitions and the inclusion property |
|
|
454 | (2) |
|
12.1.2 Loss of algebraic properties |
|
|
456 | (1) |
|
12.2 The IEEE 1788-2015 Standard for Interval Arithmetic |
|
|
457 | (8) |
|
12.2.1 Structuration into levels |
|
|
457 | (1) |
|
|
458 | (2) |
|
|
460 | (3) |
|
12.2.4 Level 2: discretization issues |
|
|
463 | (1) |
|
|
464 | (1) |
|
12.2.6 Levels 3 and 4: implementation issues |
|
|
464 | (1) |
|
12.2.7 Libraries implementing IEEE 1788-2015 |
|
|
464 | (1) |
|
12.3 Intervals with Floating-Point Bounds |
|
|
465 | (3) |
|
12.3.1 Implementation using floating-point arithmetic |
|
|
465 | (1) |
|
|
465 | (2) |
|
12.3.3 Optimized rounding |
|
|
467 | (1) |
|
12.4 Interval Arithmetic and Roundoff Error Analysis |
|
|
468 | (8) |
|
12.4.1 Influence of the computing precision |
|
|
468 | (3) |
|
12.4.2 A more efficient approach: the mid-rad representation |
|
|
471 | (4) |
|
12.4.3 Variants: affine arithmetic, Taylor models |
|
|
475 | (1) |
|
|
476 | (3) |
|
13 Verifying Floating-Point Algorithms |
|
|
479 | (34) |
|
13.1 Formalizing Floating-Point Arithmetic |
|
|
479 | (8) |
|
13.1.1 Defining floating-point numbers |
|
|
480 | (2) |
|
13.1.2 Simplifying the definition |
|
|
482 | (1) |
|
13.1.3 Defining rounding operators |
|
|
483 | (3) |
|
13.1.4 Extending the set of numbers |
|
|
486 | (1) |
|
13.2 Formalisms for Verifying Algorithms |
|
|
487 | (6) |
|
|
487 | (2) |
|
13.2.2 Floating-point algorithms |
|
|
489 | (1) |
|
|
490 | (3) |
|
13.3 Roundoff Errors and the Gappa Tool |
|
|
493 | (20) |
|
13.3.1 Computing on bounds |
|
|
494 | (2) |
|
|
496 | (2) |
|
13.3.3 Manipulating expressions |
|
|
498 | (4) |
|
13.3.4 Handling the relative error |
|
|
502 | (1) |
|
13.3.5 Example: toy implementation of sine |
|
|
503 | (5) |
|
13.3.6 Example: integer division on Itanium |
|
|
508 | (5) |
|
14 Extending the Precision |
|
|
513 | (40) |
|
14.1 Double-Words, Triple-Words |
|
|
514 | (8) |
|
14.1.1 Double-word arithmetic |
|
|
515 | (5) |
|
14.1.2 Static triple-word arithmetic |
|
|
520 | (2) |
|
14.2 Floating-Point Expansions |
|
|
522 | (12) |
|
14.2.1 Renormalization of floating-point expansions |
|
|
525 | (1) |
|
14.2.2 Addition of floating-point expansions |
|
|
526 | (2) |
|
14.2.3 Multiplication of floating-point expansions |
|
|
528 | (3) |
|
14.2.4 Division of floating-point expansions |
|
|
531 | (3) |
|
14.3 Floating-Point Numbers with Batched Additional Exponent |
|
|
534 | (1) |
|
14.4 Large Precision Based on a High-Radix Representation |
|
|
535 | (18) |
|
|
536 | (1) |
|
14.4.2 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic |
|
|
537 | (1) |
|
14.4.3 A brief introduction to arbitrary-precision integer arithmetic |
|
|
538 | (3) |
|
|
541 | (12) |
Appendix A Number Theory Tools for Floating-Point Arithmetic |
|
553 | (8) |
|
|
553 | (3) |
|
|
556 | (5) |
Appendix B Previous Floating-Point Standards |
|
561 | (8) |
|
B.1 The IEEE 754-1985 Standard |
|
|
561 | (3) |
|
B.1.1 Formats specified by IEEE 754-1985 |
|
|
561 | (1) |
|
B.1.2 Rounding modes specified by IEEE 754-1985 |
|
|
562 | (1) |
|
B.1.3 Operations specified by IEEE 754-1985 |
|
|
562 | (1) |
|
B.1.4 Exceptions specified by IEEE 754-1985 |
|
|
563 | (1) |
|
B.2 The IEEE 854-1987 Standard |
|
|
564 | (2) |
|
B.2.1 Constraints internal to a format |
|
|
564 | (1) |
|
B.2.2 Various formats and the constraints between them |
|
|
565 | (1) |
|
|
566 | (1) |
|
|
566 | (1) |
|
|
566 | (1) |
|
|
566 | (1) |
|
B.3 The Need for a Revision |
|
|
566 | (1) |
|
B.4 The IEEE 754-2008 Revision |
|
|
567 | (2) |
Bibliography |
|
569 | (52) |
Index |
|
621 | |