Preface |
|
xv | |
|
|
xvii | |
|
|
xxi | |
|
I Introduction, Basic Definitions, and Standards |
|
|
1 | (116) |
|
|
3 | (10) |
|
|
3 | (3) |
|
|
6 | (1) |
|
|
7 | (6) |
|
|
7 | (1) |
|
|
8 | (5) |
|
Definitions and Basic Notions |
|
|
13 | (42) |
|
|
13 | (7) |
|
|
20 | (5) |
|
|
20 | (2) |
|
|
22 | (1) |
|
Relative error due to rounding |
|
|
23 | (2) |
|
|
25 | (2) |
|
Lost or Preserved Properties of the Arithmetic on the Real Numbers |
|
|
27 | (2) |
|
Note on the Choice of the Radix |
|
|
29 | (3) |
|
|
29 | (1) |
|
|
30 | (2) |
|
Tools for Manipulating Floating-Point Errors |
|
|
32 | (8) |
|
|
32 | (5) |
|
Errors in ulps and relative errors |
|
|
37 | (1) |
|
An example: iterated products |
|
|
37 | (2) |
|
|
39 | (1) |
|
|
40 | (11) |
|
Conditions on the formats |
|
|
40 | (3) |
|
|
43 | (8) |
|
The Fused Multiply-Add (FMA) Instruction |
|
|
51 | (1) |
|
|
51 | (4) |
|
Intervals with floating-point bounds |
|
|
52 | (1) |
|
|
52 | (3) |
|
Floating-Point Formats and Environment |
|
|
55 | (62) |
|
The IEEE 754-1985 Standard |
|
|
56 | (14) |
|
Formats specified by IEEE 754-1985 |
|
|
56 | (4) |
|
Little-endian, big-endian |
|
|
60 | (1) |
|
Rounding modes specified by IEEE 754-1985 |
|
|
61 | (1) |
|
Operations specified by IEEE 754-1985 |
|
|
62 | (4) |
|
Exceptions specified by IEEE 754-1985 |
|
|
66 | (3) |
|
|
69 | (1) |
|
The IEEE 854-1987 Standard |
|
|
70 | (4) |
|
Constraints internal to a format |
|
|
70 | (1) |
|
Various formats and the constraints between them |
|
|
71 | (1) |
|
Conversions between floating-point numbers and decimal strings |
|
|
72 | (1) |
|
|
73 | (1) |
|
|
73 | (1) |
|
|
74 | (1) |
|
|
74 | (1) |
|
|
74 | (5) |
|
A typical problem: ``double rounding'' |
|
|
75 | (2) |
|
|
77 | (2) |
|
The New IEEE 754-2008 Standard |
|
|
79 | (25) |
|
Formats specified by the revised standard |
|
|
80 | (1) |
|
Binary interchange format encodings |
|
|
81 | (1) |
|
Decimal interchange format encodings |
|
|
82 | (10) |
|
|
92 | (1) |
|
Extended and extendable precisions |
|
|
92 | (1) |
|
|
93 | (4) |
|
Operations specified by the standard |
|
|
97 | (2) |
|
|
99 | (1) |
|
|
99 | (1) |
|
Default exception handling |
|
|
100 | (3) |
|
Recommended transcendental functions |
|
|
103 | (1) |
|
Floating-Point Hardware in Current Processors |
|
|
104 | (4) |
|
The common hardware denominator |
|
|
104 | (1) |
|
|
104 | (1) |
|
|
104 | (1) |
|
Rounding and precision control |
|
|
105 | (1) |
|
|
106 | (1) |
|
Floating-point on x86 processors: SSE2 versus x87 |
|
|
106 | (1) |
|
|
107 | (1) |
|
Floating-Point Hardware in Recent Graphics Processing Units |
|
|
108 | (1) |
|
Relations with Programming Languages |
|
|
109 | (1) |
|
The Language Independent Arithmetic (LIA) standard |
|
|
109 | (1) |
|
|
110 | (1) |
|
|
110 | (7) |
|
|
111 | (1) |
|
|
111 | (4) |
|
|
115 | (1) |
|
|
116 | (1) |
|
|
116 | (1) |
|
|
116 | (1) |
|
II Cleverly Using Floating-Point Arithmetic |
|
|
117 | (120) |
|
Basic Properties and Algorithms |
|
|
119 | (32) |
|
Testing the Computational Environment |
|
|
119 | (3) |
|
|
119 | (2) |
|
|
121 | (1) |
|
|
122 | (3) |
|
|
122 | (2) |
|
Exact multiplications and divisions |
|
|
124 | (1) |
|
Accurate Computations of Sums of Two Numbers |
|
|
125 | (7) |
|
|
126 | (3) |
|
|
129 | (2) |
|
If we do not use rounding to nearest |
|
|
131 | (1) |
|
|
132 | (7) |
|
|
132 | (3) |
|
Dekker's multiplication algorithm |
|
|
135 | (4) |
|
|
139 | (12) |
|
|
140 | (1) |
|
Error bound for complex multiplication |
|
|
141 | (3) |
|
|
144 | (5) |
|
|
149 | (2) |
|
The Fused Multiply-Add Instruction |
|
|
151 | (30) |
|
|
152 | (1) |
|
Computation of Residuals of Division and Square Root |
|
|
153 | (2) |
|
Newton-Raphson-Based Division with an FMA |
|
|
155 | (12) |
|
Variants of the Newton--Raphson iteration |
|
|
155 | (5) |
|
Using the Newton--Raphson iteration for correctly rounded division |
|
|
160 | (7) |
|
Newton--Raphson-Based Square Root with an FMA |
|
|
167 | (4) |
|
|
167 | (1) |
|
Using the Newton--Raphson iteration for correctly rounded square roots |
|
|
168 | (3) |
|
Multiplication by an Arbitrary-Precision Constant |
|
|
171 | (4) |
|
Checking for a given constant C if Algorithm 5.2 will always work |
|
|
172 | (3) |
|
Evaluation of the Error of an FMA |
|
|
175 | (2) |
|
Evaluation of Integer Powers |
|
|
177 | (4) |
|
Enhanced Floating-Point Sums, Dot Products, and Polynomial Values |
|
|
181 | (24) |
|
|
182 | (6) |
|
Floating-point arithmetic models |
|
|
183 | (1) |
|
Notation for error analysis and classical error estimates |
|
|
184 | (3) |
|
Properties for deriving running error bounds |
|
|
187 | (1) |
|
Computing Validated Running Error Bounds |
|
|
188 | (2) |
|
Computing Sums More Accurately |
|
|
190 | (11) |
|
Reordering the operands, and a bit more |
|
|
190 | (2) |
|
|
192 | (7) |
|
Implementing a ``long accumulator'' |
|
|
199 | (1) |
|
On the sum of three floating-point numbers |
|
|
199 | (2) |
|
|
201 | (2) |
|
Compensated Polynomial Evaluation |
|
|
203 | (2) |
|
|
205 | (32) |
|
|
205 | (4) |
|
Floating-point evaluation in programming languages |
|
|
206 | (2) |
|
Processors, compilers, and operating systems |
|
|
208 | (1) |
|
In the hands of the programmer |
|
|
209 | (1) |
|
Floating Point in the C Language |
|
|
209 | (11) |
|
Standard C99 headers and IEEE 754--1985 support |
|
|
209 | (1) |
|
|
210 | (3) |
|
|
213 | (3) |
|
|
216 | (1) |
|
Enabling unsafe optimizations |
|
|
217 | (1) |
|
Summary: a few horror stories |
|
|
218 | (2) |
|
Floating-Point Arithmetic in the C++ Language |
|
|
220 | (3) |
|
|
220 | (1) |
|
|
221 | (1) |
|
|
222 | (1) |
|
FORTRAN Floating Point in a Nutshell |
|
|
223 | (4) |
|
|
223 | (3) |
|
IEEE 754 support in FORTRAN |
|
|
226 | (1) |
|
Java Floating Point in a Nutshell |
|
|
227 | (7) |
|
|
227 | (1) |
|
|
228 | (2) |
|
Infinities, NaNs, and signed zeros |
|
|
230 | (1) |
|
|
231 | (1) |
|
|
232 | (1) |
|
|
233 | (1) |
|
|
234 | (3) |
|
III Implementing Floating-Point Operators |
|
|
237 | (136) |
|
Algorithms for the Five Basic Operations |
|
|
239 | (30) |
|
Overview of Basic Operation Implementation |
|
|
239 | (2) |
|
Implementing IEEE 754-2008 Rounding |
|
|
241 | (5) |
|
Rounding a nonzero finite value with unbounded exponent range |
|
|
241 | (2) |
|
|
243 | (1) |
|
Underflow and subnormal results |
|
|
244 | (1) |
|
|
245 | (1) |
|
Rounding for actual operations |
|
|
245 | (1) |
|
Floating-Point Addition and Subtraction |
|
|
246 | (5) |
|
|
249 | (1) |
|
Decimal addition using binary encoding |
|
|
250 | (1) |
|
Subnormal inputs and outputs in binary addition |
|
|
251 | (1) |
|
Floating-Point Multiplication |
|
|
251 | (3) |
|
|
252 | (1) |
|
Handling subnormal numbers in binary multiplication |
|
|
252 | (1) |
|
|
253 | (1) |
|
Floating-Point Fused Multiply-Add |
|
|
254 | (8) |
|
Case analysis for normal inputs |
|
|
254 | (4) |
|
Handling subnormal inputs |
|
|
258 | (1) |
|
|
259 | (1) |
|
Overview of a binary FMA implementation |
|
|
259 | (3) |
|
|
262 | (3) |
|
Overview and special cases |
|
|
262 | (1) |
|
Computing the significand quotient |
|
|
263 | (1) |
|
Managing subnormal numbers |
|
|
264 | (1) |
|
|
265 | (1) |
|
|
265 | (1) |
|
Floating-Point Square Root |
|
|
265 | (4) |
|
Overview and special cases |
|
|
265 | (1) |
|
Computing the significand square root |
|
|
266 | (1) |
|
Managing subnormal numbers |
|
|
267 | (1) |
|
|
267 | (1) |
|
|
267 | (2) |
|
Hardware Implementation of Floating-Point Arithmetic |
|
|
269 | (52) |
|
|
269 | (5) |
|
Processor internal formats |
|
|
269 | (1) |
|
Hardware handling of subnormal numbers |
|
|
270 | (1) |
|
Full-custom VLSI versus reconfigurable circuits |
|
|
271 | (1) |
|
Hardware decimal arithmetic |
|
|
272 | (1) |
|
|
273 | (1) |
|
The Primitives and Their Cost |
|
|
274 | (14) |
|
|
274 | (6) |
|
Digit-by-integer multiplication in hardware |
|
|
280 | (1) |
|
Using nonstandard representations of numbers |
|
|
280 | (1) |
|
Binary integer multiplication |
|
|
281 | (2) |
|
Decimal integer multiplication |
|
|
283 | (1) |
|
|
284 | (1) |
|
|
284 | (2) |
|
Tables and table-based methods for fixed-point function approximation |
|
|
286 | (2) |
|
Binary Floating-Point Addition |
|
|
288 | (8) |
|
|
288 | (1) |
|
A first dual-path architecture |
|
|
289 | (2) |
|
Leading-zero anticipation |
|
|
291 | (4) |
|
Probing further on floating-point adders |
|
|
295 | (1) |
|
Binary Floating-Point Multiplication |
|
|
296 | (6) |
|
|
296 | (1) |
|
|
296 | (2) |
|
VLSI implementation optimized for delay |
|
|
298 | (3) |
|
|
301 | (1) |
|
Binary Fused Multiply-Add |
|
|
302 | (3) |
|
|
303 | (2) |
|
|
305 | (1) |
|
|
305 | (4) |
|
Digit-recurrence division |
|
|
306 | (3) |
|
|
309 | (1) |
|
Conclusion: Beyond the FPU |
|
|
309 | (11) |
|
Optimization in context of standard operators |
|
|
310 | (1) |
|
Operation with a constant operand |
|
|
311 | (2) |
|
|
313 | (1) |
|
Specific architectures for accumulation |
|
|
313 | (4) |
|
|
317 | (3) |
|
|
320 | (1) |
|
Software Implementation of Floating-Point Arithmetic |
|
|
321 | (52) |
|
|
322 | (7) |
|
Standard encoding of binary floating-point data |
|
|
322 | (1) |
|
Available integer operators |
|
|
323 | (3) |
|
|
326 | (2) |
|
Design choices and optimizations |
|
|
328 | (1) |
|
Binary Floating-Point Addition |
|
|
329 | (12) |
|
|
330 | (2) |
|
Computing the sign of the result |
|
|
332 | (1) |
|
Swapping the operands and computing the alignment shift |
|
|
333 | (2) |
|
Getting the correctly rounded result |
|
|
335 | (6) |
|
Binary Floating-Point Multiplication |
|
|
341 | (8) |
|
|
341 | (2) |
|
Sign and exponent computation |
|
|
343 | (2) |
|
|
345 | (1) |
|
Getting the correctly rounded result |
|
|
346 | (3) |
|
Binary Floating-Point Division |
|
|
349 | (12) |
|
|
350 | (1) |
|
Sign and exponent computation |
|
|
351 | (3) |
|
|
354 | (1) |
|
Getting the correctly rounded result |
|
|
355 | (6) |
|
Binary Floating-Point Square Root |
|
|
361 | (12) |
|
|
362 | (2) |
|
|
364 | (1) |
|
Getting the correctly rounded result |
|
|
365 | (8) |
|
|
373 | (85) |
|
Evaluating Floating-Point Elementary Functions |
|
|
375 | (30) |
|
Basic Range Reduction Algorithms |
|
|
379 | (3) |
|
Cody and Waite's reduction algorithm |
|
|
379 | (2) |
|
Payne and Hanek's algorithm |
|
|
381 | (1) |
|
Bounding the Relative Error of Range Reduction |
|
|
382 | (2) |
|
More Sophisticated Range Reduction Algorithms |
|
|
384 | (4) |
|
An example of range reduction for the exponential function |
|
|
386 | (1) |
|
An example of range reduction for the logarithm |
|
|
387 | (1) |
|
Polynomial or Rational Approximations |
|
|
388 | (5) |
|
|
389 | (1) |
|
|
390 | (2) |
|
``Truncated'' approximations |
|
|
392 | (1) |
|
|
393 | (1) |
|
Correct Rounding of Elementary Functions to binary64 |
|
|
394 | (8) |
|
The Table Maker's Dilemma and Ziv's onion peeling strategy |
|
|
394 | (1) |
|
|
395 | (1) |
|
|
396 | (4) |
|
|
400 | (1) |
|
Error analysis and the accuracy/performance tradeoff |
|
|
401 | (1) |
|
|
402 | (3) |
|
The point with efficient code |
|
|
402 | (1) |
|
Example: a ``double-double'' polynomial evaluation |
|
|
403 | (2) |
|
Solving the Table Maker's Dilemma |
|
|
405 | (53) |
|
|
405 | (7) |
|
The Table Maker's Dilemma |
|
|
406 | (4) |
|
|
410 | (1) |
|
Organization of the chapter |
|
|
411 | (1) |
|
Preliminary Remarks on the Table Maker's Dilemma |
|
|
412 | (8) |
|
Statistical arguments: what can be expected in practice |
|
|
412 | (4) |
|
In some domains, there is no need to find worst cases |
|
|
416 | (3) |
|
Deducing the worst cases from other functions or domains |
|
|
419 | (1) |
|
The Table Maker's Dilemma for Algebraic Functions |
|
|
420 | (5) |
|
Algebraic and transcendental numbers and functions |
|
|
420 | (2) |
|
The elementary case of quotients |
|
|
422 | (2) |
|
Around Liouville's theorem |
|
|
424 | (1) |
|
Generating bad rounding cases for the square root using Hensel 2-adic lifting |
|
|
425 | (23) |
|
Solving the Table Maker's Dilemma for Arbitrary Functions |
|
|
429 | (1) |
|
Lindemann's theorem: application to some transcendental functions |
|
|
429 | (1) |
|
A theorem of Nesterenko and Waldschmidt |
|
|
430 | (2) |
|
A first method: tabulated differences |
|
|
432 | (2) |
|
From the TMD to the distance between a grid and a segment |
|
|
434 | (2) |
|
Linear approximation: Lefevre's algorithm |
|
|
436 | (7) |
|
|
443 | (5) |
|
Periodic functions on large arguments |
|
|
448 | (10) |
|
|
449 | (1) |
|
Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions |
|
|
449 | (9) |
|
A special case: integer powers |
|
|
458 | (1) |
|
V Current Limits and Perspectives |
|
|
458 | (59) |
|
|
461 | (27) |
|
Formalisms for Certifying Floating-Point Algorithms |
|
|
463 | (7) |
|
Formalizing Floating-Point Arithmetic |
|
|
463 | (1) |
|
Defining floating-point numbers |
|
|
464 | (2) |
|
Simplifying the definition |
|
|
466 | (1) |
|
Defining rounding operators |
|
|
467 | (3) |
|
Extending the set of numbers |
|
|
470 | (3) |
|
Formalisms for Certifying Algorithms by Hand |
|
|
471 | (1) |
|
|
471 | (1) |
|
|
472 | (1) |
|
|
473 | (10) |
|
|
474 | (1) |
|
|
475 | (2) |
|
|
477 | (2) |
|
|
479 | (4) |
|
Handling the relative error |
|
|
483 | (5) |
|
|
484 | (1) |
|
Toy implementation of sine |
|
|
484 | (4) |
|
Integer division on Itanium |
|
|
488 | (29) |
|
|
493 | (7) |
|
Double-Words, Triple-Words |
|
|
494 | (1) |
|
|
495 | (3) |
|
Static triple-word arithmetic |
|
|
498 | (2) |
|
|
500 | (3) |
|
Floating-Point Expansions |
|
|
503 | (6) |
|
Floating-Point Numbers with Batched Additional Exponent |
|
|
509 | (8) |
|
Large Precision Relying on Processor Integers |
|
|
510 | (2) |
|
Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic |
|
|
512 | (1) |
|
A brief introduction to arbitrary-precision integer arithmetic |
|
|
513 | (4) |
|
VI Perspectives and Appendix |
|
|
517 | (12) |
|
Conclusion and Perspectives |
|
|
519 | (2) |
|
Appendix: Number Theory Tools for Floating-Point Arithmetic |
|
|
521 | (8) |
|
|
521 | (3) |
|
|
524 | (5) |
Bibliography |
|
529 | (38) |
Index |
|
567 | |