  • Format: PDF+DRM
  • Publication date: 02-May-2018
  • Publisher: Birkhauser Verlag AG
  • Language: English
  • ISBN-13: 9783319765266

DRM restrictions

  • Copying: not allowed
  • Printing: not allowed
  • E-book usage:

    Digital rights management (DRM)
    The publisher has supplied this book in encrypted form, which means that you need to install free software in order to unlock and read it. To read this e-book you must create an Adobe ID. More information here. The e-book can be downloaded to up to 6 devices (one user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you need to install this free app: PocketBook Reader (iOS / Android).

    To read this e-book on a PC or Mac, you need Adobe Digital Editions (a free application designed specifically for e-books; it is not the same as Adobe Reader, which you probably already have on your computer).

    This e-book cannot be read on an Amazon Kindle.

This handbook is a definitive guide to the effective use of modern floating-point arithmetic, which has evolved considerably, from the frequently inconsistent floating-point number systems of early computing to the recent IEEE 754-2008 standard. Most of computational mathematics depends on floating-point numbers, and understanding their various implementations will allow readers to develop programs specifically tailored to the standard's technical features. Algorithms for floating-point arithmetic are presented throughout the book and, where possible, illustrated by example programs that show how these techniques appear in actual coding and design.

The volume breaks its core topic into four parts: the basic concepts and history of floating-point arithmetic; methods of analyzing floating-point algorithms and optimizing them; implementations of IEEE 754-2008 in hardware and software; and useful extensions to the standard floating-point system, such as interval arithmetic, double- and triple-word arithmetic, operations on complex numbers, and formal verification of floating-point algorithms. This new edition updates the chapters to reflect recent changes to programming languages and compilers and the growing prevalence of GPUs, and adds material on the fused multiply-add instruction and on methods of extending floating-point precision.

As supercomputing becomes more common, more numerical engineers will need to choose number representations that account for trade-offs between parameters such as speed, accuracy, and energy consumption. The Handbook of Floating-Point Arithmetic is designed for students and researchers in numerical analysis, programmers of numerical algorithms, compiler designers, and designers of arithmetic operators.
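For a taste of the material, here is a minimal C sketch (ours, not taken from the book) of the classic Fast2Sum error-free transformation, one of the algorithms the book covers in Chapter 4. It assumes round-to-nearest binary64 arithmetic with no extended-precision intermediates, and |a| >= |b|; under those conditions the rounding error of a floating-point addition can itself be recovered exactly.

    #include <stdio.h>

    /* Fast2Sum (T. J. Dekker): assuming round-to-nearest binary64 and
       |a| >= |b|, computes s = fl(a+b) together with the exact rounding
       error t, so that a + b == s + t holds exactly. */
    static void fast2sum(double a, double b, double *s, double *t)
    {
        *s = a + b;
        double z = *s - a;   /* the part of b that made it into s */
        *t = b - z;          /* the part of b that was rounded away */
    }

    int main(void)
    {
        double s, t;
        fast2sum(1.0e16, 1.0, &s, &t);  /* 1.0 is too small to survive the naive sum */
        printf("s = %.17g, t = %.17g\n", s, t);
        return 0;
    }

Compiled with strict IEEE 754 semantics, this prints s = 10000000000000000 and t = 1: the naive sum silently loses the 1.0, while Fast2Sum recovers it exactly.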

Reviews

"The new edition of this book updates chapters to reflect recent changes to programming languages and compilers and the new prevalence of Graphics Processing Units in recent years. In the Appendix, the reader will find an introduction to relevant number theory tools. This book is designed for programmers of numerical applications and, more generally, students and researchers in numerical analysis who wish to more accurately understand a tool that they manipulate on an everyday basis." (T. C. Mohan, zbMATH 1394.65001, 2018)

List of Figures xv
List of Tables xix
Preface xxiii

Part I Introduction, Basic Definitions, and Standards 1(94)
1 Introduction 3(12)
1.1 Some History 4(3)
1.2 Desirable Properties 7(1)
1.3 Some Strange Behaviors 8(7)
1.3.1 Some famous bugs 8(1)
1.3.2 Difficult problems 9(6)
2 Definitions and Basic Notions 15(32)
2.1 Floating-Point Numbers 15(7)
2.1.1 Main definitions 15(2)
2.1.2 Normalized representations, normal and subnormal numbers 17(2)
2.1.3 A note on underflow 19(2)
2.1.4 Special floating-point data 21(1)
2.2 Rounding 22(3)
2.2.1 Rounding functions 22(2)
2.2.2 Useful properties 24(1)
2.3 Tools for Manipulating Floating-Point Errors 25(12)
2.3.1 Relative error due to rounding 25(4)
2.3.2 The ulp function 29(5)
2.3.3 Link between errors in ulps and relative errors 34(1)
2.3.4 An example: iterated products 35(2)
2.4 The Fused Multiply-Add (FMA) Instruction 37(1)
2.5 Exceptions 37(3)
2.6 Lost and Preserved Properties of Real Arithmetic 40(1)
2.7 Note on the Choice of the Radix 41(3)
2.7.1 Representation errors 41(2)
2.7.2 A case for radix 10 43(1)
2.8 Reproducibility 44(3)
3 Floating-Point Formats and Environment 47(48)
3.1 The IEEE 754-2008 Standard 48(31)
3.1.1 Formats 48(18)
3.1.2 Attributes and rounding 66(4)
3.1.3 Operations specified by the standard 70(2)
3.1.4 Comparisons 72(1)
3.1.5 Conversions to/from string representations 73(1)
3.1.6 Default exception handling 74(3)
3.1.7 Special values 77(2)
3.1.8 Recommended functions 79(1)
3.2 On the Possible Hidden Use of a Higher Internal Precision 79(3)
3.3 Revision of the IEEE 754-2008 Standard 82(1)
3.4 Floating-Point Hardware in Current Processors 83(6)
3.4.1 The common hardware denominator 83(1)
3.4.2 Fused multiply-add 84(1)
3.4.3 Extended precision and 128-bit formats 85(1)
3.4.4 Rounding and precision control 85(1)
3.4.5 SIMD instructions 86(1)
3.4.6 Binary16 (half-precision) support 87(1)
3.4.7 Decimal arithmetic 87(1)
3.4.8 The legacy x87 processor 88(1)
3.5 Floating-Point Hardware in Recent Graphics Processing Units 89(1)
3.6 IEEE Support in Programming Languages 90(1)
3.7 Checking the Environment 91(4)
3.7.1 MACHAR 91(1)
3.7.2 Paranoia 92(1)
3.7.3 UCBTest 92(1)
3.7.4 TestFloat 92(1)
3.7.5 Miscellaneous 93(2)

Part II Cleverly Using Floating-Point Arithmetic 95(136)
4 Basic Properties and Algorithms 97(66)
4.1 Testing the Computational Environment 97(3)
4.1.1 Computing the radix 97(2)
4.1.2 Computing the precision 99(1)
4.2 Exact Operations 100(3)
4.2.1 Exact addition 100(3)
4.2.2 Exact multiplications and divisions 103(1)
4.3 Accurate Computation of the Sum of Two Numbers 103(8)
4.3.1 The Fast2Sum algorithm 104(3)
4.3.2 The 2Sum algorithm 107(2)
4.3.3 If we do not use rounding to nearest 109(2)
4.4 Accurate Computation of the Product of Two Numbers 111(9)
4.4.1 The 2MultFMA Algorithm 112(1)
4.4.2 If no FMA instruction is available: Veltkamp splitting and Dekker product 113(7)
4.5 Computation of Residuals of Division and Square Root with an FMA 120(3)
4.6 Another splitting technique: splitting around a power of 2 123(1)
4.7 Newton-Raphson-Based Division with an FMA 124(14)
4.7.1 Variants of the Newton-Raphson iteration 124(5)
4.7.2 Using the Newton-Raphson iteration for correctly rounded division with an FMA 129(7)
4.7.3 Possible double roundings in division algorithms 136(2)
4.8 Newton-Raphson-Based Square Root with an FMA 138(4)
4.8.1 The basic iterations 138(1)
4.8.2 Using the Newton-Raphson iteration for correctly rounded square roots 138(4)
4.9 Radix Conversion 142(11)
4.9.1 Conditions on the formats 142(4)
4.9.2 Conversion algorithms 146(7)
4.10 Conversion Between Integers and Floating-Point Numbers 153(3)
4.10.1 From 32-bit integers to floating-point numbers 153(1)
4.10.2 From 64-bit integers to floating-point numbers 154(1)
4.10.3 From floating-point numbers to integers 155(1)
4.11 Multiplication by an Arbitrary-Precision Constant with an FMA 156(4)
4.12 Evaluation of the Error of an FMA 160(3)
5 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values 163(30)
5.1 Preliminaries 164(11)
5.1.1 Floating-point arithmetic models 165(1)
5.1.2 Notation for error analysis and classical error estimates 166(3)
5.1.3 Some refined error estimates 169(5)
5.1.4 Properties for deriving validated running error bounds 174(1)
5.2 Computing Validated Running Error Bounds 175(2)
5.3 Computing Sums More Accurately 177(12)
5.3.1 Reordering the operands, and a bit more 177(1)
5.3.2 Compensated sums 178(6)
5.3.3 Summation algorithms that somehow imitate a fixed-point arithmetic 184(3)
5.3.4 On the sum of three floating-point numbers 187(2)
5.4 Compensated Dot Products 189(1)
5.5 Compensated Polynomial Evaluation 190(3)
6 Languages and Compilers 193(38)
6.1 A Play with Many Actors 193(7)
6.1.1 Floating-point evaluation in programming languages 194(2)
6.1.2 Processors, compilers, and operating systems 196(1)
6.1.3 Standardization processes 197(2)
6.1.4 In the hands of the programmer 199(1)
6.2 Floating Point in the C Language 200(15)
6.2.1 Standard C11 headers and IEEE 754-1985 support 201(1)
6.2.2 Types 202(2)
6.2.3 Expression evaluation 204(5)
6.2.4 Code transformations 209(1)
6.2.5 Enabling unsafe optimizations 210(1)
6.2.6 Summary: a few horror stories 211(3)
6.2.7 The CompCert C compiler 214(1)
6.3 Floating-Point Arithmetic in the C++ Language 215(3)
6.3.1 Semantics 215(1)
6.3.2 Numeric limits 215(2)
6.3.3 Overloaded functions 217(1)
6.4 FORTRAN Floating Point in a Nutshell 218(4)
6.4.1 Philosophy 218(2)
6.4.2 IEEE 754 support in FORTRAN 220(2)
6.5 Java Floating Point in a Nutshell 222(7)
6.5.1 Philosophy 222(1)
6.5.2 Types and classes 222(2)
6.5.3 Infinities, NaNs, and signed zeros 224(1)
6.5.4 Missing features 225(1)
6.5.5 Reproducibility 226(1)
6.5.6 The BigDecimal package 227(2)
6.6 Conclusion 229(2)

Part III Implementing Floating-Point Operators 231(204)
7 Algorithms for the Basic Operations 233(34)
7.1 Overview of Basic Operation Implementation 233(2)
7.2 Implementing IEEE 754-2008 Rounding 235(5)
7.2.1 Rounding a nonzero finite value with unbounded exponent range 235(3)
7.2.2 Overflow 238(1)
7.2.3 Underflow and subnormal results 239(1)
7.2.4 The inexact exception 239(1)
7.2.5 Rounding for actual operations 240(1)
7.3 Floating-Point Addition and Subtraction 240(5)
7.3.1 Decimal addition 244(1)
7.3.2 Decimal addition using binary encoding 245(1)
7.3.3 Subnormal inputs and outputs in binary addition 245(1)
7.4 Floating-Point Multiplication 245(3)
7.4.1 Normal case 246(1)
7.4.2 Handling subnormal numbers in binary multiplication 247(1)
7.4.3 Decimal specifics 248(1)
7.5 Floating-Point Fused Multiply-Add 248(8)
7.5.1 Case analysis for normal inputs 249(3)
7.5.2 Handling subnormal inputs 252(1)
7.5.3 Handling decimal cohorts 253(1)
7.5.4 Overview of a binary FMA implementation 254(2)
7.6 Floating-Point Division 256(3)
7.6.1 Overview and special cases 256(1)
7.6.2 Computing the significand quotient 257(1)
7.6.3 Managing subnormal numbers 258(1)
7.6.4 The inexact exception 259(1)
7.6.5 Decimal specifics 259(1)
7.7 Floating-Point Square Root 259(2)
7.7.1 Overview and special cases 259(1)
7.7.2 Computing the significand square root 260(1)
7.7.3 Managing subnormal numbers 260(1)
7.7.4 The inexact exception 261(1)
7.7.5 Decimal specifics 261(1)
7.8 Nonhomogeneous Operators 261(6)
7.8.1 A software algorithm around double rounding 262(2)
7.8.2 The mixed-precision fused multiply-and-add 264(1)
7.8.3 Motivation 265(1)
7.8.4 Implementation issues 265(2)
8 Hardware Implementation of Floating-Point Arithmetic 267(54)
8.1 Introduction and Context 267(5)
8.1.1 Processor internal formats 267(1)
8.1.2 Hardware handling of subnormal numbers 268(1)
8.1.3 Full-custom VLSI versus reconfigurable circuits (FPGAs) 269(1)
8.1.4 Hardware decimal arithmetic 270(1)
8.1.5 Pipelining 271(1)
8.2 The Primitives and Their Cost 272(15)
8.2.1 Integer adders 272(6)
8.2.2 Digit-by-integer multiplication in hardware 278(1)
8.2.3 Using nonstandard representations of numbers 278(2)
8.2.4 Binary integer multiplication 280(1)
8.2.5 Decimal integer multiplication 280(2)
8.2.6 Shifters 282(1)
8.2.7 Leading-zero counters 283(2)
8.2.8 Tables and table-based methods for fixed-point function approximation 285(2)
8.3 Binary Floating-Point Addition 287(8)
8.3.1 Overview 287(1)
8.3.2 A first dual-path architecture 288(2)
8.3.3 Leading-zero anticipation 290(4)
8.3.4 Probing further on floating-point adders 294(1)
8.4 Binary Floating-Point Multiplication 295(6)
8.4.1 Basic architecture 295(1)
8.4.2 FPGA implementation 296(1)
8.4.3 VLSI implementation optimized for delay 297(3)
8.4.4 Managing subnormals 300(1)
8.5 Binary Fused Multiply-Add 301(3)
8.5.1 Classic architecture 301(1)
8.5.2 To probe further 302(2)
8.6 Division and Square Root 304(5)
8.6.1 Digit-recurrence division 305(3)
8.6.2 Decimal division 308(1)
8.7 Beyond the Classical Floating-Point Unit 309(3)
8.7.1 More fused operators 309(1)
8.7.2 Exact accumulation and dot product 309(2)
8.7.3 Hardware-accelerated compensated algorithms 311(1)
8.8 Floating-Point for FPGAs 312(8)
8.8.1 Optimization in context of standard operators 312(2)
8.8.2 Operations with a constant operand 314(1)
8.8.3 Computing large floating-point sums 315(4)
8.8.4 Block floating point 319(1)
8.8.5 Algebraic operators 319(1)
8.8.6 Elementary and compound functions 320(1)
8.9 Probing Further 320(1)
9 Software Implementation of Floating-Point Arithmetic 321(54)
9.1 Implementation Context 322(7)
9.1.1 Standard encoding of binary floating-point data 322(1)
9.1.2 Available integer operators 323(3)
9.1.3 First examples 326(2)
9.1.4 Design choices and optimizations 328(1)
9.2 Binary Floating-Point Addition 329(12)
9.2.1 Handling special values 330(2)
9.2.2 Computing the sign of the result 332(1)
9.2.3 Swapping the operands and computing the alignment shift 333(2)
9.2.4 Getting the correctly rounded result 335(6)
9.3 Binary Floating-Point Multiplication 341(8)
9.3.1 Handling special values 341(2)
9.3.2 Sign and exponent computation 343(2)
9.3.3 Overflow detection 345(1)
9.3.4 Getting the correctly rounded result 346(3)
9.4 Binary Floating-Point Division 349(13)
9.4.1 Handling special values 350(1)
9.4.2 Sign and exponent computation 351(3)
9.4.3 Overflow detection 354(1)
9.4.4 Getting the correctly rounded result 355(7)
9.5 Binary Floating-Point Square Root 362(10)
9.5.1 Handling special values 362(1)
9.5.2 Exponent computation 363(2)
9.5.3 Getting the correctly rounded result 365(7)
9.6 Custom Operators 372(3)
10 Evaluating Floating-Point Elementary Functions 375(60)
10.1 Introduction 375(4)
10.1.1 Which accuracy? 375(1)
10.1.2 The various steps of function evaluation 376(3)
10.2 Range Reduction 379(10)
10.2.1 Basic range reduction algorithms 379(3)
10.2.2 Bounding the relative error of range reduction 382(1)
10.2.3 More sophisticated range reduction algorithms 383(3)
10.2.4 Examples 386(3)
10.3 Polynomial Approximations 389(7)
10.3.1 L2 case 390(1)
10.3.2 L∞, or minimax, case 391(3)
10.3.3 "Truncated" approximations 394(1)
10.3.4 In practice: using the Sollya tool to compute constrained approximations and certified error bounds 394(2)
10.4 Evaluating Polynomials 396(1)
10.4.1 Evaluation strategies 396(1)
10.4.2 Evaluation error 397(1)
10.5 The Table Maker's Dilemma 397(30)
10.5.1 When there is no need to solve the TMD 400(1)
10.5.2 On breakpoints 400(4)
10.5.3 Finding the hardest-to-round points 404(23)
10.6 Some Implementation Tricks Used in the CRlibm Library 427(8)
10.6.1 Rounding test 428(1)
10.6.2 Accurate second step 429(1)
10.6.3 Error analysis and the accuracy/performance tradeoff 429(1)
10.6.4 The point with efficient code 430(5)

Part IV Extensions 435(118)
11 Complex Numbers 437(16)
11.1 Introduction 437(2)
11.2 Componentwise and Normwise Errors 439(1)
11.3 Computing ad ± bc with an FMA 440(2)
11.4 Complex Multiplication 442(1)
11.4.1 Complex multiplication without an FMA instruction 442(1)
11.4.2 Complex multiplication with an FMA instruction 442(1)
11.5 Complex Division 443(4)
11.5.1 Error bounds for complex division 443(1)
11.5.2 Scaling methods for avoiding over-/underflow in complex division 444(3)
11.6 Complex Absolute Value 447(2)
11.6.1 Error bounds for complex absolute value 447(1)
11.6.2 Scaling for the computation of complex absolute value 447(2)
11.7 Complex Square Root 449(2)
11.7.1 Error bounds for complex square root 449(1)
11.7.2 Scaling techniques for complex square root 450(1)
11.8 An Alternative Solution: Exception Handling 451(2)
12 Interval Arithmetic 453(26)
12.1 Introduction to Interval Arithmetic 454(3)
12.1.1 Definitions and the inclusion property 454(2)
12.1.2 Loss of algebraic properties 456(1)
12.2 The IEEE 1788-2015 Standard for Interval Arithmetic 457(8)
12.2.1 Structuration into levels 457(1)
12.2.2 Flavors 458(2)
12.2.3 Decorations 460(3)
12.2.4 Level 2: discretization issues 463(1)
12.2.5 Exact dot product 464(1)
12.2.6 Levels 3 and 4: implementation issues 464(1)
12.2.7 Libraries implementing IEEE 1788-2015 464(1)
12.3 Intervals with Floating-Point Bounds 465(3)
12.3.1 Implementation using floating-point arithmetic 465(1)
12.3.2 Difficulties 465(2)
12.3.3 Optimized rounding 467(1)
12.4 Interval Arithmetic and Roundoff Error Analysis 468(8)
12.4.1 Influence of the computing precision 468(3)
12.4.2 A more efficient approach: the mid-rad representation 471(4)
12.4.3 Variants: affine arithmetic, Taylor models 475(1)
12.5 Further Readings 476(3)
13 Verifying Floating-Point Algorithms 479(34)
13.1 Formalizing Floating-Point Arithmetic 479(8)
13.1.1 Defining floating-point numbers 480(2)
13.1.2 Simplifying the definition 482(1)
13.1.3 Defining rounding operators 483(3)
13.1.4 Extending the set of numbers 486(1)
13.2 Formalisms for Verifying Algorithms 487(6)
13.2.1 Hardware units 487(2)
13.2.2 Floating-point algorithms 489(1)
13.2.3 Automating proofs 490(3)
13.3 Roundoff Errors and the Gappa Tool 493(20)
13.3.1 Computing on bounds 494(2)
13.3.2 Counting digits 496(2)
13.3.3 Manipulating expressions 498(4)
13.3.4 Handling the relative error 502(1)
13.3.5 Example: toy implementation of sine 503(5)
13.3.6 Example: integer division on Itanium 508(5)
14 Extending the Precision 513(40)
14.1 Double-Words, Triple-Words 514(8)
14.1.1 Double-word arithmetic 515(5)
14.1.2 Static triple-word arithmetic 520(2)
14.2 Floating-Point Expansions 522(12)
14.2.1 Renormalization of floating-point expansions 525(1)
14.2.2 Addition of floating-point expansions 526(2)
14.2.3 Multiplication of floating-point expansions 528(3)
14.2.4 Division of floating-point expansions 531(3)
14.3 Floating-Point Numbers with Batched Additional Exponent 534(1)
14.4 Large Precision Based on a High-Radix Representation 535(18)
14.4.1 Specifications 536(1)
14.4.2 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic 537(1)
14.4.3 A brief introduction to arbitrary-precision integer arithmetic 538(3)
14.4.4 GNU MPFR 541(12)

Appendix A Number Theory Tools for Floating-Point Arithmetic 553(8)
A.1 Continued Fractions 553(3)
A.2 Euclidean Lattices 556(5)
Appendix B Previous Floating-Point Standards 561(8)
B.1 The IEEE 754-1985 Standard 561(3)
B.1.1 Formats specified by IEEE 754-1985 561(1)
B.1.2 Rounding modes specified by IEEE 754-1985 562(1)
B.1.3 Operations specified by IEEE 754-1985 562(1)
B.1.4 Exceptions specified by IEEE 754-1985 563(1)
B.2 The IEEE 854-1987 Standard 564(2)
B.2.1 Constraints internal to a format 564(1)
B.2.2 Various formats and the constraints between them 565(1)
B.2.3 Rounding 566(1)
B.2.4 Operations 566(1)
B.2.5 Comparisons 566(1)
B.2.6 Exceptions 566(1)
B.3 The Need for a Revision 566(1)
B.4 The IEEE 754-2008 Revision 567(2)
Bibliography 569(52)
Index 621
Jean-Michel Muller (coordinator), CNRS, Laboratoire LIP, AriC team
Nicolas Brunie, Kalray
Florent de Dinechin, INSA Lyon, Laboratoire CITI, Socrate team
Claude-Pierre Jeannerod, Inria, Laboratoire LIP, AriC team
Mioara Joldes, CNRS, LAAS, MAC team
Vincent Lefèvre, Inria, Laboratoire LIP, AriC team
Guillaume Melquiond, Inria, Laboratoire LRI, Toccata team
Nathalie Revol, Inria, Laboratoire LIP, AriC team
Serge Torres, ENS de Lyon, Laboratoire LIP, AriC team