|
|
|
xv | |
|
|
|
xxi | |
|
|
|
xxv | |
| Foreword |
|
xxix | |
| Preface |
|
xxxiii | |
| Acknowledgments |
|
xli | |
| About the Authors |
|
xliii | |
|
Part I The OpenCL 1.1 Language and API |
|
|
1 | (390) |
|
1 An Introduction to OpenCL |
|
|
3 | (36) |
|
What Is OpenCL, or ... Why You Need This Book |
|
|
3 | (1) |
|
Our Many-Core Future: Heterogeneous Platforms |
|
|
4 | (3) |
|
Software in a Many-Core World |
|
|
7 | (4) |
|
Conceptual Foundations of OpenCL |
|
|
11 | (18) |
|
|
|
12 | (1) |
|
|
|
13 | (8) |
|
|
|
21 | (3) |
|
|
|
24 | (5) |
|
|
|
29 | (1) |
|
|
|
30 | (5) |
|
|
|
31 | (1) |
|
|
|
31 | (1) |
|
Kernel Programming Language |
|
|
32 | (2) |
|
|
|
34 | (1) |
|
|
|
35 | (1) |
|
|
|
36 | (3) |
|
2 HelloWorld: An OpenCL Example |
|
|
39 | (24) |
|
|
|
40 | (5) |
|
|
|
40 | (1) |
|
Mac OS X and Code::Blocks |
|
|
41 | (1) |
|
Microsoft Windows and Visual Studio |
|
|
42 | (2) |
|
|
|
44 | (1) |
|
|
|
45 | (12) |
|
Choosing an OpenCL Platform and Creating a Context |
|
|
49 | (1) |
|
Choosing a Device and Creating a Command-Queue |
|
|
50 | (2) |
|
Creating and Building a Program Object |
|
|
52 | (2) |
|
Creating Kernel and Memory Objects |
|
|
54 | (1) |
|
|
|
55 | (2) |
|
Checking for Errors in OpenCL |
|
|
57 | (6) |
|
3 Platforms, Contexts, and Devices |
|
|
63 | (34) |
|
|
|
63 | (5) |
|
|
|
68 | (15) |
|
|
|
83 | (14) |
|
4 Programming with OpenCL C |
|
|
97 | (52) |
|
Writing a Data-Parallel Kernel Using OpenCL C |
|
|
97 | (2) |
|
|
|
99 | (3) |
|
|
|
101 | (1) |
|
|
|
102 | (6) |
|
|
|
104 | (2) |
|
|
|
106 | (2) |
|
|
|
108 | (1) |
|
|
|
109 | (1) |
|
Implicit Type Conversions |
|
|
110 | (6) |
|
Usual Arithmetic Conversions |
|
|
114 | (2) |
|
|
|
116 | (1) |
|
|
|
117 | (4) |
|
Reinterpreting Data as Another Type |
|
|
121 | (2) |
|
|
|
123 | (10) |
|
|
|
124 | (3) |
|
Relational and Equality Operators |
|
|
127 | (1) |
|
|
|
127 | (1) |
|
|
|
128 | (1) |
|
|
|
129 | (1) |
|
|
|
129 | (2) |
|
|
|
131 | (1) |
|
|
|
132 | (1) |
|
|
|
133 | (8) |
|
|
|
133 | (1) |
|
Kernel Attribute Qualifiers |
|
|
134 | (1) |
|
|
|
135 | (5) |
|
|
|
140 | (1) |
|
|
|
141 | (1) |
|
|
|
141 | (1) |
|
Preprocessor Directives and Macros |
|
|
141 | (5) |
|
|
|
143 | (2) |
|
|
|
145 | (1) |
|
|
|
146 | (3) |
|
5 OpenCL C Built-in Functions |
|
|
149 | (68) |
|
|
|
150 | (3) |
|
|
|
153 | (15) |
|
|
|
162 | (1) |
|
|
|
162 | (1) |
|
|
|
163 | (5) |
|
|
|
168 | (4) |
|
|
|
172 | (3) |
|
|
|
175 | (1) |
|
|
|
175 | (6) |
|
Vector Data Load and Store Functions |
|
|
181 | (9) |
|
Synchronization Functions |
|
|
190 | (1) |
|
Async Copy and Prefetch Functions |
|
|
191 | (4) |
|
|
|
195 | (4) |
|
Miscellaneous Vector Functions |
|
|
199 | (2) |
|
Image Read and Write Functions |
|
|
201 | (16) |
|
|
|
201 | (5) |
|
|
|
206 | (3) |
|
Determining the Border Color |
|
|
209 | (1) |
|
|
|
210 | (4) |
|
Querying Image Information |
|
|
214 | (3) |
|
6 Programs and Kernels |
|
|
217 | (30) |
|
Program and Kernel Object Overview |
|
|
217 | (1) |
|
|
|
218 | (19) |
|
Creating and Building Programs |
|
|
218 | (4) |
|
|
|
222 | (5) |
|
Creating Programs from Binaries |
|
|
227 | (9) |
|
Managing and Querying Programs |
|
|
236 | (1) |
|
|
|
237 | (10) |
|
Creating Kernel Objects and Setting Kernel Arguments |
|
|
237 | (4) |
|
|
|
241 | (1) |
|
Managing and Querying Kernels |
|
|
242 | (5) |
|
7 Buffers and Sub-Buffers |
|
|
247 | (34) |
|
Memory Objects, Buffers, and Sub-Buffers Overview |
|
|
247 | (2) |
|
Creating Buffers and Sub-Buffers |
|
|
249 | (8) |
|
Querying Buffers and Sub-Buffers |
|
|
257 | (2) |
|
Reading, Writing, and Copying Buffers and Sub-Buffers |
|
|
259 | (17) |
|
Mapping Buffers and Sub-Buffers |
|
|
276 | (5) |
|
8 Images and Samplers |
|
|
281 | (28) |
|
Image and Sampler Object Overview |
|
|
281 | (2) |
|
|
|
283 | (9) |
|
|
|
287 | (4) |
|
Querying for Image Support |
|
|
291 | (1) |
|
|
|
292 | (3) |
|
OpenCL C Functions for Working with Images |
|
|
295 | (4) |
|
Transferring Image Objects |
|
|
299 | (10) |
|
9 Events |
|
|
309 | (26) |
|
Commands, Queues, and Events Overview |
|
|
309 | (2) |
|
Events and Command-Queues |
|
|
311 | (6) |
|
|
|
317 | (4) |
|
Generating Events on the Host |
|
|
321 | (1) |
|
Events Impacting Execution on the Host |
|
|
322 | (5) |
|
Using Events for Profiling |
|
|
327 | (5) |
|
|
|
332 | (1) |
|
Events from Outside OpenCL |
|
|
333 | (2) |
|
10 Interoperability with OpenGL |
|
|
335 | (18) |
|
OpenCL/OpenGL Sharing Overview |
|
|
335 | (1) |
|
Querying for the OpenGL Sharing Extension |
|
|
336 | (2) |
|
Initializing an OpenCL Context for OpenGL Interoperability |
|
|
338 | (1) |
|
Creating OpenCL Buffers from OpenGL Buffers |
|
|
339 | (5) |
|
Creating OpenCL Image Objects from OpenGL Textures |
|
|
344 | (3) |
|
Querying Information about OpenGL Objects |
|
|
347 | (1) |
|
Synchronization between OpenGL and OpenCL |
|
|
348 | (5) |
|
11 Interoperability with Direct3D |
|
|
353 | (16) |
|
Direct3D/OpenCL Sharing Overview |
|
|
353 | (1) |
|
Initializing an OpenCL Context for Direct3D Interoperability |
|
|
354 | (3) |
|
Creating OpenCL Memory Objects from Direct3D Buffers and Textures |
|
|
357 | (4) |
|
Acquiring and Releasing Direct3D Objects in OpenCL |
|
|
361 | (2) |
|
Processing a Direct3D Texture in OpenCL |
|
|
363 | (3) |
|
Processing D3D Vertex Data in OpenCL |
|
|
366 | (3) |
|
12 C++ Wrapper API |
|
|
369 | (14) |
|
|
|
369 | (2) |
|
C++ Wrapper API Exceptions |
|
|
371 | (3) |
|
Vector Add Example Using the C++ Wrapper API |
|
|
374 | (9) |
|
Choosing an OpenCL Platform and Creating a Context |
|
|
375 | (1) |
|
Choosing a Device and Creating a Command-Queue |
|
|
376 | (1) |
|
Creating and Building a Program Object |
|
|
377 | (1) |
|
Creating Kernel and Memory Objects |
|
|
377 | (1) |
|
Executing the Vector Add Kernel |
|
|
378 | (5) |
|
13 OpenCL Embedded Profile |
|
|
383 | (8) |
|
|
|
383 | (2) |
|
|
|
385 | (1) |
|
|
|
386 | (1) |
|
Built-in Atomic Functions |
|
|
387 | (1) |
|
Mandated Minimum Single-Precision Floating-Point Capabilities |
|
|
387 | (3) |
|
Determining the Profile Supported by a Device in an OpenCL C Program |
|
|
390 | (1) |
|
Part II OpenCL 1.1 Case Studies |
|
|
391 | (150) |
|
14 Image Histogram |
|
|
393 | (14) |
|
Computing an Image Histogram |
|
|
393 | (2) |
|
Parallelizing the Image Histogram |
|
|
395 | (5) |
|
Additional Optimizations to the Parallel Image Histogram |
|
|
400 | (3) |
|
Computing Histograms with Half-Float or Float Values for Each Channel |
|
|
403 | (4) |
|
15 Sobel Edge Detection Filter |
|
|
407 | (4) |
|
What Is a Sobel Edge Detection Filter? |
|
|
407 | (1) |
|
Implementing the Sobel Filter as an OpenCL Kernel |
|
|
407 | (4) |
|
16 Parallelizing Dijkstra's Single-Source Shortest-Path Graph Algorithm |
|
|
411 | (14) |
|
|
|
412 | (2) |
|
|
|
414 | (3) |
|
Leveraging Multiple Compute Devices |
|
|
417 | (8) |
|
17 Cloth Simulation in the Bullet Physics SDK |
|
|
425 | (24) |
|
An Introduction to Cloth Simulation |
|
|
425 | (4) |
|
|
|
429 | (2) |
|
Executing the Simulation on the CPU |
|
|
431 | (1) |
|
Changes Necessary for Basic GPU Execution |
|
|
432 | (6) |
|
|
|
438 | (3) |
|
Optimizing for SIMD Computation and Local Memory |
|
|
441 | (5) |
|
Adding OpenGL Interoperation |
|
|
446 | (3) |
|
18 Simulating the Ocean with Fast Fourier Transform |
|
|
449 | (20) |
|
An Overview of the Ocean Application |
|
|
450 | (3) |
|
Phillips Spectrum Generation |
|
|
453 | (4) |
|
An OpenCL Discrete Fourier Transform |
|
|
457 | (6) |
|
Determining 2D Decomposition |
|
|
457 | (2) |
|
|
|
459 | (1) |
|
Determining the Sub-Transform Size |
|
|
459 | (1) |
|
Determining the Work-Group Size |
|
|
460 | (1) |
|
Obtaining the Twiddle Factors |
|
|
461 | (1) |
|
Determining How Much Local Memory Is Needed |
|
|
462 | (1) |
|
Avoiding Local Memory Bank Conflicts |
|
|
463 | (1) |
|
|
|
463 | (1) |
|
A Closer Look at the FFT Kernel |
|
|
463 | (4) |
|
A Closer Look at the Transpose Kernel |
|
|
467 | (2) |
|
19 Optical Flow |
|
|
469 | (18) |
|
Optical Flow Problem Overview |
|
|
469 | (11) |
|
Sub-Pixel Accuracy with Hardware Linear Interpolation |
|
|
480 | (1) |
|
Application of the Texture Cache |
|
|
480 | (1) |
|
|
|
481 | (2) |
|
Early Exit and Hardware Scheduling |
|
|
483 | (1) |
|
Efficient Visualization with OpenGL Interop |
|
|
483 | (1) |
|
|
|
484 | (3) |
|
20 Using OpenCL with PyOpenCL |
|
|
487 | (12) |
|
|
|
487 | (1) |
|
Running the PyImageFilter2D Example |
|
|
488 | (1) |
|
|
|
488 | (4) |
|
Context and Command-Queue Creation |
|
|
492 | (1) |
|
Loading to an Image Object |
|
|
493 | (1) |
|
Creating and Building a Program |
|
|
494 | (1) |
|
Setting Kernel Arguments and Executing a Kernel |
|
|
495 | (1) |
|
|
|
496 | (3) |
|
21 Matrix Multiplication with OpenCL |
|
|
499 | (16) |
|
The Basic Matrix Multiplication Algorithm |
|
|
499 | (2) |
|
A Direct Translation into OpenCL |
|
|
501 | (5) |
|
Increasing the Amount of Work per Kernel |
|
|
506 | (3) |
|
Optimizing Memory Movement: Local Memory |
|
|
509 | (2) |
|
Performance Results and Optimizing the Original CPU Code |
|
|
511 | (4) |
|
22 Sparse Matrix-Vector Multiplication |
|
|
515 | (26) |
|
Sparse Matrix-Vector Multiplication (SpMV) Algorithm |
|
|
515 | (3) |
|
Description of This Implementation |
|
|
518 | (1) |
|
Tiled and Packetized Sparse Matrix Representation |
|
|
519 | (3) |
|
|
|
522 | (1) |
|
Tiled and Packetized Sparse Matrix Design Considerations |
|
|
523 | (1) |
|
Optional Team Information |
|
|
524 | (1) |
|
Tested Hardware Devices and Results |
|
|
524 | (14) |
|
Additional Areas of Optimization |
|
|
538 | (3) |
|
A Summary of OpenCL 1.1 |
|
|
541 | (40) |
|
The OpenCL Platform Layer |
|
|
541 | (2) |
|
|
|
541 | (1) |
|
Querying Platform Information and Devices |
|
|
542 | (1) |
|
|
|
543 | (1) |
|
|
|
543 | (1) |
|
|
|
544 | (2) |
|
|
|
544 | (1) |
|
Read, Write, and Copy Buffer Objects |
|
|
544 | (1) |
|
|
|
545 | (1) |
|
|
|
545 | (1) |
|
|
|
545 | (1) |
|
|
|
546 | (1) |
|
|
|
546 | (1) |
|
|
|
546 | (1) |
|
|
|
546 | (1) |
|
|
|
547 | (1) |
|
Unload the OpenCL Compiler |
|
|
547 | (1) |
|
|
|
547 | (3) |
|
|
|
547 | (1) |
|
Kernel Arguments and Object Queries |
|
|
548 | (1) |
|
|
|
548 | (1) |
|
|
|
549 | (1) |
|
Out-of-Order Execution of Kernels and Memory Object Commands |
|
|
549 | (1) |
|
|
|
549 | (1) |
|
|
|
550 | (1) |
|
|
|
550 | (2) |
|
Built-in Scalar Data Types |
|
|
550 | (1) |
|
Built-in Vector Data Types |
|
|
551 | (1) |
|
Other Built-in Data Types |
|
|
551 | (1) |
|
|
|
551 | (1) |
|
Vector Component Addressing |
|
|
552 | (3) |
|
|
|
552 | (1) |
|
Vector Addressing Equivalencies |
|
|
553 | (1) |
|
Conversions and Type Casting Examples |
|
|
554 | (1) |
|
|
|
554 | (1) |
|
|
|
554 | (1) |
|
|
|
554 | (1) |
|
Preprocessor Directives and Macros |
|
|
555 | (1) |
|
|
|
555 | (1) |
|
|
|
556 | (1) |
|
Work-Item Built-in Functions |
|
|
557 | (1) |
|
Integer Built-in Functions |
|
|
557 | (2) |
|
Common Built-in Functions |
|
|
559 | (1) |
|
|
|
560 | (3) |
|
Geometric Built-in Functions |
|
|
563 | (1) |
|
Relational Built-in Functions |
|
|
564 | (3) |
|
Vector Data Load/Store Functions |
|
|
567 | (1) |
|
|
|
568 | (2) |
|
Async Copies and Prefetch Functions |
|
|
570 | (1) |
|
Synchronization, Explicit Memory Fence |
|
|
570 | (1) |
|
Miscellaneous Vector Built-in Functions |
|
|
571 | (1) |
|
Image Read and Write Built-in Functions |
|
|
572 | (1) |
|
|
|
573 | (3) |
|
|
|
573 | (1) |
|
Query List of Supported Image Formats |
|
|
574 | (1) |
|
Copy between Image, Buffer Objects |
|
|
574 | (1) |
|
Map and Unmap Image Objects |
|
|
574 | (1) |
|
Read, Write, Copy Image Objects |
|
|
575 | (1) |
|
|
|
575 | (1) |
|
|
|
576 | (1) |
|
|
|
576 | (1) |
|
|
|
576 | (1) |
|
Sampler Declaration Fields |
|
|
577 | (1) |
|
OpenCL Device Architecture Diagram |
|
|
577 | (1) |
|
OpenCL/OpenGL Sharing APIs |
|
|
577 | (2) |
|
CL Buffer Objects > GL Buffer Objects |
|
|
578 | (1) |
|
CL Image Objects > GL Textures |
|
|
578 | (1) |
|
CL Image Objects > GL Renderbuffers |
|
|
578 | (1) |
|
|
|
578 | (1) |
|
|
|
579 | (1) |
|
CL Event Objects > GL Sync Objects |
|
|
579 | (1) |
|
CL Context > GL Context, Sharegroup |
|
|
579 | (1) |
|
OpenCL/Direct3D 10 Sharing APIs |
|
|
579 | (2) |
| Index |
|
581 | |