CHAPTER 1 Why Parallel Computing?
  1.1 Why We Need Ever-Increasing Performance
  1.2 Why We're Building Parallel Systems
  1.3 Why We Need to Write Parallel Programs
  1.4 How Do We Write Parallel Programs?
  1.5 What We'll Be Doing
  1.6 Concurrent, Parallel, Distributed
  1.7 The Rest of the Book
  1.8 A Word of Warning
  1.9 Typographical Conventions
  1.10 Summary
  1.11 Exercises
CHAPTER 2 Parallel Hardware and Parallel Software
  2.1 Some Background
    2.1.1 The von Neumann architecture
    2.1.2 Processes, multitasking, and threads
  2.2 Modifications to the von Neumann Model
    2.2.1 The basics of caching
    2.2.2 Cache mappings
    2.2.3 Caches and programs: an example
    2.2.4 Virtual memory
    2.2.5 Instruction-level parallelism
    2.2.6 Hardware multithreading
  2.3 Parallel Hardware
    2.3.1 SIMD systems
    2.3.2 MIMD systems
    2.3.3 Interconnection networks
    2.3.4 Cache coherence
    2.3.5 Shared-memory versus distributed-memory
  2.4 Parallel Software
    2.4.1 Caveats
    2.4.2 Coordinating the processes/threads
    2.4.3 Shared-memory
    2.4.4 Distributed-memory
    2.4.5 Programming hybrid systems
  2.5 Input and Output
  2.6 Performance
    2.6.1 Speedup and efficiency
    2.6.2 Amdahl's law
    2.6.3 Scalability
    2.6.4 Taking timings
  2.7 Parallel Program Design
    2.7.1 An example
  2.8 Writing and Running Parallel Programs
  2.9 Assumptions
  2.10 Summary
    2.10.1 Serial systems
    2.10.2 Parallel hardware
    2.10.3 Parallel software
    2.10.4 Input and output
    2.10.5 Performance
    2.10.6 Parallel program design
    2.10.7 Assumptions
  2.11 Exercises
CHAPTER 3 Distributed-Memory Programming with MPI
  3.1 Getting Started
    3.1.1 Compilation and execution
    3.1.2 MPI programs
    3.1.3 MPI_Init and MPI_Finalize
    3.1.4 Communicators, MPI_Comm_size and MPI_Comm_rank
    3.1.5 SPMD programs
    3.1.6 Communication
    3.1.7 MPI_Send
    3.1.8 MPI_Recv
    3.1.9 Message matching
    3.1.10 The status_p argument
    3.1.11 Semantics of MPI_Send and MPI_Recv
    3.1.12 Some potential pitfalls
  3.2 The Trapezoidal Rule in MPI
    3.2.1 The trapezoidal rule
    3.2.2 Parallelizing the trapezoidal rule
  3.3 Dealing with I/O
    3.3.1 Output
    3.3.2 Input
  3.4 Collective Communication
    3.4.1 Tree-structured communication
    3.4.2 MPI_Reduce
    3.4.3 Collective vs. point-to-point communications
    3.4.4 MPI_Allreduce
    3.4.5 Broadcast
    3.4.6 Data distributions
    3.4.7 Scatter
    3.4.8 Gather
    3.4.9 Allgather
  3.5 MPI Derived Datatypes
  3.6 Performance Evaluation of MPI Programs
    3.6.1 Taking timings
    3.6.2 Results
    3.6.3 Speedup and efficiency
    3.6.4 Scalability
  3.7 A Parallel Sorting Algorithm
    3.7.1 Some simple serial sorting algorithms
    3.7.2 Parallel odd-even transposition sort
    3.7.3 Safety in MPI programs
    3.7.4 Final details of parallel odd-even sort
  3.8 Summary
  3.9 Exercises
  3.10 Programming Assignments
CHAPTER 4 Shared-Memory Programming with Pthreads
  4.1 Processes, Threads, and Pthreads
  4.2 Hello, World
    4.2.1 Execution
    4.2.2 Preliminaries
    4.2.3 Starting the threads
    4.2.4 Running the threads
    4.2.5 Stopping the threads
    4.2.6 Error checking
    4.2.7 Other approaches to thread startup
  4.3 Matrix-Vector Multiplication
  4.4 Critical Sections
  4.5 Busy-Waiting
  4.6 Mutexes
  4.7 Producer-Consumer Synchronization and Semaphores
  4.8 Barriers and Condition Variables
    4.8.1 Busy-waiting and a mutex
    4.8.2 Semaphores
    4.8.3 Condition variables
    4.8.4 Pthreads barriers
  4.9 Read-Write Locks
    4.9.1 Linked list functions
    4.9.2 A multi-threaded linked list
    4.9.3 Pthreads read-write locks
    4.9.4 Performance of the various implementations
    4.9.5 Implementing read-write locks
  4.10 Caches, Cache Coherence, and False Sharing
  4.11 Thread-Safety
    4.11.1 Incorrect programs can produce correct output
  4.12 Summary
  4.13 Exercises
  4.14 Programming Assignments
CHAPTER 5 Shared-Memory Programming with OpenMP
  5.1 Getting Started
    5.1.1 Compiling and running OpenMP programs
    5.1.2 The program
    5.1.3 Error checking
  5.2 The Trapezoidal Rule
    5.2.1 A first OpenMP version
  5.3 Scope of Variables
  5.4 The Reduction Clause
  5.5 The parallel for Directive
    5.5.1 Caveats
    5.5.2 Data dependences
    5.5.3 Finding loop-carried dependences
    5.5.4 Estimating
    5.5.5 More on scope
  5.6 More About Loops in OpenMP: Sorting
    5.6.1 Bubble sort
    5.6.2 Odd-even transposition sort
  5.7 Scheduling Loops
    5.7.1 The schedule clause
    5.7.3 The dynamic and guided schedule types
    5.7.4 The runtime schedule type
    5.7.5 Which schedule?
  5.8 Producers and Consumers
    5.8.1 Queues
    5.8.2 Message-passing
    5.8.3 Sending messages
    5.8.4 Receiving messages
    5.8.5 Termination detection
    5.8.6 Startup
    5.8.7 The atomic directive
    5.8.8 Critical sections and locks
    5.8.9 Using locks in the message-passing program
    5.8.10 critical directives, atomic directives, or locks?
    5.8.11 Some caveats
  5.9 Caches, Cache Coherence, and False Sharing
  5.10 Thread-Safety
    5.10.1 Incorrect programs can produce correct output
  5.11 Summary
  5.12 Exercises
  5.13 Programming Assignments
CHAPTER 6 Parallel Program Development
  6.1 Two n-Body Solvers
    6.1.1 The problem
    6.1.2 Two serial programs
    6.1.3 Parallelizing the n-body solvers
    6.1.4 A word about I/O
    6.1.5 Parallelizing the basic solver using OpenMP
    6.1.6 Parallelizing the reduced solver using OpenMP
    6.1.7 Evaluating the OpenMP codes
    6.1.8 Parallelizing the solvers using pthreads
    6.1.9 Parallelizing the basic solver using MPI
    6.1.10 Parallelizing the reduced solver using MPI
    6.1.11 Performance of the MPI solvers
  6.2 Tree Search
    6.2.1 Recursive depth-first search
    6.2.2 Nonrecursive depth-first search
    6.2.3 Data structures for the serial implementations
    6.2.6 A static parallelization of tree search using pthreads
    6.2.7 A dynamic parallelization of tree search using pthreads
    6.2.8 Evaluating the pthreads tree-search programs
    6.2.9 Parallelizing the tree-search programs using OpenMP
    6.2.10 Performance of the OpenMP implementations
    6.2.11 Implementation of tree search using MPI and static partitioning
    6.2.12 Implementation of tree search using MPI and dynamic partitioning
  6.3 A Word of Caution
  6.4 Which API?
  6.5 Summary
    6.5.1 Pthreads and OpenMP
    6.5.2 MPI
  6.6 Exercises
  6.7 Programming Assignments
CHAPTER 7 Where to Go from Here
References
Index