Intel Guide for Developing Multithreaded Applications

The Guide provides general advice on multithreaded application performance. Hardware-specific optimizations have deliberately been kept to a minimum. In future versions of the Guide, topics covering hardware-specific optimizations will be added for developers willing to sacrifice portability for higher performance.

Application Threading
This chapter covers general topics in parallel performance but occasionally refers to API-specific issues.
1-1 – Predicting and Measuring Parallel Performance
1-2 – Loop Modifications to Enhance Data-Parallel Performance
1-3 – Granularity and Parallel Performance
1-4 – Load Balance and Parallel Performance
1-5 – Expose Parallelism by Avoiding or Removing Artificial Dependencies
1-6 – Using Tasks Instead of Threads
1-7 – Exploiting Data Parallelism in Ordered Data Streams
The topics in this chapter discuss techniques to mitigate the negative impact of synchronization on performance.
Synchronization
2-1 – Managing Lock Contention: Large and Small Critical Sections
2-2 – Use Synchronization Routines Provided by the Threading API Rather than Hand-Coded Synchronization
2-3 – Choosing Appropriate Synchronization Primitives to Minimize Overhead
2-4 – Use Non-blocking Locks When Possible
Memory Management
Threads add another dimension to memory management that should not be ignored. This chapter covers memory issues that are unique to multithreaded applications.
3-1 – Avoiding Heap Contention Among Threads
3-2 – Use Thread-local Storage to Reduce Synchronization
3-3 – Detecting Memory Bandwidth Saturation in Threaded Applications
3-4 – Avoiding and Identifying False Sharing Among Threads
Programming Tools
This chapter describes how to use Intel software products to develop, debug, and optimize multithreaded applications.
4-1 – Automatic Parallelization with Intel® Compilers
4-2 – Parallelism in the Intel® Math Kernel Library
4-3 – Threading and Intel® Integrated Performance Primitives