parallel algorithms
The C++17 standard added the concept of parallel algorithms to the C++ Standard Library. These are additional overloads of many of the functions that operate on ranges. The parallel versions have the same signature as the �normal� single-threaded versions, except for the addition of a new first parameter, which specifies the execution policy to use.
The execution policies as defined in C++17 are:
- std::execution::seq
- std::execution::par
- std::execution::par_unseq
links:
- cppreference alogrithm
- cppreference
- P0024R2
- intel gettint started with pstl
- n3554 github, NVidia
- opencl based SyclParallelSTL
- [parallel algorithm performance] (https://www.bfilipek.com/2018/11/parallel-alg-perf.html)
- msdn cpp17 paralgo
- cpp17 parallelized algo
policy
Lattice representation of policies, as in C++17
par_unseq
/ \
/ \
/ \
seq par
seq policy
This is a serial execution, not a policy for parallelism. This has same algorithmic complexity as standard/normal implementation. It is executed in calling thread and in the order specified (definite order rather interleaved).
par policy
This policy provides for parallel execution across threads, #threads is defined by implementation. This may execute in main thread or in background threads. Any such invocations executing in the same thread are indeterminately sequenced with respect to each other.
par_unseq policy
This policy provides for max scope for parallelizing algorithm. The invocations are permitted to execute in an unordered fashion in unspecified threads, and unsequenced with respect to one another within each thread The execution order is not defined as execution could be interleaved. Basically it allows for parallel+vectorization
exception behavior
During the execution of a parallel algorithm with any of these execution policies, if the invocation of an element access function exits via an uncaught exception, std::terminate is called, but the implementations may define additional execution policies that handle exceptions differently.
Basically, this is where the behavior diverges with respect to pre-C++17 algorithms.
An an example this simple loop could be parallelized to do some operation:
|
|
becomes:
|
|
this is lot more verbose than where we started, but it allows for parallelization without much fuss. Also it is easy to run this serially by changing std::execution::par_unseq
to std::execution::seq
, keeping the exceptoin behavior the same (as noted above). I kind of like Intel TBB
better in this case:
|
|
From implementation perspective, I think almost all will provide a thread-pool based execution implementation. At least thats what MS has done.
Line#11 is where the work is submitted to a common thread-pool. At Line#23 _Run_chunked_parallel_work
does the work of populating work in ThreadPool thru _Work_ptr
. From for_each
perspective Line#40-Line#50 contains all the logic to convert a for_each
iteration to #tasks for parallelization.
|
|
I am quite curious when would MS throw _Parallelism_resources_exhausted
. Perhaps needs more digging, for later..