cpp17 std parallel algo

parallel algorithms

The C++17 standard added the concept of parallel algorithms to the C++ Standard Library. These are additional overloads of many of the functions that operate on ranges. The parallel versions have the same signature as the “normal” single-threaded versions, except for the addition of a new first parameter, which specifies the execution policy to use.

The execution policies as defined in C++17 are: * std::execution::seq * std::execution::par * std::execution::par_unseq

  1. cppreference alogrithm
  2. cppreference
  3. P0024R2
  4. intel gettint started with pstl
  5. n3554 github, NVidia
  6. opencl based SyclParallelSTL
  7. parallel algorithm performance
  8. msdn cpp17 paralgo
  9. cpp17 parallelized algo

policy

Lattice representation of policies, as in C++17

                par_unseq
                /       \
               /         \
              /           \
             seq          par

seq policy

This is a serial execution, not a policy for parallelism. This has same algorithmic complexity as standard/normal implementation. It is executed in calling thread and in the order specified (definite order rather interleaved).

par policy

This policy provides for parallel execution across threads, #threads is defined by implementation. This may execute in main thread or in background threads. Any such invocations executing in the same thread are indeterminately sequenced with respect to each other.

par_unseq policy

This policy provides for max scope for parallelizing algorithm. The invocations are permitted to execute in an unordered fashion in unspecified threads, and unsequenced with respect to one another within each thread The execution order is not defined as execution could be interleaved. Basically it allows for parallel+vectorization

  1. whats-the-point-of-stdexecutionpar-unseq
  2. difference-between-execution-policies

exception behavior

During the execution of a parallel algorithm with any of these execution policies, if the invocation of an element access function exits via an uncaught exception, std::terminate is called, but the implementations may define additional execution policies that handle exceptions differently.

Basically, this is where the behavior diverges with respect to pre-C++17 algorithms.

  1. Parallel algorithm exceptions

An an example this simple loop could be parallelized to do some operation:

    for(size_t idx = 0; idx < HEIGHT; ++idx)
    {
        //...
    }

becomes:

    std::vector<int> ivec{};
    ivec.resize(HEIGHT+1);
    for(size_t idx = 0; idx < ivec.size(); ++idx) ivec[idx] = idx;
    
    std::for_each(std::execution::par_unseq, ivec.begin(), ivec.end(), 
                [&](int i)->void
                {
                    //...
                });

this is lot more verbose than where we started, but it allows for parallelization without much fuss. Also it is easy to run this serially by changing std::execution::par_unseq to std::execution::seq, keeping the exceptoin behavior the same (as noted above). I kind of like Intel TBB better in this case:

    
    constexprt size_t Stride = 1; // parallel, setting this to HEIGHT+1 will make it serialized
    tbb::parallel_for(tbb::blocked_range<size_t>(0, HEIGHT+1)
                [&](const tbb::blocked_range<size_t>& range)->void
                {
                    for(size_t idx = range.begin(); idx < range.end(); ++idx)
                    {
                        //...
                    }
                });

From implementation perspective, I think almost all will provide a thread-pool based execution implementation. At least thats what MS has done. Line#11 is where the work is submitted to a common thread-pool. At Line#23 _Run_chunked_parallel_work does the work of populating work in ThreadPool thru _Work_ptr. From for_each perspective Line#40-Line#50 contains all the logic to convert a for_each iteration to #tasks for parallelization.

// https://github.com/microsoft/STL/blob/580e61a5f5536016f521cc1b2ca636eadc69bd30/stl/inc/execution
//


// CLASS _Work_ptr
class _Work_ptr {
public:
    // ...

    void _Submit(const size_t _Submissions) const noexcept {
        __std_bulk_submit_threadpool_work(_Ptp_work, _Submissions);
    }

    void _Submit_for_chunks(const size_t _Hw_threads, const size_t _Chunks) const noexcept {
        _Submit(_Min_value(_Hw_threads * _Oversubmission_multiplier, _Chunks));
    }

};

//....

// FUNCTION TEMPLATE _Run_chunked_parallel_work
template <class _Work>
void _Run_chunked_parallel_work(
    const size_t _Hw_threads, _Work& _Operation) { // process chunks of _Operation on the thread pool
    const _Work_ptr _Work_op{_Operation};
    // setup complete, hereafter nothrow or terminate
    _Work_op._Submit_for_chunks(_Hw_threads, _Operation._Team._Chunks);
    _Run_available_chunked_work(_Operation);
}

//....

template <class _ExPo, class _FwdIt, class _Fn, _Enable_if_execution_policy_t<_ExPo> /* = 0 */>
void for_each(_ExPo&&, _FwdIt _First, _FwdIt _Last, _Fn _Func) noexcept /* terminates */ {
    ...
        const size_t _Hw_threads = __std_parallel_algorithms_hw_threads();
        if (_Hw_threads > 1) { // parallelize on multiprocessor machines...
            auto _Count = _STD distance(_UFirst, _ULast);
            if (_Count >= 2) { // ... with at least 2 elements
                _TRY_BEGIN
                auto _Passed_fn = _Pass_fn(_Func);
                _Static_partitioned_for_each2<decltype(_UFirst), decltype(_Count), decltype(_Passed_fn)> _Operation{
                    _Hw_threads, _Count, _Passed_fn};
                _Operation._Basis._Populate(_Operation._Team, _UFirst);
                _Run_chunked_parallel_work(_Hw_threads, _Operation);
                return;
                _CATCH(const _Parallelism_resources_exhausted&)
                // fall through to serial case below
                _CATCH_END
            }
        }
     ...
}

I am quite curious when would MS throw _Parallelism_resources_exhausted. Perhaps needs more digging, for later..