openmp中的并行for循环

2023-09-26C/C++开发问题
7

本文介绍了openmp中的并行for循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着跟版网的小编来一起学习吧!

问题描述

我正在尝试并行化一个非常简单的 for 循环,但这是我很长时间以来第一次尝试使用 openMP.我对运行时间感到困惑.这是我的代码:

I'm trying to parallelize a very simple for-loop, but this is my first attempt at using openMP in a long time. I'm getting baffled by the run times. Here is my code:

#include <vector>
#include <algorithm>

using namespace std;

int main () 
{
    int n=400000,  m=1000;  
    double x=0,y=0;
    double s=0;
    vector< double > shifts(n,0);


    #pragma omp parallel for 
    for (int j=0; j<n; j++) {

        double r=0.0;
        for (int i=0; i < m; i++){

            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));     

            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }

    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}

我用

g++ -O3 testMP.cc -o testMP  -I /opt/boost_1_48_0/include

也就是说,没有-fopenmp",我得到了这些时间:

that is, no "-fopenmp", and I get these timings:

real    0m18.417s
user    0m18.357s
sys     0m0.004s

当我使用-fopenmp"时,

when I do use "-fopenmp",

g++ -O3 -fopenmp testMP.cc -o testMP  -I /opt/boost_1_48_0/include

我得到了这些数字:

real    0m6.853s
user    0m52.007s
sys     0m0.008s

这对我来说没有意义.如何使用八个内核只能导致 3 倍性能提升?我是否正确编码循环?

which doesn't make sense to me. How using eight cores can only result in just 3-fold increase of performance? Am I coding the loop correctly?

推荐答案

您应该对 xy 使用 OpenMP reduction 子句>:

You should make use of the OpenMP reduction clause for x and y:

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {

    double r=0.0;
    for (int i=0; i < m; i++){

        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));     

        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

使用 reduction 每个线程在 xy 中累积自己的部分和,最后将所有部分值相加,以便获取最终值.

With reduction each thread accumulates its own partial sum in x and y and in the end all partial values are summed together in order to obtain the final values.

Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total

参见 - 超线性加速 :)

See - superlinear speed-up :)

这篇关于openmp中的并行for循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!

The End

相关推荐

无法访问 C++ std::set 中对象的非常量成员函数
Unable to access non-const member functions of objects in C++ std::set(无法访问 C++ std::set 中对象的非常量成员函数)...
2024-08-14 C/C++开发问题
17

从 lambda 构造 std::function 参数
Constructing std::function argument from lambda(从 lambda 构造 std::function 参数)...
2024-08-14 C/C++开发问题
25

STL BigInt 类实现
STL BigInt class implementation(STL BigInt 类实现)...
2024-08-14 C/C++开发问题
3

使用 std::atomic 和 std::condition_variable 同步不可靠
Sync is unreliable using std::atomic and std::condition_variable(使用 std::atomic 和 std::condition_variable 同步不可靠)...
2024-08-14 C/C++开发问题
17

在 STL 中将列表元素移动到末尾
Move list element to the end in STL(在 STL 中将列表元素移动到末尾)...
2024-08-14 C/C++开发问题
9

为什么禁止对存储在 STL 容器中的类重载 operator&amp;()?
Why is overloading operatoramp;() prohibited for classes stored in STL containers?(为什么禁止对存储在 STL 容器中的类重载 operatoramp;()?)...
2024-08-14 C/C++开发问题
6