CUDA device to host copy very slow
Question
I'm running Windows 7 64-bit, CUDA 4.2, Visual Studio 2010.
First, I run some code on the GPU and download the data back to the host. I then do some processing and move the data back to the device. After that, I perform the following device-to-host copy, which runs very fast, in about 1 ms.
clock_t start, end;
count=1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
It takes ~1 ms to finish.
Then I ran some other code on the GPU, mainly atomic operations. When I then copy the data from device to host, it takes a very long time, about 9 s.
__global__ void dosomething(int *d_bPtr)
{
....
atomicExch(d_bPtr,c);
....
}
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
This takes ~9 s.
I ran the code several times, for example:
int i=0;
while (i<10)
{
clock_t start, end;
count=1000000;
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
// launch the dosomething kernel here (launch configuration not given in the original post)
dosomething<<<...>>>(d_bPtr);
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
i++;
}
The results are pretty much the same.
What could be the problem?
Thanks!
Recommended answer
The problem is one of timing, not of any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also the time for the previously launched kernel to complete. If you change your code for timing the copy operation to something like this:
cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
You should find that the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudo code you posted, the answer is "because it is full of atomicExch() calls which serialise kernel memory transactions".
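To see where the time actually goes, the kernel and the copy can be timed separately. The following is only a sketch under stated assumptions: the kernel `contended_exchange`, the `Timings` struct, the helper `time_kernel_then_copy`, and the launch configuration of 256 threads per block are all hypothetical and not taken from the question. The kernel assumes, as the posted pseudocode suggests, that every thread calls atomicExch on the same address. CUDA events are used here instead of clock() because they measure elapsed time on the device.

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the question's kernel: every thread performs
// an atomicExch on the same address, so the updates are serialised.
__global__ void contended_exchange(int *d_ptr, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        atomicExch(d_ptr, tid);
    }
}

struct Timings {
    float kernel_ms; // time for the kernel to run to completion
    float copy_ms;   // time for the device-to-host copy alone
};

// Times the kernel and the copy separately using CUDA events.
Timings time_kernel_then_copy(int count)
{
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(d_b.data());

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    Timings t;

    // Kernel: record events around the launch, then wait for the stop
    // event so the measured interval covers the whole kernel execution.
    cudaEventRecord(start);
    contended_exchange<<<(count + 255) / 256, 256>>>(d_bPtr, count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.kernel_ms, start, stop);

    // Copy: the kernel has already finished, so this interval
    // contains only the device-to-host transfer.
    cudaEventRecord(start);
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.copy_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return t;
}
```

Calling `time_kernel_then_copy(1000000)` and printing both fields should show whether the kernel or the transfer dominates. If the kernel time is the large one, reducing contention on the atomically updated address is the place to look, not the copy.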