CUDA Program Optimization Techniques
The main points are as follows.
1. Use shared memory to reduce global memory reads
Avoid reading the same data repeatedly from global memory. There is real design work here: the thread organization must be laid out so that shared memory is used as much as possible; the matrix multiplication optimization problem is the standard reference.
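As a rough illustration (a sketch, not code from the original post), the classic tiled matrix multiplication kernel shows the idea; TILE, the square matrix size n (assumed to be a multiple of TILE), and the row-major layout are assumptions made for this sketch.

#define TILE 16

// C = A * B for n x n matrices; each element of A and B is read from
// global memory n/TILE times instead of n times.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current tile of A and of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)   // reuse the tile from shared memory
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // wait before overwriting the tile
    }
    C[row * n + col] = sum;
}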
2. Bind global memory to a texture.
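A minimal sketch using the legacy texture-reference API of this era of CUDA (it has since been deprecated in favor of texture objects); the kernel, buffer names, and block size are assumptions, not part of the original post.

// File-scope texture reference; reads through it go via the texture cache.
texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

__global__ void scale(float *out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * tex1Dfetch(texIn, i);   // cached read instead of a plain global load
}

void runScale(const float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));   // bind the existing global-memory buffer
    scale<<<(n + 255) / 256, 256>>>(d_out, n, 2.0f);
    cudaUnbindTexture(texIn);
}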
3. Reduce bank conflicts, and have different threads read consecutive memory.
If the global memory accessed by a half-warp of threads starts at an address that is a suitable multiple of 16 bytes, the accesses can be merged into a single transaction; consecutive threads should therefore access consecutive memory.
For shared memory, see http://blog.csdn.net/smsmn/article/details/6336060 (in a CPU program this would be a matter of improving the cache hit rate, but here it does not seem to be a cache-hit issue).
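For the shared-memory side, a hedged sketch of the usual padding trick: a matrix transpose tile with one extra column so that column reads by a (half-)warp do not land in the same bank. TDIM and the assumption of a square matrix whose side is a multiple of TDIM are illustrative only.

#define TDIM 16

__global__ void transpose(const float *in, float *out, int width)
{
    __shared__ float tile[TDIM][TDIM + 1];   // +1 padding column avoids bank conflicts on column reads

    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced global read
    __syncthreads();

    int tx = blockIdx.y * TDIM + threadIdx.x;
    int ty = blockIdx.x * TDIM + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read; no conflict thanks to padding
}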
4. Size and alignment requirements: memory alignment.
The GPU's memory controller reads most efficiently when an access starts at an address that is a multiple of some fixed size (e.g., a multiple of 16 bytes). When allocating, use cudaMallocPitch instead of cudaMalloc, and correspondingly cudaMemcpy2D instead of cudaMemcpy. (The principle is essentially the same as for coalesced access.) This mainly concerns global memory.
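A brief sketch of pitched allocation (width, height, and h_img are assumed names): cudaMallocPitch picks a row stride (the pitch) that keeps every row properly aligned, and cudaMemcpy2D reconciles the host and device strides.

void allocateAndUpload(const float *h_img, int width, int height)
{
    float *d_img;
    size_t pitch;   // row stride in bytes, chosen by the runtime for alignment
    cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);

    // Host rows are tightly packed; cudaMemcpy2D handles the differing row strides.
    cudaMemcpy2D(d_img, pitch,
                 h_img, width * sizeof(float),
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);

    // Inside a kernel, index row y through the pitch:
    //   float *row = (float *)((char *)d_img + y * pitch);
    //   float v   = row[x];

    cudaFree(d_img);
}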
5. Coalesced access
Accesses from different threads can be merged: if adjacent threads access adjacent data, the accesses are coalesced.
Note:
Global memory accesses should be as contiguous as possible; this is a consequence of how DRAM is accessed. More precisely, global memory accesses need to be "coalesced". Coalesced means that, besides being contiguous, the starting address must be a multiple of 16 times the size each thread accesses. For example, if every thread reads 32 bits of data, the address read by the first thread must be a multiple of 16 * 4 = 64 bytes.
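A small sketch contrasting the two patterns (kernel names are illustrative): in the first kernel consecutive threads touch consecutive addresses, so a half-warp's reads can be merged into few transactions; the strided kernel breaks that.

__global__ void coalescedAccess(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];             // consecutive threads -> consecutive addresses
}

__global__ void stridedAccess(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = 2.0f * in[i];             // gaps between threads -> many more transactions
}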
6. Use stream parallelism
Stream parallelism is task-level parallelism: when there are several mutually independent tasks, we can write multiple kernels and, if resources allow, attach them to different streams and launch them, which gives coarser-grained parallelism.
Code sample 1:
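A minimal sketch of the pattern; the kernels, data, and launch configuration are chosen only for illustration.

__global__ void kernelA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
__global__ void kernelB(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }

void launchOnTwoStreams(float *d_a, float *d_b, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two independent kernels placed on different streams; resources permitting,
    // the device can execute them concurrently.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}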
Experiments show that stream parallelism achieves the same efficiency as adding another dimension to the thread grid; the code above can equally be implemented by increasing the number of thread blocks.
Code sample 2:
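Again only a sketch with assumed names: the same two independent tasks expressed as a single launch whose extra grid dimension selects the task, i.e., more thread blocks instead of more streams.

__global__ void kernelAB(float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.y == 0)
        a[i] += 1.0f;   // task A
    else
        b[i] *= 2.0f;   // task B
}

void launchCombined(float *d_a, float *d_b, int n)
{
    dim3 grid((n + 255) / 256, 2);   // the second grid dimension selects the task
    kernelAB<<<grid, 256>>>(d_a, d_b, n);
    cudaDeviceSynchronize();
}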
Comparing the two code fragments, their efficiency turns out to be the same.
Streams in CUDA are represented by the cudaStream_t type. The relevant API functions are: cudaStreamCreate(cudaStream_t *s) to create a stream, cudaStreamDestroy(cudaStream_t s) to destroy one, cudaStreamSynchronize() to synchronize a single stream, cudaDeviceSynchronize() to synchronize all streams on the device, and cudaStreamQuery() to query whether the work in a stream has finished. See the API reference for details.
7. Use streams to overlap data copies with computation
Asynchronous Transfers and Overlapping Transfers with Computation
Data transfers between the host and the device using cudaMemcpy() are blocking transfers; that is, control is returned to the host thread only after the data transfer is complete. The cudaMemcpyAsync() function is a non-blocking variant of cudaMemcpy() in which control is returned immediately to the host thread. In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID. A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped, a property that can be used to hide data transfers between the host and the device.
Asynchronous transfers enable overlap of data transfers with computation in two different ways. On all CUDA-enabled devices, it is possible to overlap host computation with asynchronous data transfers and with device computations. For example, Overlapping computation and data transfers demonstrates how host computation in the routine cpuFunction() is performed while data is transferred to the device and a kernel using the device is executed.
Overlapping computation and data transfers
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();
The last argument to the cudaMemcpyAsync() function is the stream ID, which in this case uses the default stream, stream 0. The kernel also uses the default stream, and it will not begin execution until the memory copy completes; therefore, no explicit synchronization is needed. Because the memory copy and the kernel both return control to the host immediately, the host function cpuFunction() overlaps their execution.
In Overlapping computation and data transfers, the memory copy and kernel execution occur sequentially. On devices that are capable of "concurrent copy and execute," it is possible to overlap kernel execution on the device with data transfers between the host and the device. Whether a device has this capability is indicated by the deviceOverlap field of a cudaDeviceProp variable (or listed in the output of the deviceQuery SDK sample). On devices that have this capability, the overlap once again requires pinned host memory, and, in addition, the data transfer and kernel must use different, non-default streams (streams with non-zero stream IDs). Non-default streams are required for this overlap because memory copy, memory set functions, and kernel calls that use the default stream begin only after all preceding calls on the device (in any stream) have completed, and no operation on the device (in any stream) commences until they are finished.
Concurrent copy and execute illustrates the basic technique.
Concurrent copy and execute
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
In this code, two streams are created and used in the data transfer and kernel executions as specified in the last arguments of the cudaMemcpyAsync call and the kernel's execution configuration.
Concurrent copy and execute demonstrates how to overlap kernel execution with asynchronous data transfer. This technique could be used when the data dependency is such that the data can be broken into chunks and transferred in multiple stages, launching multiple kernels to operate on each chunk as it arrives. Sequential copy and execute and Staged concurrent copy and execute demonstrate this. They produce equivalent results. The first segment shows the reference sequential implementation, which transfers and operates on an array of N floats (where N is assumed to be evenly divisible by nThreads).
Sequential copy and execute
cudaMemcpy(a_d, a_h, N*sizeof(float), dir);
kernel<<<N/nThreads, nThreads>>>(a_d);
Staged concurrent copy and execute shows how the transfer and kernel execution can be broken up into nStreams stages. This approach permits some overlapping of the data transfer and execution.
Staged concurrent copy and execute
size = N*sizeof(float)/nStreams;
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}
(In Staged concurrent copy and execute, it is assumed that N is evenly divisible by nThreads*nStreams.) Because execution within a stream occurs sequentially, none of the kernels will launch until the data transfers in their respective streams complete. Current hardware can simultaneously process an asynchronous data transfer and execute kernels. (It should be mentioned that it is not possible to overlap a blocking transfer with an asynchronous transfer, because the blocking transfer occurs in the default stream, and so it will not begin until all previous CUDA calls complete. It will not allow any other CUDA call to begin until it has completed.) A diagram depicting the timeline of execution for the two code segments is shown in Figure 1, and nStreams=4 for Staged concurrent copy and execute is shown in the bottom half.
Figure 1. Timeline Comparison for Sequential (top) and Concurrent (bottom) Copy and Kernel Execution
For this example, it is assumed that the data transfer and kernel execution times are comparable. In such cases, and when the execution time (tE) exceeds the transfer time (tT), a rough estimate for the overall time is tE + tT/nStreams for the staged version versus tE + tT for the sequential version. If the transfer time exceeds the execution time, a rough estimate for the overall time is tT + tE/nStreams.
8. Use pinned memory to optimize copies between host memory and device memory
One effective approach is to register CPU memory as page-locked (pinned) memory with cudaHostRegister().
Pinned Memory
Page-locked or pinned memory transfers attain the highest bandwidth between the host and the device. On PCIe ×16 Gen2 cards, for example, pinned memory can attain greater than 5 GBps transfer rates.
Pinned memory is allocated using the cudaMallocHost() or cudaHostAlloc() functions in the Runtime API. The bandwidthTest.cu program in the CUDA SDK shows how to use these functions as well as how to measure memory transfer performance.
Pinned memory should not be overused. Excessive use can reduce overall system performance because pinned memory is a scarce resource. How much is too much is difficult to tell in advance, so as with all optimizations, test the applications and the systems they run on for optimal performance parameters.
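A hedged sketch combining the two options mentioned above: pinning an existing buffer with cudaHostRegister(), as suggested earlier, or allocating pinned memory directly with cudaMallocHost(). The buffer size, stream, and fill step are assumptions for illustration.

void copyWithPinnedHostMemory(float *d_buf, size_t n, cudaStream_t stream)
{
    float *h_buf = (float *)malloc(n * sizeof(float));
    // ... fill h_buf ...

    cudaHostRegister(h_buf, n * sizeof(float), cudaHostRegisterDefault);  // pin the existing pages
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);                      // async copy requires pinned memory
    cudaStreamSynchronize(stream);

    cudaHostUnregister(h_buf);                                            // unpin before freeing
    free(h_buf);
}

// Alternatively, allocate pinned memory directly:
//   float *h_pinned;
//   cudaMallocHost((void **)&h_pinned, n * sizeof(float));
//   ... use it ...
//   cudaFreeHost(h_pinned);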