Image convolution is a well-known topic in computer graphics. There are many implementations available for download, and several examples within the CUDA SDK.
One of the main issues for convolution on the GPU is accessing the pixels around the active pixel being computed. The CUDA SDK contains an optimized example that tackles this issue by splitting the convolution into two separable filters, one for the horizontal and one for the vertical direction. You can find further explanation here and here .
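To make the separable approach concrete, here is a minimal sketch of the row pass (this is not the SDK code; the kernel radius, the constant-memory filter and the clamping policy are illustrative assumptions):

```cuda
// Sketch of the row pass of a separable convolution (illustrative, not the
// SDK implementation). KERNEL_RADIUS and d_kernel are assumed names.
#define KERNEL_RADIUS 8

__constant__ float d_kernel[2 * KERNEL_RADIUS + 1];

__global__ void convolutionRow(float *dst, const float *src, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
        int xk = min(max(x + k, 0), w - 1);   // clamp at the image borders
        sum += src[y * w + xk] * d_kernel[KERNEL_RADIUS + k];
    }
    dst[y * w + x] = sum;
}
// The column pass is the same with the roles of x and y swapped; running
// the two passes in sequence produces the full 2-D convolution.
```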
The implementations provided with the SDK are highly optimized, but they are hard to integrate into a real application. For example, it is not easy to change the kernel radius, because you have to recompile all the GPU code with different parameters.
I developed another CUDA implementation, used for radial blurring, which is very efficient since it makes heavy use of shared memory.
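The general idea behind a shared-memory convolution can be sketched as follows (this is a minimal illustration of the technique, not the exact code of my implementation; the tile size, radius and kernel names are assumptions): each block loads one tile of the image plus a surrounding apron into shared memory, so every pixel is fetched from global memory only once per block.

```cuda
// Minimal shared-memory convolution sketch (illustrative values).
#define TILE   16
#define RADIUS 4

__global__ void convolveShared(float *dst, const float *src,
                               const float *kern, int w, int h)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Each thread maps to one element of the tile + apron region.
    int x = blockIdx.x * TILE + threadIdx.x - RADIUS;
    int y = blockIdx.y * TILE + threadIdx.y - RADIUS;

    // Clamp so that apron threads outside the image read valid pixels.
    int cx = min(max(x, 0), w - 1);
    int cy = min(max(y, 0), h - 1);
    tile[threadIdx.y][threadIdx.x] = src[cy * w + cx];
    __syncthreads();

    // Only the inner TILE x TILE threads compute an output pixel.
    if (threadIdx.x >= RADIUS && threadIdx.x < TILE + RADIUS &&
        threadIdx.y >= RADIUS && threadIdx.y < TILE + RADIUS &&
        x < w && y < h) {
        float sum = 0.0f;
        for (int j = -RADIUS; j <= RADIUS; ++j)
            for (int i = -RADIUS; i <= RADIUS; ++i)
                sum += tile[threadIdx.y + j][threadIdx.x + i]
                     * kern[(j + RADIUS) * (2 * RADIUS + 1) + (i + RADIUS)];
        dst[y * w + x] = sum;
    }
}
```

The kernel is launched with `blockDim = (TILE + 2*RADIUS, TILE + 2*RADIUS)`, so the apron is loaded by the extra border threads and the inner threads reuse it for all their neighborhood reads.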
Simple benchmarks show that my implementation is competitive: for a 512×512 grayscale image, it takes less than 0.7 ms, while the separable convolution takes more than 0.4 ms.
Furthermore, shared memory is sized via a template parameter of the kernel function, in accordance with the kernel radius. This is a useful trick, because it makes it possible to change the kernel radius easily without recompiling the kernel.
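The trick can be sketched like this (the names below are illustrative, not the ones used in the repository): the radius is a template parameter, so the shared-memory array gets a compile-time size, and the host instantiates the kernel once per supported radius and picks one at run time.

```cuda
// Sketch: shared memory sized by a template parameter (illustrative names).
template <int RADIUS>
__global__ void convolveTpl(float *dst, const float *src,
                            const float *kern, int w, int h)
{
    // Compile-time size derived from the template parameter.
    __shared__ float sk[(2 * RADIUS + 1) * (2 * RADIUS + 1)];
    int n = (2 * RADIUS + 1) * (2 * RADIUS + 1);
    for (int i = threadIdx.y * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * blockDim.y)
        sk[i] = kern[i];   // cache the filter weights once per block
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int j = -RADIUS; j <= RADIUS; ++j)
        for (int i = -RADIUS; i <= RADIUS; ++i) {
            int cx = min(max(x + i, 0), w - 1);
            int cy = min(max(y + j, 0), h - 1);
            sum += src[cy * w + cx]
                 * sk[(j + RADIUS) * (2 * RADIUS + 1) + (i + RADIUS)];
        }
    dst[y * w + x] = sum;
}

// Host side: one instantiation per supported radius, selected at run time,
// so the application never touches the GPU sources to change the radius.
void launch(float *dst, const float *src, const float *kern,
            int w, int h, int radius, dim3 grid, dim3 block)
{
    switch (radius) {
    case 1: convolveTpl<1><<<grid, block>>>(dst, src, kern, w, h); break;
    case 2: convolveTpl<2><<<grid, block>>>(dst, src, kern, w, h); break;
    case 4: convolveTpl<4><<<grid, block>>>(dst, src, kern, w, h); break;
    // ... one case per radius the application needs ...
    }
}
```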
This implementation works for single-channel images, but you can easily extend it to multi-channel images. It is very simple and easy to integrate into an existing project.
I developed two versions of it: in the first, the image is loaded from global memory; in the second, the image is accessed through a texture object.
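For the texture-object version, the host-side setup looks roughly like this (a sketch based on the standard CUDA runtime API, not on the exact code of the repository; the helper name is hypothetical):

```cuda
// Sketch: wrapping a pitched device image in a texture object.
cudaTextureObject_t makeTexture(float *d_img, int w, int h, size_t pitch)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = d_img;
    res.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width = w;
    res.res.pitch2D.height = h;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;  // border clamping for free
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, NULL);
    return obj;
}
// Inside the kernel, a pixel is then read with
//   float v = tex2D<float>(obj, x + 0.5f, y + 0.5f);
// and the clamp address mode handles the image borders automatically.
```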
Sources and wiki are available on GitHub.
Feel free to download and improve 🙂