GPU Experiments: Image Convolution

One of the most commonly performed image post-processing effect is the image convolution. A number of tricks are employed to make convolutions more efficient on the GPU, such as using separable convolutions, upscaling a smaller image to fake a blur convolution, etc.

The problem with using the pixel shader to perform convolutions is the redundant texture fetching. Imagine the convolution window being slid to the right by one pixel: each time, there is a large overlap in texture fetches. Ideally, we should be able to fetch the information from the texture once, and store it into a cache. This is where compute shaders come in.

Compute shaders allow access to "groupshared" memory: in other words, memory that is shared amongst all of the threads in a group. Essentially what we can do is fill up a group's shared memory with a chunk of the texture, synchronize the threads, and then continue with the convolution. Only this time, we reference the shared memory instead of the texture.

In a future post, I will provide a more complete example. But for now, I will outline the two methods:

Method A: Pixel shader


Texture2D<float> img;

float result = 0.0f;

int w2 = (w - 1) / 2;
int h2 = (h - 1) / 2;

for (int j = -h2; j <= h2; j++)
{
    for (int i = -w2; i <= w2; i++)
    {
        result += img[int2(x + i, y + j)] * kernel[w * (j + h2) + (i + w2)];
    }
}

return result;

Above, x and y represent the position of the pixel being processed, while w and h are the width and height of the convolution kernel.

Method B: Compute shader


Texture2D<float> img;
RWTexture2D<float> outimg;

groupshared float smem[(BLOCKDIM + 2) * (BLOCKDIM + 2)];

// Read texture data into smem for this group

// Synchronize the threads
GroupMemoryBarrierWithGroupSync();

float result = 0.0f;

for (int j = 0; j < h; j++)
{
    for (int i = 0; i < w; i++)
    {
        result += smem[offset + (BLOCKDIM + 2) * j + i] * kernel[w * j + i];
    }
}

outimg[int2(x, y)] = result;

Here, BLOCKDIM is the width (and height) of threads in a group, and offset is an offset into shared memory, which is a function of the thread ID within a group.

The compute shader method substantially reduces the number of redundant fetches necessary compared to the pixel shader method, especially when using an inseparable kernel.

GPU Experiments

Saturday, February 21, 2009

Image Convolution

No comments:

Post a Comment

Followers

Blog Archive

Contributors