A blog which discusses various GPU applications including visualization, GPGPU and games.

Saturday, February 21, 2009

Image Convolution

One of the most commonly performed image post-processing effect is the image convolution. A number of tricks are employed to make convolutions more efficient on the GPU, such as using separable convolutions, upscaling a smaller image to fake a blur convolution, etc.

The problem with using the pixel shader to perform convolutions is the redundant texture fetching. Imagine the convolution window being slid to the right by one pixel: each time, there is a large overlap in texture fetches. Ideally, we should be able to fetch the information from the texture once, and store it into a cache. This is where compute shaders come in.

Compute shaders allow access to "groupshared" memory: in other words, memory that is shared amongst all of the threads in a group. Essentially what we can do is fill up a group's shared memory with a chunk of the texture, synchronize the threads, and then continue with the convolution. Only this time, we reference the shared memory instead of the texture.

In a future post, I will provide a more complete example. But for now, I will outline the two methods:

Method A: Pixel shader

Texture2D<float> img;

float result = 0.0f;

int w2 = (w - 1) / 2;
int h2 = (h - 1) / 2;

for (int j = -h2; j <= h2; j++)
{
for (int i = -w2; i <= w2; i++)
{
result += img[int2(x + i, y + j)] * kernel[w * (j + h2) + (i + w2)];
}
}

return result;

Above, x and y represent the position of the pixel being processed, while w and h are the width and height of the convolution kernel.

Method B: Compute shader

Texture2D<float> img;
RWTexture2D<float> outimg;

groupshared float smem[(BLOCKDIM + 2) * (BLOCKDIM + 2)];

// Read texture data into smem for this group

// Synchronize the threads
GroupMemoryBarrierWithGroupSync();

float result = 0.0f;

for (int j = 0; j < h; j++)
{
for (int i = 0; i < w; i++)
{
result += smem[offset + (BLOCKDIM + 2) * j + i] * kernel[w * j + i];
}
}

outimg[int2(x, y)] = result;

Here, BLOCKDIM is the width (and height) of threads in a group, and offset is an offset into shared memory, which is a function of the thread ID within a group.

The compute shader method substantially reduces the number of redundant fetches necessary compared to the pixel shader method, especially when using an inseparable kernel.

No comments:

Post a Comment

Followers