Sunday, July 19, 2009

Optimizing memory access in CUDA

Now that we understand how to use textures in CUDA, I think it is time to tackle a simple problem and see how we can optimize some code.

Here is a description of the diffraction problem: there is an NxN input image and an NxN output image (N is typically 256). Each pixel in the output image receives NxN rays, one from each of the input pixels. Each ray deposits the color of its source pixel, modulated by a function (we will call this the transformer function). The transformer is a function of source position (x, y) and direction (theta, phi) of propagation.

The naive implementation is as follows:
1. Put the input image into global memory.
2. Create NxN threads, such that each of these threads operates on one of the output pixels.
3. Each of these threads runs a for loop over all the input pixels: for each ray hitting its output pixel, it computes the contribution and adds it to the output pixel's intensity (see the sketch below).
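A minimal sketch of what this naive kernel might look like (transformer and the (theta, phi) computation are placeholders for the actual diffraction math, not the code from this project):

__device__ float transformer(int sx, int sy, float theta, float phi)
{
    // Placeholder modulation; the real diffraction term goes here.
    return cosf(theta) * cosf(phi);
}

// Naive kernel: one thread per output pixel, looping over every
// input pixel. input and output are NxN buffers in global memory.
__global__ void diffractNaive(const float* input, float* output, int N)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= N || oy >= N) return;

    for (int sy = 0; sy < N; ++sy)
        for (int sx = 0; sx < N; ++sx) {
            // Direction from source (sx,sy) to output (ox,oy); the real
            // code would derive (theta,phi) from the optical geometry.
            float theta = atan2f((float)(oy - sy), (float)(ox - sx));
            float phi   = 0.0f;  // placeholder
            // Every ray reads input and writes output in global memory.
            output[oy * N + ox] += input[sy * N + sx]
                                 * transformer(sx, sy, theta, phi);
        }
}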

This approach does not scale with image size. For a 512x512 image, each thread's for loop runs over all 512x512 input pixels, which makes for an extremely long-running kernel... So we will have to think of a more scalable solution:

1. Break the input image into buckets of bxb size, where b is around 8 or 16.
2. Now we have multiple kernel invocations, one per bucket. During each invocation, the threads consider rays only from that small bucket of the input data (sketched below).
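Host-side, the bucketed version might be driven like this; diffractBucket is a hypothetical kernel with the same body as the naive one above, but with its loops restricted to a single bucket:

// Same as the naive kernel, but only sources in the b x b bucket
// starting at (bx0, by0) are considered (transformer as sketched above).
__global__ void diffractBucket(const float* input, float* output,
                               int N, int bx0, int by0, int b)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= N || oy >= N) return;

    for (int sy = by0; sy < by0 + b; ++sy)
        for (int sx = bx0; sx < bx0 + b; ++sx) {
            float theta = atan2f((float)(oy - sy), (float)(ox - sx));
            output[oy * N + ox] += input[sy * N + sx]
                                 * transformer(sx, sy, theta, 0.0f);
        }
}

// One kernel invocation per bucket; each invocation adds that
// bucket's contribution to every output pixel.
void diffractAll(const float* d_input, float* d_output, int N, int b)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    for (int by0 = 0; by0 < N; by0 += b)
        for (int bx0 = 0; bx0 < N; bx0 += b)
            diffractBucket<<<grid, block>>>(d_input, d_output,
                                            N, bx0, by0, b);
}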

Another small complication we introduce: to prevent aliasing, we evaluate the intensity at input pixels with small random offsets. To facilitate this, we read in an NxN random buffer, which also lives in global memory in the naive implementation.
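For illustration, the per-ray lookup might look something like this (jitter is a hypothetical name for the random buffer, stored as NxN float2 offsets):

// Hypothetical helper: jittered sample position for source pixel
// (sx, sy), read from an NxN float2 buffer in global memory.
__device__ float2 jitteredSource(const float2* jitter, int N,
                                 int sx, int sy)
{
    float2 j = jitter[sy * N + sx];   // one global-memory read per ray
    return make_float2((float)sx + j.x, (float)sy + j.y);
}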

Naive Implementation, with all global memory reads and writes
Time Taken: 2m11.074s

We try to improve things by using textures for the random array:
Time Taken: 1m47.158s

Then, we use textures even for the input image:
Time Taken: 1m34.969s
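With the texture reference API of that era, both buffers can be bound to 2D textures so the reads go through the texture cache. A rough sketch, with my own names, and binding details that may differ from the actual code:

#include <cuda_runtime.h>

texture<float,  2, cudaReadModeElementType> texInput;   // input image
texture<float2, 2, cudaReadModeElementType> texJitter;  // random offsets

// Host side: copy each buffer into a cudaArray and bind it.
void bindTextures(const float* h_input, const float2* h_jitter, int N)
{
    cudaArray *arrInput, *arrJitter;
    cudaChannelFormatDesc d1 = cudaCreateChannelDesc<float>();
    cudaChannelFormatDesc d2 = cudaCreateChannelDesc<float2>();
    cudaMallocArray(&arrInput,  &d1, N, N);
    cudaMallocArray(&arrJitter, &d2, N, N);
    cudaMemcpyToArray(arrInput,  0, 0, h_input,
                      N * N * sizeof(float),  cudaMemcpyHostToDevice);
    cudaMemcpyToArray(arrJitter, 0, 0, h_jitter,
                      N * N * sizeof(float2), cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texInput,  arrInput);
    cudaBindTextureToArray(texJitter, arrJitter);
}

// In the kernel, the global loads become cached texture fetches:
//   float  src = tex2D(texInput,  sx + 0.5f, sy + 0.5f);
//   float2 jit = tex2D(texJitter, sx + 0.5f, sy + 0.5f);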

Next, we notice that each thread does bxb computations, and thus writes to global memory bxb times. By accumulating the value in a temporary local register, and writing that value only once to global memory, we can save a lot of time:
Time Taken: 0m16.530s
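Concretely, the inner loops now accumulate into a register and the kernel commits the sum once at the end. A sketch, reusing the texInput reference from above (modulation omitted for brevity):

__global__ void diffractBucketAccum(float* output, int N,
                                    int bx0, int by0, int b)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= N || oy >= N) return;

    float acc = 0.0f;  // accumulator lives in a register
    for (int sy = by0; sy < by0 + b; ++sy)
        for (int sx = bx0; sx < bx0 + b; ++sx)
            acc += tex2D(texInput, sx + 0.5f, sy + 0.5f);

    // One global-memory write per thread instead of b*b of them.
    output[oy * N + ox] += acc;
}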

The time shown is for 16x16 buckets. For other bucket sizes:

8x8: 0m17.117s
32x32: 0m16.592s

Note that the mechanism of using square buckets also ensures locality of reference and therefore makes better use of the texture caches. To really get a sense of the improvement from the original code: a 512x512 evaluation used to take around 20 minutes and now takes 4m15s.

If you remember from my earlier post about diffraction, I wrote that a GLSL shader implementation evaluates a 256x256 image in around 5 seconds, so we are still about 3x slower than GLSL. The reason for this is that the GLSL code uses vectorization for the RGB components. Using vectorization in CUDA should take advantage of the SIMD units and hopefully give us the remaining 3x performance boost.
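As a hedged sketch of where this is headed, here is what a vectorized inner loop might look like, using CUDA's built-in float4 type so a single texture fetch carries all three color channels (float3 textures are not supported, so the fourth component is padding):

texture<float4, 2, cudaReadModeElementType> texInputRGBA;

__global__ void diffractBucketVec(float4* output, int N,
                                  int bx0, int by0, int b)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= N || oy >= N) return;

    float4 acc = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    for (int sy = by0; sy < by0 + b; ++sy)
        for (int sx = bx0; sx < bx0 + b; ++sx) {
            float4 src = tex2D(texInputRGBA, sx + 0.5f, sy + 0.5f);
            float  m   = 1.0f;  // placeholder for the transformer term
            acc.x += src.x * m;   // all three channels handled
            acc.y += src.y * m;   // per ray instead of running
            acc.z += src.z * m;   // the kernel once per channel
        }
    float4 out = output[oy * N + ox];
    out.x += acc.x;  out.y += acc.y;  out.z += acc.z;
    output[oy * N + ox] = out;
}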
