Sunday, June 28, 2009

GLSL Diffraction Works!

Finally, the GLSL implementation of the diffraction project is working. Here is a sample image rendered using OpenGL:


The speedup is enormous. Let's see how far we have come in terms of efficiency:
Implementation             Number of rays    Time in seconds
Matlab                     2^24              700
CUDA Scatter               2^32              30
CUDA Gather                2^32              10
CUDA Gather + randomize    2^32              30
GLSL gather                2^32              1.5


The reason for such a speedup is mostly the texture lookups performed in the shader. The CUDA application does the same reads through global memory, which is a lot slower than texture lookups (textures go through the texture caches present in the hardware).
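To make the comparison concrete, here is a rough CUDA sketch of the two read paths; gatherGlobal, gatherTexture, texField and friends are placeholder names, and the loop body is reduced to a bare accumulation instead of the actual diffraction math:

    #include <cuda_runtime.h>

    // Legacy texture reference, bound to the same linear buffer as d_field.
    texture<float2, 1, cudaReadModeElementType> texField;

    // Read path A: every access to the source field is a global memory read.
    __global__ void gatherGlobal(const float2 *d_field, float2 *d_out,
                                 int nPixels, int nSources)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nPixels) return;
        float2 acc = make_float2(0.0f, 0.0f);
        for (int j = 0; j < nSources; ++j) {
            float2 s = d_field[j];               // uncached read from DRAM
            acc.x += s.x;  acc.y += s.y;
        }
        d_out[i] = acc;
    }

    // Read path B: the same loop, but fetching through the texture cache.
    __global__ void gatherTexture(float2 *d_out, int nPixels, int nSources)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nPixels) return;
        float2 acc = make_float2(0.0f, 0.0f);
        for (int j = 0; j < nSources; ++j) {
            float2 s = tex1Dfetch(texField, j);  // goes through the texture cache
            acc.x += s.x;  acc.y += s.y;
        }
        d_out[i] = acc;
    }

    // Host side: bind the device buffer to the texture reference once.
    void bindField(const float2 *d_field, int nSources)
    {
        cudaBindTexture(0, texField, d_field, nSources * sizeof(float2));
    }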

An interesting issue that pops up here is that each shader kernel should complete within a very short time (around 5 to 10 seconds). I have not found any reference to this online, but I have seen it mess up the results for 2^36 rays. It seems similar to the CUDA limit of around 5 to 10 seconds per kernel execution. One solution is to come up with shorter kernels; another is to run the application on a GPU that is not connected to an X server. Since running GLSL on such an (offline) GPU is out of the question, I guess I am back to smaller kernels, or to implementing the whole thing in CUDA using texture memory.
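For the "shorter kernels" route, the idea would be something like the following sketch (diffractChunk and the chunk size are stand-ins, not the real code):

    #include <cuda_runtime.h>

    __global__ void diffractChunk(float2 *d_out, long long first, long long count)
    {
        // Stand-in for the real diffraction kernel: each thread would stride
        // over the rays in [first, first + count) and accumulate into d_out.
    }

    void runInChunks(float2 *d_out)
    {
        const long long totalRays  = 1LL << 32;
        const long long raysPerRun = 1LL << 26;   // pick so one launch stays well under the watchdog

        for (long long first = 0; first < totalRays; first += raysPerRun) {
            long long count = totalRays - first;
            if (count > raysPerRun) count = raysPerRun;

            diffractChunk<<<4096, 256>>>(d_out, first, count);
            cudaThreadSynchronize();   // let this chunk finish before queuing the next one
        }
    }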

Note: For people who want to try out large kernels in GLSL/CUDA, a warning: running such kernels will make your system unresponsive for the entire duration, and may well cause display corruption. I have lost count of the system restarts I have had to do in such cases... You will know that your kernel is taking too much time if you get garbled results like this one here:

Friday, June 26, 2009

Today's agenda

1. Read up on textures in CUDA, and implement them to see how much faster the diffraction kernel becomes.

2. Try to get the shader version working. Some bugs in the code right now cause the output to be a black screen. The processing is happening though... because the computer hangs for around 5 seconds while the shader executes...

For this, I will be using the "Adaptive tessellation for displacement mapping" source that I wrote with Rohit as the code base.

3. Purcell's PhD thesis is still elusive :)

Thursday, June 25, 2009

Updated Diffraction results

My attempts at random sampling have been *partially* successful.

Here is the original Mona Lisa Image:


This is the image after diffraction through a cloth with a 100-micron weave. The image was generated using uniform sampling (as shown in the preliminary paper submitted to SIGGRAPH Asia):


This is the result of using random sampling to randomize ray source positions. The same random numbers are used for each row of the image, which gives it a striped look:


Finally, using better quality random numbers (random throughout the image, not just across a row):


Performance tradeoffs:

The random sampling in CUDA has been implemented as a lookup into a 1D array of random numbers passed by the CPU to the GPU. This means two additional lookups from a global memory array, which drastically increases the execution time, from around 33 seconds (no randomization) to 90 seconds (with the two random number lookups). This heavy performance hit can be attributed (IMO) to slow global memory reads. Hopefully 1D texture lookups will mitigate this.
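For concreteness, this is roughly what the current scheme looks like; the names (d_rand, jitterSources, etc.) and the indexing are illustrative, not the actual project code:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // d_rand holds nrand uniform random floats in [0, 1), generated on the CPU
    // and copied to the GPU once. Each thread does two lookups (one to jitter
    // the source position in x, one in y); both currently go through global
    // memory, which is where the 33 s -> 90 s slowdown comes from.
    __global__ void jitterSources(const float *d_rand, int nrand,
                                  float *d_srcX, float *d_srcY,
                                  float cellSize, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = y * width + x;
        float rx = d_rand[(2 * idx)     % nrand];   // global memory read #1
        float ry = d_rand[(2 * idx + 1) % nrand];   // global memory read #2

        // Jitter the otherwise regular sample position inside its cell.
        d_srcX[idx] = (x + rx) * cellSize;
        d_srcY[idx] = (y + ry) * cellSize;
    }

    // Host side: fill the random array on the CPU and hand it to the GPU.
    void uploadRandomNumbers(float **d_rand, int nrand)
    {
        float *h_rand = (float *)malloc(nrand * sizeof(float));
        for (int i = 0; i < nrand; ++i)
            h_rand[i] = rand() / (float)RAND_MAX;

        cudaMalloc((void **)d_rand, nrand * sizeof(float));
        cudaMemcpy(*d_rand, h_rand, nrand * sizeof(float), cudaMemcpyHostToDevice);
        free(h_rand);
    }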

The next step is to add bilinear texture sampling at these random source points. Currently, I use no filtering for the texture lookup. This means learning how to use textures in CUDA. Another avenue I am exploring is writing a GLSL shader to do the same. Preliminary results are very promising (around 5 seconds for the entire process, as compared to 33 seconds on CUDA).
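From my reading of the CUDA programming guide, the texture version should look roughly like this (an untested sketch; texMona, bindImage and the float4 image layout are assumptions on my part):

    #include <cuda_runtime.h>

    // 2D texture reference with hardware bilinear filtering (the old texture
    // reference API). texMona would hold the source (Mona Lisa) image.
    texture<float4, 2, cudaReadModeElementType> texMona;

    // d_srcX / d_srcY hold the jittered sample positions, in texel units.
    __global__ void sampleJittered(float4 *d_out, const float *d_srcX,
                                   const float *d_srcY, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = y * width + x;
        // With filterMode == cudaFilterModeLinear, tex2D does the bilinear
        // interpolation in hardware; the 0.5f offsets centre the lookup on the texel.
        d_out[idx] = tex2D(texMona, d_srcX[idx] + 0.5f, d_srcY[idx] + 0.5f);
    }

    // Host side: copy the image into a cudaArray and bind it to the texture.
    void bindImage(const float4 *h_img, int width, int height)
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
        cudaArray *cuArray;
        cudaMallocArray(&cuArray, &desc, width, height);
        cudaMemcpyToArray(cuArray, 0, 0, h_img,
                          width * height * sizeof(float4),
                          cudaMemcpyHostToDevice);

        texMona.filterMode     = cudaFilterModeLinear;   // bilinear filtering
        texMona.addressMode[0] = cudaAddressModeClamp;
        texMona.addressMode[1] = cudaAddressModeClamp;
        cudaBindTextureToArray(texMona, cuArray, desc);
    }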

cudaThreadSynchronize

This function is important for anyone who launches a kernel many times (for example, from a for loop). A CUDA kernel launch is asynchronous and returns immediately, so your CPU-side for loop will finish in an instant, queuing all the launches without waiting for any of them to complete.

Calling cudaThreadSynchronize() will make the CPU wait till all previously launched kernels terminate.
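A minimal example (dummyKernel and the launch configuration are just placeholders):

    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *d_data)
    {
        // Stand-in for the real kernel.
    }

    int main()
    {
        float *d_data;
        cudaMalloc((void **)&d_data, 1024 * sizeof(float));

        for (int pass = 0; pass < 16; ++pass) {
            dummyKernel<<<64, 256>>>(d_data);
            // Without this call the loop only queues the launches and the CPU
            // moves on immediately; with it, the CPU waits for each pass to finish.
            cudaThreadSynchronize();
        }

        cudaFree(d_data);
        return 0;
    }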

Day 1

Wishlist:

1. Fix the CUDA code for Augmented photon mapping (Diffraction code).
Currently the code performs uniform, regular sampling of the diffracting surface. Adding some randomness should improve things on the aliasing front.

2. Port the code to C++ (the last CPU test was in Matlab, which doesn't really count towards efficient processing). Learn how to use OpenMP and go all out on the dual-socket Opteron.

3. Read Tim Purcell's PhD thesis, Ray Tracing on Stream Processors. This is a slightly longer-term plan (i.e., over 3 to 4 days).

Rays of hope

Congratulations! You have just found my short-term To-Do list!

(Oh, you were looking for a blog? Sorry, but this is the bitter truth. No blog here.)

My productivity over the last few weeks has been appalling at best. So, before I spend the rest of my life clueless as to what is happening to me, I will start blogging about my day... well, at least the work part of it ;)

...which brings me to my work. For the next year or so, I will be (trying) to develop a realtime raytracer for point models. And to make things spicier, let's throw a few GPUs (graphics cards) into the soup.

So, in short, you will see my tiny experiments on anything vaguely relevant to raytracing, parallel processing, GPU programming, and computer graphics in general.