
The speedup is enormous. Let's see how far we have come in terms of efficiency:
| Implementation | Number of rays | Time (s) |
|---|---|---|
| Matlab | 2^24 | 700 |
| CUDA Scatter | 2^32 | 30 |
| CUDA Gather | 2^32 | 10 |
| CUDA Gather+randomize | 2^32 | 30 |
| GLSL gather | 2^32 | 1.5 |
The speedup comes mostly from the large number of texture lookups performed in the shader. The CUDA application reads the same data from global memory, which is a lot slower than texture lookups (because textures go through the texture caches present in hardware).
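Since the Matlab run used 2^24 rays and the others used 2^32, the raw times are not directly comparable. The sketch below puts them on a common footing by converting each row of the table to rays per second. It assumes that run time scales linearly with the number of rays, which the measurements above do not confirm; the names `TIMINGS`, `rays_per_second` and `speedup` are made up for this illustration.

```python
# Timings copied from the table above: (number of rays, time in seconds).
TIMINGS = {
    "Matlab": (2**24, 700.0),
    "CUDA Scatter": (2**32, 30.0),
    "CUDA Gather": (2**32, 10.0),
    "CUDA Gather+randomize": (2**32, 30.0),
    "GLSL gather": (2**32, 1.5),
}


def rays_per_second(name):
    """Throughput of one implementation, in rays per second."""
    rays, seconds = TIMINGS[name]
    return rays / seconds


def speedup(fast, slow):
    """Throughput ratio of `fast` over `slow`.

    Assumption: time scales linearly with the ray count, so runs with
    different ray counts can be compared through their throughput.
    """
    return rays_per_second(fast) / rays_per_second(slow)


if __name__ == "__main__":
    for name in TIMINGS:
        print(f"{name:24s} {rays_per_second(name):.3e} rays/s")
    print(f"GLSL gather vs Matlab:      {speedup('GLSL gather', 'Matlab'):.0f}x")
    print(f"GLSL gather vs CUDA Gather: {speedup('GLSL gather', 'CUDA Gather'):.1f}x")
```

Under that linear-scaling assumption, the GLSL version comes out roughly 1.2 × 10^5 times faster than Matlab per ray, and about 6.7 times faster than the CUDA gather version.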
An interesting issue that pops up here is that each shader kernel apparently has to complete within a short time (around 5 to 10 seconds). I have not found any reference to this online, but I have seen it mess with the results for 2^36 rays. It seems similar to the CUDA limit of around 5 to 10 seconds per kernel execution. One solution is to write shorter kernels; another is to run the application on a GPU that is not connected to an X server. Since running GLSL on such an (offline) GPU is out of the question, I guess I am back to smaller kernels, or to implementing the whole thing in CUDA using texture memory.
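The "shorter kernels" option can be planned on the host side before touching any shader code. The following is only a sketch: it assumes the GLSL throughput measured above (2^32 rays in 1.5 seconds) holds for every batch, and it picks a time budget of 2 seconds per launch simply to stay well under the 5 to 10 second limit observed here. The function `plan_batches` and its parameters are hypothetical, not part of any GLSL or CUDA API.

```python
def plan_batches(total_rays, rays_per_second, budget_seconds):
    """Split `total_rays` into near-equal batches, one per kernel launch.

    Each batch is sized so that, at the assumed throughput
    `rays_per_second`, it should finish within `budget_seconds`.
    Both numbers are estimates supplied by the caller.
    """
    if total_rays <= 0:
        return []
    # Largest batch that is expected to fit in the time budget.
    max_batch = int(rays_per_second * budget_seconds)
    if max_batch < 1:
        raise ValueError("time budget too small for even one ray")
    # Ceiling division: number of launches needed.
    n_batches = -(-total_rays // max_batch)
    # Spread the rays evenly; the first `extra` batches get one more ray.
    base, extra = divmod(total_rays, n_batches)
    return [base + 1 if i < extra else base for i in range(n_batches)]


if __name__ == "__main__":
    glsl_rate = 2**32 / 1.5          # rays/s, from the timing table (assumed constant)
    batches = plan_batches(2**36, glsl_rate, budget_seconds=2.0)
    print(len(batches), "launches, largest batch:", max(batches), "rays")
```

Each entry in the returned list would correspond to one kernel launch or shader pass, with the partial results accumulated between launches. Spreading the rays evenly avoids ending with one tiny leftover batch.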
Note: For people who want to try out large kernels in GLSL/CUDA, a warning: running such kernels will make your system unresponsive for the entire duration, and may also cause display corruption. I have lost count of the system restarts I have had to do in such cases... You will know that your kernel is taking too much time if you get garbled results like this one:
 
 




 