Sunday, January 3, 2010

CUDA multi-GPU textures

Those of you who have used CUDA textures will be familiar with the insane programming interface they present. Texture references can be declared only as global variables, and only in the file where you use them. You cannot pass them to functions by value or by reference, you cannot declare them inside function or class scope, and you cannot make arrays of them.
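To make the restrictions concrete, here is roughly what you can and cannot write with the texture reference API (myTex is just an example name):

    // Legal: a texture reference, declared at file (global) scope.
    texture<float, 1, cudaReadModeElementType> myTex;

    // None of the following compile:
    // texture<float, 1> texArray[4];               // no arrays of texture references
    // void f(texture<float, 1> t);                 // cannot be a function parameter
    // struct Wrapper { texture<float, 1> t; };     // cannot be a class/struct member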

This brings us to an interesting point. Multi-GPU programming in CUDA is handled by spawning off many CPU threads, where each CPU thread initializes its own GPU. In each thread you create thread-specific variables that store pointers to device memory, and you allocate device memory in EACH thread. That's right: since memory is not shared between GPUs, doing the same thing on many GPUs means allocating the same stuff again and again, once on each GPU (of course, if you are doing different things on different GPUs, that's a different ball game).
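A minimal sketch of that pattern (gpuWorker and N are my names, not from the actual sample, and the real code wraps this in helper classes, but the structure is the same):

    #include <boost/thread.hpp>
    #include <boost/bind.hpp>
    #include <cuda_runtime.h>

    const int N = 1 << 20;  // elements per GPU; the size is just an example

    // Each CPU thread drives one GPU: select a device, then allocate that
    // device's own copy of the data. The pointer is a thread-local variable,
    // so each thread only ever sees its own GPU's allocation.
    void gpuWorker(int device)
    {
        cudaSetDevice(device);               // bind this CPU thread to one GPU

        float* d_data = 0;                   // points into THIS GPU's memory
        cudaMalloc((void**)&d_data, N * sizeof(float));

        // ... upload inputs, launch kernels, copy results back ...

        cudaFree(d_data);
    }

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        boost::thread_group workers;         // one CPU thread per GPU
        for (int i = 0; i < deviceCount; ++i)
            workers.create_thread(boost::bind(gpuWorker, i));
        workers.join_all();
        return 0;
    }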

Now comes the fun part. Since you cannot create arrays of texture references, and you cannot encapsulate them inside a class or structure, how on earth can you create one texture reference for each GPU? Should we hard-code this? Should we replicate our code for each GPU we have, changing only the name of the texture reference? The answer is NO!!

Well, it turns out that nVidia has sneakily done something here. When we spawn off multiple CPU threads and select a different GPU in each thread, CUDA "automagically" creates a separate copy of the texture reference for each GPU: each CPU thread gets its own CUDA context, and the texture reference is resolved per context. So all you have to do is bind the SAME texture reference again and again, once in each CPU thread (i.e., once for each GPU). It may look weird, because we appear to be initializing the same(?) global variable several times, but it actually works.
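Concretely, the per-thread work ends up looking something like this, extending the worker sketch above (squareKernel and the launch configuration are my placeholders, error checking is omitted, and this mirrors the squaring sample described below):

    #include <cuda_runtime.h>

    // One file-scope texture reference, shared (in source) by all CPU threads.
    texture<float, 1, cudaReadModeElementType> inputTex;

    __global__ void squareKernel(float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float x = tex1Dfetch(inputTex, i);   // read input through the texture
            out[i] = x * x;
        }
    }

    // Runs once per CPU thread, i.e. once per GPU.
    void gpuWorker(int device, const float* h_in, float* h_out, int n)
    {
        cudaSetDevice(device);

        float *d_in = 0, *d_out = 0;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // The SAME global texture reference is bound once in each CPU thread;
        // the bind applies to the context of the device selected above.
        cudaBindTexture(0, inputTex, d_in, n * sizeof(float));

        squareKernel<<<(n + 255) / 256, 256>>>(d_out, n);

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaUnbindTexture(inputTex);
        cudaFree(d_in);
        cudaFree(d_out);
    }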

I have uploaded a small code sample here that demonstrates this technique. It loads a bunch of numbers onto the GPUs and squares them. The sample uses some helper classes I wrote to simplify CUDA development; this is still work in progress and may not be as polished as a proper framework/library should be. You will also need the Boost C++ libraries (the CPU threading and synchronization code uses Boost).

Saturday, January 2, 2010

And we are back...

It has been far too long since my last post here. I guess regular blogging and regular exercise are equally hard :)

But I have not gone into torpor for the last few months. We ported our raytracer to CUDA this October, and development on it is still going on. Visit my website for updates on that.

What still remains is a lot of code cleanup, optimizations and a few more features to add. Without wasting too much space, here is a brief overview of everything:

Raytracing with reflection, refraction, Phong shading and shadows.
Scenes are point models (points are represented as oriented disks/splats).
An octree is constructed on the CPU and uploaded as a linear texture onto the GPU (see the sketch after this list).
Each GPU thread handles a single eye ray (along with all its bounces and shadow rays).
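To illustrate the linear-texture idea, here is a hedged sketch; the node layout (one int4 per node) is hypothetical and not our actual format:

    // Hypothetical node layout: one int4 per octree node (e.g. child offset,
    // child mask, point range), packed so a node is a single texture fetch.
    texture<int4, 1, cudaReadModeElementType> octreeTex;

    __device__ int4 fetchNode(int nodeIndex)
    {
        // Linear (device-memory) textures are fetched by integer index;
        // the texture cache makes the repeated reads of traversal cheap.
        return tex1Dfetch(octreeTex, nodeIndex);
    }

    // Host side, once per GPU thread: d_nodes is a cudaMalloc'd int4 array
    // holding the flattened octree, with numNodes entries.
    //   cudaBindTexture(0, octreeTex, d_nodes, numNodes * sizeof(int4));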

Performance:
On scenes of about 1-4 million points, preprocessing takes around 1-2 minutes.
Raytracing at 512x512 with primary rays + Phong shading only: ~100-150 fps
Raytracing at 512x512 with an average of 5 bounces and shadows from one light source: ~20 fps
Raytracing at 1024x1024 with 4 shadow-casting light sources and an average of 3 bounces: ~3-4 fps

All numbers are reported for a GTX 275.