Monday, July 6, 2009

CUDA Warps and Branching

Hi, it's been a long time since I wrote something here. Over the past week I haven't really been doing much: I read a few papers and tutorials on raytracing and GPU raytracing, and started on the CPU raytracer framework.

That apart, I have also been experimenting with GPU code, and here is a small test I did. It is well documented that current nVidia GPUs have a warp size of 32 threads, and that all threads in a warp should ideally take the same branch. The idea is that instructions are issued not to individual threads but to whole warps. So if a warp has 31 threads with nothing to do and 1 thread with heavy computation, the heavy-computation instructions are still issued to all 32 threads; the 31 idle threads simply ignore them (they are masked off). This delays the execution of the entire warp.
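
To picture that scenario, here is a tiny illustrative sketch (not the benchmark kernel below; the kernel name and the output array are made up purely for the example), where only one lane per warp takes the heavy path:

// Illustration only: one lane per warp takes the heavy path, the other 31 just wait.
__global__ void one_busy_lane(float *out)
{
 int idx = blockIdx.x*blockDim.x+threadIdx.x;
 int lane = threadIdx.x % 32;   // lane index within the warp (warp size = 32)

 float acc = 0.0f;
 if (lane == 0)
  // heavy path: only lane 0 needs this, but the instructions are issued to the whole warp
  for (int i = 0; i < 320; i++)
   acc += sinf((float)(idx + i));
 // the other 31 lanes have nothing to do here, yet they wait for lane 0 to finish
 out[idx] = acc;
}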

But how bad is this really? And can we actually verify it? Let's find out!

Here is the sample kernel:

__global__ void simulate_gpu()
{
 int idx=blockIdx.x*blockDim.x+threadIdx.x;

 // note: the theta values are never written out; this kernel exists purely as a timing workload

 // heavy branch: 320 iterations of transcendental math
 if ( idx % 64 < 32)
  for (int i=idx; i < idx+320; i++){
   double theta1 = sin((double)i / 15);
   double theta2 = sin((double)(i + 1) / 15);
   double theta3 = fmax(theta1, theta2);
   double theta4 = cos( sqrt (10.0 * theta3) );
   double theta5 = pow ( theta3, theta4 );
  }
 // light branch: only 4 iterations of the same math
 else
  for (int i=idx; i < idx+4; i++){
   double theta1 = sin((double)i / 15);
   double theta2 = sin((double)(i + 1) / 15);
   double theta3 = fmax(theta1, theta2);
   double theta4 = cos( sqrt (10.0 * theta3) );
   double theta5 = pow ( theta3, theta4 );
  }
}


The if condition above generates a square-wave pattern: the first 32 thread indices evaluate true, the next 32 evaluate false, and so on. Since the warp size is 32, every thread within a given warp takes the same branch (assuming the block size is a multiple of 32). Note that one branch runs 320 iterations, while the other runs only 4.

The running time for this code, for 1994096 kernel executions (including all overheads, as measured using the Unix 'time' command), is 0.872 seconds, averaged over 4 runs.
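
For reference, the host side is nothing more than a launch plus a sync, timed as a whole program from the shell. Something along these lines (the block/grid sizes and the file name are placeholders, not necessarily what I actually used):

// Host-side sketch (assumes simulate_gpu above is in the same .cu file).
#include <cuda_runtime.h>

int main()
{
 const int threadsPerBlock = 256;   // placeholder block size
 const int numBlocks = 7790;        // placeholder; roughly 2M threads in total

 simulate_gpu<<<numBlocks, threadsPerBlock>>>();
 cudaThreadSynchronize();           // make sure the kernel finishes before 'time' stops
 return 0;
}

// Build and run:
//   nvcc divergence.cu -o divergence
//   time ./divergence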

Let's modify the condition so it looks like this:


 if ( idx % 32 < 16)


This doubles the frequency of the square wave: the first 16 thread indices evaluate true, the next 16 false, and so on. Now each warp of 32 threads has 16 threads on each side of the branch, so every warp diverges.

The running time for this is 1.370s (again averaged over 4 runs). That is around 57% longer than the ideal case.

If this is not convincing enough, let's try something else. Let's remove the branch entirely and keep only the heavy path:


__global__ void simulate_gpu()
{
 int idx=blockIdx.x*blockDim.x+threadIdx.x;

 // no branch: every thread runs the heavy 320-iteration path
 for (int i=idx; i < idx+320; i++){
  double theta1 = sin((double)i / 15);
  double theta2 = sin((double)(i + 1) / 15);
  double theta3 = fmax(theta1, theta2);
  double theta4 = cos( sqrt (10.0 * theta3) );
  double theta5 = pow ( theta3, theta4 );
 }
}


This is just the 320-iteration branch from the earlier code, executed unconditionally. Time for execution is 1.394s, which is almost the same as the fully divergent case above (1.370s). In other words, even though half the threads in the divergent version only need 4 iterations, it costs nearly as much as making every thread do all 320 iterations, unless your warps are well behaved. To put things into perspective, running only the 4-iteration branch on its own takes 0.140s.

Granted, this way of measuring execution time is not all that accurate, but it goes to show that optimising code to minimise warp divergence is worth it.
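
If you want cleaner numbers, you could time just the kernel with CUDA events instead of the shell's 'time'. Roughly like this, dropped into the main() sketched earlier (error checking omitted; the launch configuration is still the placeholder one from that sketch):

// Sketch: timing only the kernel with CUDA events (error checking omitted).
// Needs #include <cstdio> for printf.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
simulate_gpu<<<numBlocks, threadsPerBlock>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);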

Also, what happens if we just barely miss perfect warp control flow? What if only one or two threads per warp go the wrong way? Let's find out. By setting the period of the square wave slightly off the ideal (say 62 instead of 64), so that only some warps straddle a branch boundary, we can see how bad it really is. Here are a few such results:


 if ( idx % 62 < 31)


Time reported: 1.063s


 if ( idx % 66 < 33)


Time reported: 1.063s

Hmm, so we don't incur the full penalty. I did a few more tests with smaller periods (56, 48, etc.) and they also report similar execution times...

The next article will hopefully be on memory access optimisations, once I figure it all out and implement CUDA memory access as C++ convenience classes.

9 comments:

  1. double???? Do we have any double-capable GPUs??

  2. "It is well documented that current GPUs have warp sizes of 32 threads "

    Nitpicking: only nVidia GPUs have a warp size of 32 threads. AMD GPUs (the 4xxx series) have 64 as their warp size, though AMD prefers to call them wavefronts. :)

  3. Double? Yes, 8xxx GPUs are double capable, but give around 5x reduced performance on such code (which is why I use double here). The GTX 260/280 have better double support though...

    Secondly, since this is a CUDA article, I guess AMD can have their way. Thanks for the info though.

  4. Note: when I say "double capable", I don't mean full hardware support. It just means that doubles work (albeit at considerably reduced performance).
    nVidia's compute capability 1.3 (the GTX 280, for example) specifies native support for double-precision floating-point numbers.

  5. So double is emulated in software. I didn't know that though.

  6. Correction: it turns out that the double datatype on G92 and older hardware is just a farce... Here is an excerpt from the CUDA Best Practices Guide:
    "When compiling for devices without native double-precision support, such as devices of compute capability 1.2 and earlier, each double variable is converted to single-precision floating-point format (but retains its size of 64 bits) and double-precision arithmetic is demoted to single-precision arithmetic."

  7. So it is there in the compiler just for future software compatibility.

  8. Thank you very much for your great post!
    The discussions are as great as the post. I'd like to learn more about warps, though. The fact that AMD calls them "wavefronts" gives me a better idea of what they mean. That was nice info, RPG.
    Btw, do you know anything about hardware transcendental functions (cos, sin, exp)? I work with double precision on a GTX 280, and I've just heard of these hardware-implemented functions at a conference I attended.

  9. The hardware transcendental functions are only available in single precision. They have a bit less accuracy than what you can get from a software implementation, but are much faster. For the exact specs, look at the CUDA programming guide.
