Sunday, January 3, 2010

CUDA multi-GPU textures

Those of you who have used CUDA textures will be familiar with the insane programming interface they present. Texture references can be declared only as global variables, and only in the file where you use them. You can't pass them to functions by value or by reference, you can't declare them at function or class scope, and you can't make arrays of them.
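
To make the restrictions concrete, here is a minimal sketch (the kernel and names are mine, not from any particular sample):

---------------
// The ONLY legal form: a texture reference at file scope.
texture<float, 1, cudaReadModeElementType> tex_data;

__global__ void squareKernel(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = tex1Dfetch(tex_data, i);   // read input through the texture
        out[i] = v * v;
    }
}

// None of these compile:
// void f(texture<float, 1> t) { }               // texture as a function argument
// texture<float, 1> tex_array[2];               // array of texture references
// struct Solver { texture<float, 1> tex; };     // texture as a class member
---------------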

This brings us to an interesting point. Multi-GPU programming in CUDA is handled by spawning multiple CPU threads, with each CPU thread initializing its own GPU. In each thread, you create thread-specific variables that store pointers to device memory, and you allocate device memory in EACH thread. That's right: since memory is not shared between GPUs, if you are doing the same thing on many GPUs, you need to allocate the same stuff again and again, once on each GPU (of course, if you are doing different things on different GPUs, that's a different ball game).
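
In code, each CPU thread ends up doing something along these lines (a rough sketch; the names and missing error handling are mine):

---------------
// Each CPU thread runs this with a different GPU index.
void workerThread(int gpuId, const float* h_data, int n)
{
    cudaSetDevice(gpuId);                 // attach this thread to its own GPU

    // A per-thread pointer: the SAME allocation is repeated on every GPU.
    float* d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... bind textures, launch kernels, copy results back ...

    cudaFree(d_data);
}
---------------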

Now comes the fun part. Since you cannot create arrays of texture references, and you cannot encapsulate them inside a class or structure, how on earth can you create one texture reference for each GPU? Should we hard-code this? Should we replicate our code for each GPU we have, changing the name of the texture reference to something unique each time? The answer is NO!!

Well, it turns out that nVidia has sneakily done something here. When we spawn multiple CPU threads and select a different GPU in each thread, CUDA "automagically" creates a separate copy of the texture reference for each GPU. So all you have to do is bind the SAME texture reference again and again, once in each CPU thread (i.e., once for each GPU). It may look weird, because it looks like we are initializing the same(?) global variable several times, but it actually works.
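
Concretely, the worker from the earlier sketch just binds the one global reference (still a sketch, reusing the hypothetical squareKernel from above):

---------------
texture<float, 1, cudaReadModeElementType> tex_data;   // ONE global declaration

void workerThread(int gpuId, const float* h_data, float* h_out, int n)
{
    cudaSetDevice(gpuId);                     // this thread's own GPU

    float *d_data = 0, *d_out = 0;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMalloc((void**)&d_out,  n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Bind the SAME global reference in every thread; each GPU's
    // context silently gets its own independent copy of it.
    cudaBindTexture(0, tex_data, d_data, n * sizeof(float));
    squareKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(tex_data);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
}
---------------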

I have uploaded a small code sample here that demonstrates this. It loads a bunch of numbers onto the GPUs and squares them. The sample uses some helper classes that I wrote to simplify CUDA development. This is still ongoing work, and may not be as polished as a framework/library should be. You will also need the Boost C++ libraries (the CPU threading and synchronization use Boost).

11 comments:

  1. Man you rock!!!
    I just faced this very conundrum: "Since textures are created globally in the stub code, how the hell do you create private (per CPU thread) copies?"

    But you delivered once more. Kudos!!!

    Quick question though: will this work in a parallel OMP region? (Meaning: if I create the references in the stub file where the region resides, and then bind the same texture reference, will it work?)

    Please let me know :)
    Thanks

    ReplyDelete
  2. I have not tried OpenMP with this (I used Boost threads), but it should work the same way. As long as you limit the number of threads in the region to the number of GPUs you have, and select a different GPU in each thread, I see no reason for things to go wrong (unless OpenMP does some underhanded stuff). There's a sketch at the end of this comment.


    Since I am no longer at university, I no longer have access to the GTX 295, so I can't test multi-GPU situations. Do let me know if it works :)
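
    Something like this is what I would try (untested, off the top of my head):

    ---------------
    #include <omp.h>

    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);

    #pragma omp parallel num_threads(numGPUs)
    {
        cudaSetDevice(omp_get_thread_num());   // one GPU per OpenMP thread

        // ...allocate memory, bind the shared texture reference, launch
        // kernels, and unbind -- exactly as with boost/pthread threads...
    }
    ---------------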

    ReplyDelete
  3. I am still writing the code. Another question:

    I have a class for the CUDA solver. The class blueprint is in a header file, and I have multiple .cpp files for the different functions.

    If I have one file for the stub, where I define the textures at file scope, and another 'cuda header' file for the kernel, will the kernel be able to access the textures if I redeclare them in the kernel file with 'extern'? Or do I have to move the kernel to the file where the textures are first declared?

    Thanks!!
    Adam

    PS: once I make it work, I'll let you know how it turns out

    ReplyDelete
  4. Textures don't work with 'extern' as far as I remember. You need to have the texture declaration AND the functions using the texture in the same file.

    The best I could come up with was: declare the texture in 'file1.cu', write the CUDA functions in 'file2.cu', and #include file2 inside file1...
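
    Something like this (a sketch, with made-up names):

    ---------------
    // file1.cu -- declares the texture at file scope, then pulls the kernels in
    texture<float, 1, cudaReadModeElementType> tex_data;
    #include "file2.cu"

    // file2.cu -- uses the texture; only ever compiled via the #include above
    __global__ void myKernel(float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch(tex_data, i);
    }
    ---------------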

    ReplyDelete
  5. Hi,

    Thanks for your post. I have a similar case with two GPUs (one host thread assigned to each GPU). All goes well when I run only one host thread/GPU at a time. However, my global textures seem to interfere with each other when I run both threads concurrently. Is there anything obvious I need to look for in the binding and handling of my global texture in each host thread?

    Thanks for any pointers.

    ReplyDelete
  6. As far as I remember, I didn't run into any gotchas... The texture references should be independent in the two host threads. When doing multi-GPU, check that you first spawn the two host threads and then initialize the GPUs, allocate memory, and bind references (basically everything) separately in each thread... The link to my code seems to be broken. I will try to upload it elsewhere, or I can mail it to you.

    ReplyDelete
  7. Thanks,

    Indeed, both threads (pthreads) start and each selects a GPU in its own context; then resources are allocated and kernels launched on each GPU. Sure, a new link for your sample code would be great. It can only be a mistake on my part, given that these global textures are meant to be replicated in each host thread... Worst case, I could convert them to pitched arrays, but I would like to avoid that :-)!

    Thanks again for the response.

    ReplyDelete
  8. Hi Will,

    Here is a link (hope it works):

    http://dl.dropbox.com/u/2148706/multiGPU_test.tar.gz

    The code uses Boost threads and is a 'bit' convoluted... I was trying to abstract out the whole kernel-launch process, and this was a test program for that. Also, sorry about the (lack of) code comments.
    Hope the code works, and that it makes sense.

    ReplyDelete
  9. Hi,

    Thanks for the code sample, it did help to confirm that global textures are indeed replicated independently across multiple host threads.

    That said, here is the 'gotcha' that I came across, and to be honest my workaround seems like brute force (any comments or suggestions welcome). To summarise:

    1- I established that my global textures were getting messed up by multiple host threads. My textures are bound before the kernels are launched in their own streams, then unbound.

    2- To make it all work, I resorted to putting a global pthread mutex around the CUDA bind/launch/unbind calls, like so:

    ---------------
    pthread_mutex_lock(&s_lock);

    // Set the texture parameters
    _text.normalized = false;
    _text.filterMode = cudaFilterModePoint;
    _text.addressMode[0] = cudaAddressModeClamp;
    _text.addressMode[1] = cudaAddressModeClamp;

    // Bind the array to the texture, launch, then unbind
    CUDA_SAFE_CALL( cudaBindTextureToArray(_text, array) );
    myKernel<<< grid, threads, 0, stream >>>( .. );
    CUDA_SAFE_CALL( cudaUnbindTexture(_text) );

    pthread_mutex_unlock(&s_lock);
    ---------------

    ... I am not convinced this is the best approach, but it certainly works.

    Will.

    ReplyDelete
  10. Hi Will,

    I think there may be a problem with the current solution. The host threads are effectively serialized now, so the kernels also run one after another, and the two GPUs are not being used simultaneously.

    You can try enclosing different parts of the above code within the lock to see where the conflict is actually happening. For example, you can unlock before the kernel launch and lock again after it (see the snippet at the end of this comment). That way, at least you will use both GPUs simultaneously.

    <<< kernel launches are non-blocking, but the next call to cudaUnbindTexture() probably has an implicit call to cudaThreadSynchronize() >>>

    In an ideal world, you should not have to do all this locking and unlocking.

    By the way, while the global texture reference is common to the threads, I hope you have two separate instances (one in each host thread) of the 'array' variable you used in the code snippet. (You have most probably done this already, but just making sure.)
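
    The unlock-before-launch version would look roughly like this (untested):

    ---------------
    pthread_mutex_lock(&s_lock);
    // ...texture parameter setup as before...
    CUDA_SAFE_CALL( cudaBindTextureToArray(_text, array) );
    pthread_mutex_unlock(&s_lock);

    // Launch outside the lock, so both GPUs can run at the same time.
    myKernel<<< grid, threads, 0, stream >>>( .. );

    pthread_mutex_lock(&s_lock);
    CUDA_SAFE_CALL( cudaUnbindTexture(_text) );
    pthread_mutex_unlock(&s_lock);
    ---------------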

    ReplyDelete
  11. Mmmh, indeed, I am not convinced by the brute-force approach either. Overall, the kernels appear to still be streamed, except that I am not sure how cudaUnbindTexture behaves; I'll have to look at whether it does an implicit sync. You are probably right that this is not providing the best CPU/GPU overlap and GPU concurrency. I am seeing some evidence of that... no doubt it could be better if I can get to the bottom of this issue at some point.

    Sure, and thanks for checking: my arrays are separate instances in each thread, but bound to the same global texture :-)

    Regards,
    Will.

    ReplyDelete