Turns out that simply adding this one line above one of the for loops that control ray generation makes things pretty fast:
#pragma omp parallel for num_threads(8)
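For illustration, here is a minimal sketch of where such a line goes. The names render and trace_pixel are made up for this example and are not taken from the actual ray tracer; the pixel formula is just a placeholder for real ray tracing work. It assumes each pixel can be computed independently and that every iteration writes only to its own slot in the image buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tracing one primary ray. A real tracer would
// intersect the scene here. It only reads its arguments, so calls are independent.
static unsigned char trace_pixel(int x, int y, int width, int height) {
    return static_cast<unsigned char>((255 * (x + y)) / (width + height - 2));
}

// Renders a grayscale image, splitting the rows of the outer loop across threads.
std::vector<unsigned char> render(int width, int height) {
    assert(width > 0 && height > 0 && width + height > 2);
    std::vector<unsigned char> image(static_cast<std::size_t>(width) * height);

    // Each iteration of the y loop touches only its own row of the buffer.
    #pragma omp parallel for num_threads(8)
    for (int y = 0; y < height; ++y) {
        // x is declared inside the parallel loop, so every thread has its own copy.
        for (int x = 0; x < width; ++x) {
            image[static_cast<std::size_t>(y) * width + x] =
                trace_pixel(x, y, width, height);
        }
    }
    return image;
}
```

With GCC the pragma only takes effect when compiling with -fopenmp; without that flag it is ignored and the loop simply runs on one thread, which makes for an easy single-threaded comparison.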
I have 4 cores, and I did tests with no threading, 4 threads and 8 threads.
Here are the CPU occupancy charts from system monitor:
Performance:
1 Thread : 78 seconds/frame
4 Threads: 35 seconds/frame
8 Threads: 21 seconds/frame
Of course, the "clock()" function is misleading in multi-threaded applications: it reports CPU time added up across all threads rather than elapsed time, so it shows the same time (78 seconds) no matter what the actual time taken is. I measured these off a wall clock.
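A wall-clock timer inside the program would avoid the stopwatch. The sketch below is one way to do it, assuming C++11 is available; Timing and time_it are names invented for this example. It records both std::chrono::steady_clock (elapsed time) and std::clock (processor time), so the two numbers can be compared side by side. OpenMP's own omp_get_wtime() would be another option for the elapsed time.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Both measurements for one call, in seconds.
struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time reported by std::clock for the process
};

// Runs any callable once and measures it with both clocks.
template <typename Work>
Timing time_it(Work work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    assert(wall_end >= wall_start);  // steady_clock never goes backwards

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping the frame render in a lambda and passing it to time_it would give the seconds/frame figure directly from wall_seconds.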
There seems to be some contention between the threads (probably a data race on some shared state), because the images from the single-threaded and multi-threaded renders differ by a very, very small margin. The diff image has to be enhanced massively to see this. 4 pixels are off by 25%. Many pixels have a difference of 1 color value (out of 255). The images are otherwise indistinguishable to the eye. I need to fix this issue, but I will be taking a break from this project for a while, so updates may come after around 20 days.
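Counting the differences in code is easier than squinting at an enhanced diff image. This is a rough sketch, assuming both renders are available as flat buffers of 8-bit values with the same size; DiffStats and compare_images are hypothetical names, not part of the ray tracer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct DiffStats {
    std::size_t differing;  // number of values that are not identical
    int max_delta;          // largest absolute difference, on the 0..255 scale
};

// Compares two renders value by value and summarizes how far apart they are.
DiffStats compare_images(const std::vector<unsigned char>& a,
                         const std::vector<unsigned char>& b) {
    assert(a.size() == b.size());
    DiffStats stats = {0, 0};
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int delta = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        if (delta != 0) ++stats.differing;
        if (delta > stats.max_delta) stats.max_delta = delta;
    }
    return stats;
}
```

If the single-threaded and multi-threaded renders ever match exactly, differing comes back as 0, which makes this a handy regression check once the issue is fixed.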
It seems odd that you need more threads than cores to saturate your CPU, but it is not unheard of. I once read that a Gentoo dev had to do make -j256 to saturate his quad core. :)