Saturday, August 29, 2009

A dash of OpenMP goodness

I work on a dual-socket, dual-core 1.8 GHz Opteron at the lab, and was wondering how much performance I would gain from a simple OpenMP 'parallel for' invocation...

Turns out that by simply adding this one line, above one of the for loops that controls ray generation, things can get pretty fast:

#pragma omp parallel for num_threads(8)

I have 4 cores, and I did tests with no threading, 4 threads and 8 threads.
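For context, here is a minimal sketch of where such a pragma could sit. It assumes that every image row can be rendered independently; `shadePixel` and `renderFrame` are hypothetical names standing in for the real ray generation code, not the functions used in this project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-pixel ray tracing work.
inline float shadePixel(int x, int y) {
    return static_cast<float>((x * 31 + y * 17) % 256) / 255.0f;
}

// Renders a width x height frame. Each loop iteration writes only to its own
// row of the image, so rows can be handed to different threads without locks.
std::vector<float> renderFrame(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<float> image(static_cast<std::size_t>(width) * height);
    #pragma omp parallel for num_threads(8)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            image[static_cast<std::size_t>(y) * width + x] = shadePixel(x, y);
        }
    }
    return image;
}
```

With g++ the pragma only takes effect when compiling with -fopenmp; without that flag it is ignored and the loop runs serially.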

Here are the CPU occupancy charts from system monitor:


Single Thread



Four Threads



Eight Threads


Performance:
1 Thread : 78 seconds/frame
4 Threads: 35 seconds/frame
8 Threads: 21 seconds/frame

Of course, the "clock()" function is misleading in multi-threaded applications: it reports processor time summed over all threads rather than elapsed time, so it shows roughly the same value (78 seconds) no matter how long the frame actually takes. I measured these off a wall clock.
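A rough sketch of how both numbers could be captured in code instead of with a stopwatch, assuming a C++11 (or later) compiler. `Timing` and `timeIt` are hypothetical helper names. The processor time comes from std::clock(), the elapsed time from std::chrono::steady_clock.

```cpp
#include <chrono>
#include <ctime>

struct Timing {
    double cpuSeconds;   // processor time used by the process (all threads)
    double wallSeconds;  // elapsed real time from a monotonic clock
};

// Runs `work` once and reports both processor time and elapsed time.
template <typename Work>
Timing timeIt(Work work) {
    const std::clock_t cpuStart = std::clock();
    const auto wallStart = std::chrono::steady_clock::now();
    work();
    const auto wallEnd = std::chrono::steady_clock::now();
    const std::clock_t cpuEnd = std::clock();

    Timing t;
    t.cpuSeconds = static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    t.wallSeconds = std::chrono::duration<double>(wallEnd - wallStart).count();
    return t;
}
```

In an OpenMP build, omp_get_wtime() from omp.h is another way to read elapsed time.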

There seems to be some contention between threads (possibly a race on shared state), because the images from the single-threaded and multi-threaded renders differ by a very small margin. The diff image has to be enhanced massively to see this: 4 pixels are off by 25%, and many pixels differ by 1 color value (out of 255). The images are otherwise indistinguishable to the eye. I need to fix this issue, but I will be taking a break from this project for a while, so updates may come after around 20 days.
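An enhanced diff image of this kind can be sketched as follows, assuming both renders are stored as flat arrays of 8-bit channel values of equal size. `diffImage` is a hypothetical helper, and the gain of 25 simply mirrors the 25 x diff image shown in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns |a - b| * gain for every channel value, clamped to 255.
std::vector<std::uint8_t> diffImage(const std::vector<std::uint8_t>& a,
                                    const std::vector<std::uint8_t>& b,
                                    int gain) {
    assert(a.size() == b.size());
    std::vector<std::uint8_t> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        out[i] = static_cast<std::uint8_t>(std::min(255, d * gain));
    }
    return out;
}
```

With a gain of 25, a difference of a single color value shows up as a dark gray of 25, while anything off by 11 or more saturates to white.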


Single Threaded Render



8 Thread Render



25 x Diff image

Monday, August 24, 2009

Ray-bounce improvements and supersampling

Two small updates are in order.

First, the easy one: anti-aliasing. I have been rendering larger images and downscaling them to achieve anti-aliasing, but I have now added long-pending support for multiple random rays per pixel. This way, I can achieve arbitrary sampling rates like 5 rays per pixel, rather than relying on rendering larger and larger images.
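A minimal sketch of per-pixel random supersampling, assuming a grayscale result and a uniform random offset inside the pixel for each ray. `samplePixel` and the `shade` callable are hypothetical and stand in for the tracer's real per-ray function.

```cpp
#include <cassert>
#include <random>

// Averages `samples` shading results taken at random positions inside pixel
// (px, py). `shade` maps continuous image coordinates to a gray value.
template <typename Shade>
double samplePixel(int px, int py, int samples, std::mt19937& rng, Shade shade) {
    assert(samples > 0);
    std::uniform_real_distribution<double> jitter(0.0, 1.0);
    double sum = 0.0;
    for (int s = 0; s < samples; ++s) {
        const double x = px + jitter(rng);  // random position inside the pixel
        const double y = py + jitter(rng);
        sum += shade(x, y);
    }
    return sum / samples;
}
```

Because the sample count is just a loop bound, any rate (5 rays per pixel, for instance) works, at a cost that grows with the number of samples.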


No super sampling (3.2 seconds)



8x super sampling (23 seconds)


Of course, nothing comes for free here, because the running time increases linearly with the number of samples. But here is a theory that I have: on GPUs, a larger number of rays per pixel should incur proportionally less overhead. Why? Because of ray coherence. Since all these rays travel (mostly) the same path, the amount of warp divergence will be very low. In a 2x2 pixel scenario, if we do 8x supersampling for each pixel, the entire warp (8x4 = 32 threads) will be mostly well behaved.

Anyway, now coming to the slightly more obscure improvement. This has to do with an algorithm that I developed for 'seamless' raytracing of point models. Earlier, we had conditions like "if the new collision point is within 0.1 units of the ray's old collision point, then ignore this collision point". This is needed because splat-based models are not consistent. There can be multiple overlapping splats in the scene, and often a ray reflected or refracted from one splat will hit the immediate next splat. This should not happen, because it's like the ray hitting the same surface twice. Since we are not doing sub-surface scattering :P, we would not like this to happen. But such a naive condition also causes problems with legitimate collisions. For example, we see artifacts at wall corners and at interfaces between objects, where we would expect legitimate ray-hits but the condition prevents them from happening.

So, the alternative idea is as follows: assume the entire scene is made of closed objects, i.e., there are no loose hanging objects whose back faces we can see from outside. We add a boolean parameter to each ray, stating which face of an object it would like to hit next (front face or back face). On reflection, this parameter remains the same. On refraction, it flips (true to false, false to true). Initially, all primary rays want to hit front faces of objects, and back-facing collisions are ignored. Similarly, after the first refraction, rays would like to hit the back face of a splat, and front-facing collisions are ignored. This way, we can prevent same-surface collisions. Here is an image illustrating the virtues of this method:
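The face-flag rule can be sketched roughly like this. The sketch assumes that each splat carries an outward-pointing normal and that "front face" means the normal opposes the ray direction. `Vec3`, `Ray`, `acceptHit`, `reflectRay` and `refractRay` are hypothetical names, not the ones in the actual code.

```cpp
struct Vec3 { double x, y, z; };

inline double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct Ray {
    Vec3 origin;
    Vec3 dir;
    bool wantsFrontFace;  // primary rays start with true
};

// A hit counts only when the splat shows the ray the face it is looking for.
inline bool acceptHit(const Ray& ray, const Vec3& splatNormal) {
    const bool frontFacing = dot(ray.dir, splatNormal) < 0.0;
    return frontFacing == ray.wantsFrontFace;
}

// Reflection stays on the same side of the surface: the flag is kept.
inline Ray reflectRay(const Ray& in, const Vec3& hitPoint, const Vec3& newDir) {
    return Ray{hitPoint, newDir, in.wantsFrontFace};
}

// Refraction crosses the surface: the flag flips.
inline Ray refractRay(const Ray& in, const Vec3& hitPoint, const Vec3& newDir) {
    return Ray{hitPoint, newDir, !in.wantsFrontFace};
}
```

A refracted ray that immediately meets a neighbouring splat of the surface it just entered sees that splat as front-facing and skips it, with no distance threshold involved.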


Without Seamless raytracing



With Seamless raytracing

Tuesday, August 18, 2009

Eigen2 Debug mode

Eigen's main page claims "Disabling asserts, by defining -DNDEBUG or -DEIGEN_NO_DEBUG, improves performance in some cases."
What they mean is that we need to call g++ with the -DEIGEN_NO_DEBUG option.

In this post, we test the effect of disabling asserts on a few test scenes. The first scene is a 512x512 render of a refractive sphere with reflective Cornell room walls. The second scene is the same room, but with 9 refractive spheres. The third scene is obtained by rendering the second scene at 1024x1024 with shadow rays. All scenes have 2 light sources.

Average running times with default settings (Eigen in debug mode) were 3.6 seconds for the first scene, 4.3 seconds for the second scene and 47 seconds for the third scene.



Disabling asserts reduces run times pretty drastically (to approximately 2.4, 2.8 and 33 seconds for the three scenes described above). That is a clear reduction of roughly 30% or more in every case. Anyway, debugging with Eigen is a nightmare (way too much boilerplate to sift through), and I use my own Vector library while doing serious debugging.

Octree boundary artifacts

I have traced the small artifacts that were popping up in the renders, to issues caused by octree leaf borders. It turns out that an octree border cutting through a smooth surface causes a slice to appear on the surface. This can be seen in the images below. At level 10, there are several small octree nodes, and they cause a large number of slices to appear on the surface. At level 4, there are very few borders of octree nodes, and therefore, we see no artifacts in the render. A more serious side effect is that rays start 'leaking' through the surface at these border points, and cause holes to appear on the surface. In animations, these holes pop in and out in a very distracting manner.


10 level octree (1.2 seconds to render)


8 level octree (1.3 seconds to render)


6 level octree (5 seconds to render)


4 level octree (80 seconds to render)


I don't have a solution to this problem yet, but I am thinking along the lines of adding splats to multiple octree leaves, or checking multiple nearby octree leaves while shading (rather than a single octree leaf).
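The first idea (adding a splat to every leaf it touches) boils down to an overlap test between a splat and a leaf's bounding box. Below is a minimal sketch that assumes splats are approximated by bounding spheres and leaves by axis-aligned boxes. `Box`, `Splat` and `overlaps` are hypothetical names, and this is not something the renderer does yet.

```cpp
#include <algorithm>

struct Box { double min[3]; double max[3]; };         // octree leaf bounds
struct Splat { double center[3]; double radius; };    // bounding sphere of a splat

// True when the splat's bounding sphere touches the box. Uses the standard
// closest-point test: clamp the center to the box, then compare distances.
inline bool overlaps(const Splat& s, const Box& b) {
    double dist2 = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double closest = std::max(b.min[i], std::min(s.center[i], b.max[i]));
        const double d = s.center[i] - closest;
        dist2 += d * d;
    }
    return dist2 <= s.radius * s.radius;
}
```

A splat sitting near a leaf border would then be stored in both neighbouring leaves, at the cost of extra memory and some duplicate intersection tests.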

Shadow Rays

Added a visibility function that works between any 2 points in space (inside or outside the octree).
This means that I can finally implement shadow rays. [Maybe even radiosity :) but that is for another day]
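To illustrate the idea of a point-to-point visibility test, here is a simplified sketch that uses spheres as occluders instead of splats in an octree. That substitution is an assumption made purely to keep the example short, and `Sphere`, `segmentHitsSphere` and `visible` are hypothetical names.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
struct Sphere { Vec3 center; double radius; };

inline Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// True when the segment a -> b passes through the sphere.
// Solves |a + t (b - a) - center|^2 = r^2 and checks for overlap with 0 < t < 1.
inline bool segmentHitsSphere(const Vec3& a, const Vec3& b, const Sphere& s) {
    const double eps = 1e-6;  // keeps hits exactly at the endpoints from counting
    const Vec3 d = sub(b, a);
    const Vec3 m = sub(a, s.center);
    const double A = dot(d, d);
    if (A == 0.0) return false;  // degenerate segment
    const double B = 2.0 * dot(d, m);
    const double C = dot(m, m) - s.radius * s.radius;
    const double disc = B * B - 4.0 * A * C;
    if (disc < 0.0) return false;
    const double root = std::sqrt(disc);
    const double tNear = (-B - root) / (2.0 * A);
    const double tFar = (-B + root) / (2.0 * A);
    return tNear < 1.0 - eps && tFar > eps;
}

// A point b is visible from a when no occluder blocks the segment between them.
inline bool visible(const Vec3& a, const Vec3& b, const std::vector<Sphere>& occluders) {
    for (const Sphere& s : occluders) {
        if (segmentHitsSphere(a, b, s)) return false;
    }
    return true;
}
```

A shadow ray is then just `visible(hitPoint, lightPosition, scene)` evaluated once per light, which is where the extra cost comes from.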

But for the time being, here are the results of shadowing:


Single Light source



2 Light sources


Shadow rays can be computationally heavy as I found out:


Not much time to spend today, so that's all for now.

Sunday, August 16, 2009

Reflection

Reflection, for which functionality was already in place a few days ago, is now working. Here are a few test images. Enabling reflection on all the walls has caused a considerable performance hit. Renders are now around 3 times slower.


Reflective Walls (Max Bounces: 30). Render time: 3.9 seconds


When I say max bounces are 30, it does not mean that all rays bounce 30 times. Most rays don't even make it past 7 or 8 bounces. The reflection coefficient in these images is 0.8 for the walls, and the refraction coefficient for spheres is also 0.8. Rays are traced only while the ray intensity (which starts off as 1.0) is at least 0.01. Each bounce reduces this 'intensity' value, thereby limiting the number of bounces. So, 30 is just a 'very big number'.
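The cutoff can be sketched as a simple loop. This sketch assumes that each bounce multiplies the intensity by the surface coefficient and nothing else; the exact attenuation rule is not spelled out here, so treat the result as a rough upper bound rather than what the tracer really does. `bounceLimit` is a hypothetical helper.

```cpp
#include <cassert>

// Counts how many bounces a ray gets before its intensity drops below
// `threshold`, or before it reaches the hard cap `maxBounces`.
inline int bounceLimit(double coefficient, double threshold, int maxBounces) {
    assert(coefficient > 0.0 && coefficient < 1.0 && threshold > 0.0);
    double intensity = 1.0;
    int bounces = 0;
    while (bounces < maxBounces && intensity >= threshold) {
        intensity *= coefficient;  // each bounce attenuates the ray
        ++bounces;
    }
    return bounces;
}
```

Under this assumption, a coefficient of 0.8 with a threshold of 0.01 stops a ray after 21 bounces, so a cap of 30 never kicks in, while a cap of 5 always does.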


Reflective Walls (Max Bounces: 5). Render Time: 3.16 seconds


Here we see a limited bounce result. With only 5 bounces, the reflections of the spheres lack detail. But we save some time this way.


Diff of 5 and 30 bounce images


Clearly, there is a considerable difference between the two images, but the better image also takes around 20-25% longer to render.


Reduced wall reflection coefficient to 0.4


Here is a more aesthetically pleasing version of the image, with reflection coefficient reduced to 0.4. The diffuse coloring of the walls is more pronounced than the reflections.

Finally, I rendered a high resolution version of the image (4096x4096) and downsampled it (1024x1024) to achieve an anti-aliased image of considerable quality. It took an eternity to render (229 seconds), but looks much better than the 512x512 version.
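The downsampling step can be sketched as a plain box filter, assuming a square grayscale image stored row by row (the actual tool and filter used for the downsampling are not stated here). `downsample` is a hypothetical helper; a 4096 to 1024 reduction corresponds to a factor of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages each factor x factor block of a square image into one output pixel.
std::vector<float> downsample(const std::vector<float>& src, int srcSize, int factor) {
    assert(factor > 0 && srcSize > 0 && srcSize % factor == 0);
    assert(src.size() == static_cast<std::size_t>(srcSize) * srcSize);
    const int dstSize = srcSize / factor;
    std::vector<float> dst(static_cast<std::size_t>(dstSize) * dstSize);
    for (int y = 0; y < dstSize; ++y) {
        for (int x = 0; x < dstSize; ++x) {
            float sum = 0.0f;
            for (int dy = 0; dy < factor; ++dy) {
                for (int dx = 0; dx < factor; ++dx) {
                    const std::size_t row = static_cast<std::size_t>(y) * factor + dy;
                    const std::size_t col = static_cast<std::size_t>(x) * factor + dx;
                    sum += src[row * srcSize + col];
                }
            }
            dst[static_cast<std::size_t>(y) * dstSize + x] = sum / (factor * factor);
        }
    }
    return dst;
}
```

Each output pixel is then the average of 16 traced rays, which is why this route costs about 16 times as many rays as a direct 1024x1024 render.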



One small note about the images: you can notice black lines running along the corners of the Cornell room. This is because of a 'minimum distance between bounces' threshold that has been set in the code. This prevents rays from bouncing twice off the same surface. Sadly, it also has these side effects...

Another issue pending is stochastic reflection and refraction. Currently the system does not support surfaces that both reflect and refract, because rays never spawn off multiple secondary rays. To get this to work, the tracing engine has to trace multiple rays through each pixel. Some of these rays should reflect, and some should refract, depending on the coefficients of reflection and refraction. Monte Carlo integration may help out in blending these results together nicely.
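One way the stochastic choice could look is sketched below. It assumes the two coefficients sum to at most 1 and treats them directly as probabilities, with the remainder standing for "no secondary ray". `Bounce` and `chooseBounce` are hypothetical names, and none of this is implemented in the tracer yet.

```cpp
#include <cassert>
#include <random>

enum class Bounce { Reflect, Refract, Absorb };

// Picks one secondary ray type per sample, with probabilities equal to the
// surface coefficients. Averaging many samples per pixel blends the results.
inline Bounce chooseBounce(double kReflect, double kRefract, std::mt19937& rng) {
    assert(kReflect >= 0.0 && kRefract >= 0.0 && kReflect + kRefract <= 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double u = uniform(rng);
    if (u < kReflect) return Bounce::Reflect;
    if (u < kReflect + kRefract) return Bounce::Refract;
    return Bounce::Absorb;  // no secondary ray for this sample
}
```

Since each ray still spawns at most one secondary ray, this fits the current one-ray-per-bounce design and relies on multiple rays per pixel to average out the noise.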

Next on the agenda is splat blending. The refraction results look extremely noisy, and will hopefully benefit from some blending. Also to do is supersampling support, after which we will want adaptive supersampling.

Eigen2 performance

This is a short post on how the average runtime of the raytracer is affected when using the Eigen2 library (with compiler options -O2, -msse, -msse2), compared with rolling your own simple 3D float vector class. I have been using Vector3f from Eigen2, and have not paid any special attention to speed. I just use vectors wherever I can (positions and colors mostly). I have also not bothered about alignment of structures (obviously Vector4f would align properly, but would considerably increase memory requirements).

I was assuming that Eigen2 would not give me a significant speed boost, because raytracing is not a compute-heavy problem (it is mostly bound by memory access). It turns out that I was wrong:



There is almost a 2x performance gain when I use Eigen2. Very surprising really, because the amount of work it took me is zero. All I had to do was include the right headers. Maybe when I have more time, I will make a branch of the project with Vector4f and see how alignment affects things.