Saturday, August 29, 2009

A dash of OpenMP goodness

I work on a dual-socket, dual-core 1.8 GHz Opteron machine at the lab, and was wondering how much performance gain I would get out of a simple OpenMP 'parallel for' invocation...

Turns out that simply adding this one line, above the for loop that controls ray generation, makes things pretty fast:

#pragma omp parallel for num_threads(8)
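For context, here is roughly where that line sits; a minimal sketch, with renderPixel and the surrounding names standing in for the actual ray-generation code:

#include <vector>

struct Color { float r, g, b; };

// Hypothetical stand-in for the per-pixel ray generation and trace.
Color renderPixel(int x, int y);

void renderFrame(std::vector<Color>& image, int width, int height)
{
  // Rows are independent, so OpenMP can hand iterations of the outer
  // loop to threads; each thread writes a disjoint slice of 'image'.
  #pragma omp parallel for num_threads(8)
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      image[y * width + x] = renderPixel(x, y);
}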

I have 4 cores, and I did tests with no threading, 4 threads and 8 threads.

Here are the CPU occupancy charts from system monitor:


Single Thread



Four Threads



Eight Threads


Performance:
1 Thread: 78 seconds/frame
4 Threads: 35 seconds/frame
8 Threads: 21 seconds/frame

Of course, the "clock()" function is misleading in multi-threaded applications: it reports CPU time summed across all threads, not elapsed wall time, so it shows roughly the same figure (78 seconds) no matter how long the render actually takes. I measured these timings off a wall clock.
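For a programmatic alternative to the wall clock, OpenMP's own timer reports elapsed wall time; a minimal sketch (renderOneFrame is a hypothetical stand-in):

#include <omp.h>
#include <cstdio>

void renderOneFrame();   // hypothetical stand-in for one frame's work

void timedRender()
{
  double t0 = omp_get_wtime();   // elapsed wall-clock seconds, unlike clock()
  renderOneFrame();
  double t1 = omp_get_wtime();
  std::printf("frame took %.2f s\n", t1 - t0);
}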

There seems to be some contention, due to which the single-threaded and multi-threaded renders differ (by a very, very small margin). The diff image has to be enhanced massively to see this: 4 pixels are off by 25%, and many pixels differ by 1 color value (out of 255). The images are otherwise indistinguishable to the eye. I need to fix this issue, but I will be taking a break from this project for a while, so updates may come after around 20 days.


Single Threaded Render



8 Thread Render



25 x Diff image

Monday, August 24, 2009

Ray-bounce improvements and supersampling

2 small updates are in order.

First, the easy one: anti-aliasing. I had been rendering larger images and downscaling them to achieve anti-aliasing, but I have now added long-pending support for multiple random rays per pixel. This way, I can achieve arbitrary sampling rates like 5 rays per pixel, rather than relying on rendering larger and larger images.
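A minimal sketch of the per-pixel loop (traceRay and the jitter scheme are stand-ins, not the actual code):

#include <cstdlib>

struct Color { float r, g, b; };

// Hypothetical: trace one ray through image-plane point (px, py).
Color traceRay(float px, float py);

// Average n jittered rays per pixel; n can be any value (e.g. 5),
// unlike render-large-and-downscale, which only gives square factors.
Color samplePixel(int x, int y, int n)
{
  Color sum = {0.0f, 0.0f, 0.0f};
  for (int s = 0; s < n; ++s) {
    float jx = (float)std::rand() / RAND_MAX;   // jitter in [0,1]
    float jy = (float)std::rand() / RAND_MAX;
    Color c = traceRay(x + jx, y + jy);
    sum.r += c.r; sum.g += c.g; sum.b += c.b;
  }
  sum.r /= n; sum.g /= n; sum.b /= n;
  return sum;
}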


No super sampling (3.2 seconds)



8x super sampling (23 seconds)


Of course, nothing nice here, because the running time increases linearly with the number of samples. But here is a theory that I have: on GPUs, larger numbers of rays per pixel will incur less and less overhead. Why? Ray coherence. Since all these rays travel (mostly) the same path, the amount of warp divergence will be very low. In a 2x2-pixel scenario with 8x supersampling per pixel, the entire warp (8x4 = 32 threads) will be well behaved, mostly.

Anyway, now coming to the slightly more obscure improvement. This has to do with an algorithm that I developed for 'seamless' raytracing of point models. Earlier, we had conditions like "if the new collision point is within 0.1 units of the ray's old collision point, then ignore this collision point". This is needed because splat-based models are not consistent: there can be multiple overlapping splats in the scene, and often a ray reflected or refracted from one splat will hit the immediately next splat. This should not happen, because it is like the ray hitting the same surface twice, and since we are not doing sub-surface scattering :P, we would not like this to happen. But such a naive condition also rejects legitimate collisions. For example, we see artifacts at wall corners and at interfaces between objects, where we would expect legitimate ray-hits but the condition prevents them from happening.

So, the alternative idea is as follows. Assume the entire scene is made of closed objects, i.e. there are no loose hanging objects whose back faces we can see from outside. We add a boolean parameter to each ray, stating which face of an object it would like to hit next (front face or back face). On reflection, this parameter stays the same; on refraction, it flips (true to false, false to true). Initially, all primary rays want to hit front faces of objects; back-facing collisions are ignored. Similarly, after the first refraction, a ray wants to hit the back face of a splat. This way, we can prevent same-surface collisions.
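Here is a minimal sketch of that bookkeeping (the names are mine, not the actual code):

struct Vec3 {
  float x, y, z;
  float dot(const Vec3& o) const { return x*o.x + y*o.y + z*o.z; }
};

struct Ray {
  Vec3 origin, dir;
  bool wantFrontFace;   // primary rays start with this set to true
};

// A splat is front-facing w.r.t. the ray if its normal opposes the
// ray direction; collisions of the unwanted kind are simply ignored.
bool acceptHit(const Ray& ray, const Vec3& splatNormal)
{
  bool frontFacing = splatNormal.dot(ray.dir) < 0.0f;
  return frontFacing == ray.wantFrontFace;
}

// On reflection the flag stays the same; on refraction it flips,
// because the ray should next leave through the other side.
void onRefract(Ray& ray) { ray.wantFrontFace = !ray.wantFrontFace; }

Here is an image illustrating the virtues of this method: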


Without Seamless raytracing



With Seamless raytracing

Tuesday, August 18, 2009

Eigen2 Debug mode

Eigen's main page claims "Disabling asserts, by defining -DNDEBUG or -DEIGEN_NO_DEBUG, improves performance in some cases."
What they mean is that we need to call g++ with the -DEIGEN_NO_DEBUG option.

In this post, we test the effect of disabling asserts on a few test scenes. The first scene is a 512x512 render of a refractive sphere with reflective Cornell-room walls. The second scene is the same room, but with 9 refractive spheres. The third scene is the second scene rendered at 1024x1024 with shadow rays. All scenes have 2 light sources.

Average running times with default settings (Eigen in debug mode) were 3.6 seconds for the first scene, 4.3 seconds for the second and 47 seconds for the third.



Disabling asserts reduces run times pretty drastically (approximately 2.4, 2.8 and 33 seconds for the three scenes described above): a clear 30%-plus performance improvement. Anyway, debugging with Eigen is a nightmare (way too much boilerplate to sift through), and I use my own vector library when doing serious debugging.

Octree boundary artifacts

I have traced the small artifacts that were popping up in the renders to issues caused by octree leaf borders. It turns out that an octree border cutting through a smooth surface causes a slice to appear on the surface, as the images below show. At level 10, there are many small octree nodes, and they cause a large number of slices to appear on the surface. At level 4, there are very few octree node borders, and therefore we see no artifacts in the render. A more serious side effect is that rays start 'leaking' through the surface at these border points, causing holes to appear on the surface. In animations, these holes pop in and out in a very distracting manner.


10 level octree (1.2 seconds to render)


8 level octree (1.3 seconds to render)


6 level octree (5 seconds to render)


4 level octree (80 seconds to render)


I don't have a solution to this problem yet, but I am thinking along the lines of adding splats to multiple octree leaves, or checking multiple nearby octree leaves while shading (rather than a single octree leaf).

Shadow Rays

Added a visibility function that works between any 2 points in space (inside or outside the octree).
This means that I can finally implement shadow rays. [Maybe even radiosity :) but that is for another day]
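Roughly how the visibility query plugs into shading; a sketch with hypothetical names, not the actual code:

#include <vector>

struct Vec3  { float x, y, z; };
struct Color { float r, g, b; };
struct PointLight { Vec3 position; Color intensity; };

// Hypothetical signature for the visibility function: true if the
// segment between the two points is unobstructed.
bool visible(const Vec3& a, const Vec3& b);

// Hypothetical diffuse term for one light.
Color diffuse(const Vec3& p, const Vec3& n, const PointLight& light);

// Direct lighting with shadow rays: a light contributes only if the
// surface point can actually see it.
Color shadePoint(const Vec3& p, const Vec3& n,
                 const std::vector<PointLight>& lights)
{
  Color result = {0.0f, 0.0f, 0.0f};
  for (size_t i = 0; i < lights.size(); ++i) {
    if (!visible(p, lights[i].position))
      continue;   // point is in shadow w.r.t. this light
    Color d = diffuse(p, n, lights[i]);
    result.r += d.r; result.g += d.g; result.b += d.b;
  }
  return result;
}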

But for the time being, here are the results of shadowing:


Single Light source



2 Light sources


Shadow rays can be computationally heavy, as I found out:


Not much time to spend today, so that's all for now.

Sunday, August 16, 2009

Reflection

Reflection, for which the functionality was already in place a few days ago, is now working. Here are a few test images. Enabling reflection on all the walls has caused a considerable performance hit: renders are now around 3 times slower.


Reflective Walls (Max Bounces: 30). Render time: 3.9 seconds


When I say max bounces are 30, it does not mean that all rays bounce 30 times; most rays don't even make it past 7 or 8 bounces. The reflection coefficient in these images is 0.8 for the walls, and the refraction coefficient for the spheres is also 0.8. Rays are traced only while the ray intensity (which starts off as 1.0) is at least 0.01. Each bounce reduces this 'intensity' value, thereby limiting the number of bounces. So, 30 is just a 'very big number'.
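A sketch of what this attenuation scheme might look like in code (the names and structure are mine, not the actual implementation):

struct Ray   { /* origin, direction, ... */ };
struct Color { float r, g, b; };

Color shadeHit(const Ray& ray);    // hypothetical local shading
Ray   bounceRay(const Ray& ray);   // hypothetical reflected/refracted ray

// Each bounce multiplies the intensity by the surface coefficient
// (0.8 here), so a ray starting at 1.0 is cut off after at most 21
// bounces (0.8^21 < 0.01), which is why 30 is just a 'very big number'.
Color trace(const Ray& ray, float intensity, int bouncesLeft)
{
  if (intensity < 0.01f || bouncesLeft <= 0)
    return Color{0.0f, 0.0f, 0.0f};
  Color local   = shadeHit(ray);
  Color bounced = trace(bounceRay(ray), intensity * 0.8f, bouncesLeft - 1);
  return Color{local.r + 0.8f * bounced.r,
               local.g + 0.8f * bounced.g,
               local.b + 0.8f * bounced.b};
}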


Reflective Walls (Max Bounces: 5). Render Time: 3.16 seconds


Here we see a limited-bounce result. With only 5 bounces, the reflections of the spheres lack detail, but we save some time this way.


Diff of 5 and 30 bounce images


Clearly, there is a considerable difference between the two images, but the better image also takes around 20-25% longer to render.


Reduced wall reflection coefficient to 0.4


Here is a more aesthetically pleasing version of the image, with reflection coefficient reduced to 0.4. The diffuse coloring of the walls is more pronounced than the reflections.

Finally, I rendered a high-resolution version of the image (4096x4096) and downsampled it (to 1024x1024) to achieve an anti-aliased image of considerable quality. It took an eternity to render (229 seconds), but looks much better than the 512x512 version.



One small note about the images: you may notice black lines running along the corners of the Cornell room. This is because of a 'minimum distance between bounces' threshold set in the code, which prevents rays from bouncing twice off the same surface. Sadly, it also has these side effects...

Another pending issue is stochastic reflection and refraction. Currently the system does not support surfaces that both reflect and refract, because a ray never spawns multiple secondary rays. To get this to work, the tracing engine has to trace multiple rays through each pixel; some of these rays should reflect, and some should refract, depending on the coefficients of reflection and refraction. Monte Carlo integration may help blend these results together nicely.
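For the record, a sketch of how that stochastic choice could look (an idea, not implemented code):

#include <cstdlib>

struct Ray { /* origin, direction, ... */ };

Ray reflectRay(const Ray& ray);   // hypothetical helpers
Ray refractRay(const Ray& ray);

// Each of the many rays traced through a pixel picks a single branch
// at random, weighted by the coefficients; averaging the per-pixel
// samples then blends reflection and refraction without ray trees.
Ray scatter(const Ray& ray, float kReflect, float kRefract)
{
  float u = (float)std::rand() / RAND_MAX * (kReflect + kRefract);
  return (u < kReflect) ? reflectRay(ray) : refractRay(ray);
}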

Next on the agenda is splat blending. The refraction results look extremely noisy, and will hopefully benefit from some blending. Also to do is supersampling support, after which we will want adaptive supersampling.

Eigen2 performance

This is a short post on how the average runtime of the raytracer is affected by using the Eigen2 library (with compiler options -O2, -msse, -msse2), as against rolling your own simple 3D float vector class. I have been using Vector3f from Eigen2, and have not paid any special attention to speed; I just use vectors wherever I can (positions and colors, mostly). I have also not bothered about alignment of structures (obviously Vector4f would align properly, but it would considerably increase memory requirements).

I was assuming that Eigen2 would not give me a significant speed boost, because raytracing is not a compute-heavy problem (it is mostly memory-access bound). It turns out that I was wrong:



There is almost a 2x performance gain when I use Eigen2. Very surprising really, because the amount of work it took me was zero: all I had to do was include the right headers. Maybe when I have more time, I will make a branch of the project with Vector4f and see how alignment affects things.

Specularity, Fresnel Terms


Specular Highlight on Opaque Sphere


Here I have used the Blinn-Phong approximation, which involves computing the half-vector: the sum of the light and view vectors, normalized. The dot product of this with the normal, raised to a power (the specular exponent), gives the value of the specular highlight at any point. The higher the power, the more concentrated the spot becomes.
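In code form, the half-vector computation is just this (a generic sketch, not my actual class interfaces):

#include <algorithm>
#include <cmath>

struct Vec3 {
  float x, y, z;
  float dot(const Vec3& o) const { return x*o.x + y*o.y + z*o.z; }
  Vec3 operator+(const Vec3& o) const { return {x+o.x, y+o.y, z+o.z}; }
};

Vec3 normalize(const Vec3& v)
{
  float len = std::sqrt(v.dot(v));
  return {v.x / len, v.y / len, v.z / len};
}

// Blinn-Phong: H = normalize(L + V); specular = (N.H)^exponent.
// n, toLight and toView are assumed to be unit vectors.
float blinnPhong(const Vec3& n, const Vec3& toLight, const Vec3& toView,
                 float exponent)
{
  Vec3 h = normalize(toLight + toView);
  float ndoth = std::max(0.0f, n.dot(h));
  return std::pow(ndoth, exponent);
}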


Specular highlight + refraction


You can see that the specular highlight has become slightly dull because the sphere is transparent. Some of the light gets transmitted and some is reflected. This is implemented as a weighted average of the reflected and refracted light, where the weights are the coefficients of transmittance and reflectance.


Constant transmittance on a green sphere


This image shows a sphere with a refractive index of 1.01 and a transmittance of 0.9 (90% of the light goes through). The amount of light transmitted is constant all over the sphere.


Fresnel transmittance


This image shows the same sphere, now rendered using the Fresnel equations for the reflection coefficient. The transmission coefficient is simply 1 minus the reflection coefficient. The net effect is that when light is incident perpendicular to a surface, more light gets transmitted than when the ray hits the surface almost parallel to it. You can see this in real life on a water surface: water viewed from a grazing angle is highly reflective, but when you look at it straight from the top, you can see right through it. In the image above, this causes the edges of the sphere to be visible, while the rest of the sphere is highly transparent. Increasing the refractive index makes the sphere less transparent.
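Schlick's approximation is a common stand-in for the full Fresnel equations, and captures the same grazing-angle behaviour; a sketch (I am not claiming this exact form is what the renderer uses):

#include <cmath>

// cosTheta is the cosine of the angle between the incoming ray and the
// surface normal; n1, n2 are the refractive indices on either side.
float fresnelReflectance(float cosTheta, float n1, float n2)
{
  float r0 = (n1 - n2) / (n1 + n2);
  r0 = r0 * r0;
  return r0 + (1.0f - r0) * std::pow(1.0f - cosTheta, 5.0f);
}
// Transmittance is then simply 1 - fresnelReflectance(...).

At cosTheta = 1 (head-on) this returns roughly r0, which is tiny for a refractive index of 1.01, and it climbs to 1 at grazing angles, matching the water-surface behaviour described above.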

Friday, August 14, 2009

Refractions and more

Added a few basic objects to the raytracer. Now I can create point models of planes, spheres and cylinders with one function call.
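As an illustration, here is what such a one-call generator for a sphere might look like (a sketch, not the actual implementation):

#include <cmath>
#include <cstdlib>
#include <vector>

struct Splat {
  float px, py, pz;   // centre
  float nx, ny, nz;   // normal
  float radius;
};

// Scatter n splats uniformly over a sphere of radius r centred at
// (cx, cy, cz), with normals pointing radially outward.
std::vector<Splat> makeSphere(float cx, float cy, float cz,
                              float r, int n, float splatRadius)
{
  std::vector<Splat> splats;
  splats.reserve(n);
  for (int i = 0; i < n; ++i) {
    // Uniform direction on the sphere: uniform z, uniform azimuth.
    float z   = 2.0f * std::rand() / RAND_MAX - 1.0f;
    float phi = 6.2831853f * std::rand() / RAND_MAX;
    float s   = std::sqrt(1.0f - z * z);
    float nx = s * std::cos(phi), ny = s * std::sin(phi), nz = z;
    splats.push_back({cx + r * nx, cy + r * ny, cz + r * nz,
                      nx, ny, nz, splatRadius});
  }
  return splats;
}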


Diffuse Sphere


But more importantly, refraction finally works (I am guessing reflection will work too... but it is not tested yet). There is considerable aliasing right now.

Here are a few images of the refraction system at work:

9 spheres (1.5 seconds)



Intersecting spheres (2 seconds)



High resolution version (2000x2000, 12.38 seconds)


There are a few issues right now. One is that a few rays seem to 'leak out' through the walls, due to which you can see black dots. The other concerns rays that move parallel to the octree leaf walls: such rays cannot move forward, because we cannot find their point of intersection with the octree. Adding a check for such rays only slows down the entire process, because they are very rare (in the images shown, they turn up as bright white spots). I am still looking for a solution to this issue.

In general, it seems that raytracing is quite hackish too... i.e. a lot of messy tweaks are needed to get things working. For example, total internal reflection inside a sphere never terminates. I have a hack which prevents TIR completely, and instead allows the ray to pass through as though nothing ever happened. Then there is the issue of rays hitting nearby overlapping splats: another hack, which checks for a minimum distance between collisions, now lets me avoid the 50-odd collisions that a ray would otherwise undergo while moving through a sphere.

Thursday, August 13, 2009

Diffuse Lighting


Cornell Room, with 320000 points (0.6 seconds)


It took some major code refactoring, but I eventually managed to add diffuse lighting to the raytracer.
This includes some major additions, like a new Material class, which is now used instead of per-splat color values. The Material class has support for ambient, diffuse and specular components, transparency, etc...

Of course, I also had to add a new light class (PointLight). This is more like a simple struct really.

Then there were a few other changes to incorporate shading. The Material class accepts a light, a position and a normal vector, and returns the shaded value. The base class of the entire model, which I call SceneBase, contains a list of pointers to materials and lights; pointers, because then I can easily modify the materials and lights from outside, and things will change dynamically.
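Roughly, the interfaces look like this (a sketch; the exact fields and names are my guesses):

struct Vec3  { float x, y, z; };
struct Color { float r, g, b; };

// PointLight really is "more like a simple struct".
struct PointLight {
  Vec3  position;
  Color intensity;
};

// Sketch of the Material interface described above: given a light, a
// position and a normal, it returns the shaded value.
class Material {
public:
  Color ambient, diffuse, specular;
  float shininess;
  float transparency;

  Color shade(const PointLight& light,
              const Vec3& position, const Vec3& normal) const;
};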

Every other scene class derives from SceneBase. The Octree class has been moved into a container which I call OctreeScene; this gives me an easier way of keeping track of the root node than creating a global variable called root.

phew... that's a lot of messing around with code to get things working neatly...

Tuesday, August 11, 2009

Octree Traversal

This post describes a least-common-ancestor based octree traversal algorithm.

Some definitions:

Let us define the concept of a 'wall owner'. A node is said to "own" a wall if the wall was created when the node was first split. Formally:

A node has 6 walls surrounding it. Let us call these the "external walls" of the node.
When we split a node "n" into 8 parts, we create 3 new walls inside this node. These walls are "internal walls", and the node "n" is said to be their owner.

Nobody owns the external walls of the root, i.e. their owner is NULL.

For each child node, three of its walls are owned by the parent (they are the parent's internal walls).
The remaining 3 walls are external walls of the parent. Who owns those? We find out by asking the parent.

When a ray originates in a leaf node of the octree, we can trace it as follows:

Let the ray hit a wall "W". We know who the owner of this wall is.
We find the intersection point of the ray on wall W, and advance the ray slightly (by a delta increment), so that the frontier of the ray is now just outside the node.
We now ask the owner of W to tell us which leaf node contains this point.
[This is the improvement that least common ancestors give us. Otherwise, we would have to traverse from the root node each time.]
Now we just say that the ray originates in this new leaf node, and continue the process.
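Putting these steps together, here is a sketch of a single traversal step (all the names are mine, not the actual code):

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };

struct Node;
struct Wall { Node* owner; };   // NULL for the external walls of the root

struct Node {
  Node* leafContaining(const Vec3& p);   // descend to the leaf holding p
  // ...
};

Wall* exitWall(const Ray& ray, Node* leaf);   // hypothetical helpers
Vec3  hitPoint(const Ray& ray, Wall* w);

const float DELTA = 1e-4f;

// One traversal step: from the current leaf to the leaf the ray enters next.
Node* nextLeaf(const Ray& ray, Node* leaf, Node* root)
{
  Wall* w = exitWall(ray, leaf);
  Vec3  p = hitPoint(ray, w);
  p.x += ray.dir.x * DELTA;   // nudge the point just past the wall
  p.y += ray.dir.y * DELTA;
  p.z += ray.dir.z * DELTA;
  if (w->owner == NULL)
    return NULL;              // external wall of the root: ray leaves the octree
  Node* next = w->owner->leafContaining(p);   // descend from the LCA only
  if (next == NULL)
    next = root->leafContaining(p);   // see "A fly in the algorithm" below
  return next;
}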

This is like a 3D line-drawing algorithm for an octree. Note that it also works for other non-uniform space-partitioning techniques like kd-trees (there, we have only 2 children per node, and hence only one internal wall per node).

It increases the space requirement by 6 pointers per node. We can reduce this to 3 pointers per node + 3 bits (identifying the child's position within its parent: 0 to 7), because the 3 internal walls need not be stored; the parent is trivially their owner.
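In struct form, the two storage layouts might look like this (a sketch; field names are mine):

struct Node;

// Before: 6 owner pointers per node (24 bytes on a 32-bit machine).
struct WallsFull {
  Node* owner[6];       // owner of each of the 6 external walls
};

// After: the 3 walls shared with the parent's internal walls need no
// pointer (the parent trivially owns them); 3 bits identify which
// octant the node occupies, and hence which 3 walls remain.
struct WallsCompact {
  Node* owner[3];       // owners of the other 3 external walls
  unsigned char octant; // 0..7, the child's position within its parent
};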

Relation to Naive Octree Tracing:
The algorithm described above can be converted into the naive algorithm by simply setting the root as the owner of all walls. This means that every time we move from one leaf to a neighbouring leaf, we must traverse the octree from the root node all the way down. If the depth of the octree is large, this can be a significant overhead.

What this new algorithm brings to the table is the promise of reduced octree traversal overhead. When we move from one node at level 10 to another node at level 10, there is very little chance of traversing all the way from the root node; we may just go from a level 7 or level 8 parent back down to level 10. In my experiments with max depth = 10, I have observed a 10% to 20% speedup using this algorithm (see the table in the previous post).

Memory issue:

The first and most obvious drawback is the extra memory usage. For my test scene (an empty Cornell room) consisting of 320,000 points, with a setting of maxdepth = 10 and maximum points per leaf = 10, 309,623 nodes were created overall. This means that 0.3 million nodes * 6 walls per node * 4 bytes per wall = 7.2 MB of additional memory was used. The total memory requirement of each octree node is currently 100 bytes (a lot of optimization is in order), of which 24 bytes are due to the walls: almost a 25% increase in memory requirement. We can reduce this to around 13% using the optimization of storing only 3 wall pointers (as described above), in which case the 10-20% speed improvement comes at the cost of a 13% increase in memory, which may be justified. Of course, using only 3 pointers adds conditional statements and makes the solution a little inelegant, but it just may be worth it. Further tests with larger datasets may swing the argument in favour of this new algorithm.

A fly in the algorithm:
In some situations, the algorithm described above skips a few nodes along the traversal path. How can this happen? Here is a diagram explaining the issue:



This image shows one step in the traversal of the octree. We hit the wall of a node (the green circle denotes the point of intersection), and now want to know which node to traverse next.
To do this, we move a little ahead along the ray, so that we get a point just outside this wall. We then ask the parent of this node which node contains the point. In this case, the parent says that the point no longer lies inside it, and so returns null. What we should do in this case is ask the root node where the point is. We could also go step by step upwards, asking the parent of the parent and so on, but I figured we would just waste time that way.

Sunday, August 9, 2009

Raytracing at last...

(Red Plane: 10x10 Splats)


(Red Plane: 1600x1600 Splats)


My CPU ray tracer finally works. Um... well, it only traces primary rays right now, and it is single threaded... and there is no filtering... OK, OK, it's not that big a deal, but I will blog about it anyway :P

The ray tracer is for point-based scenes: the only primitives we deal with are splats (points with radii, colour and normal vectors). The acceleration structure is an octree, but I feel the traversal algorithms we use right now can be extended to kd-trees as well. This is my first working ray tracer. It will serve as a precursor to a faster, cooler ray tracer implemented entirely on the GPU, and it also provides me with a test bed to debug ideas before implementing them in CUDA.

The octree traversal algorithm is one I came up with last year while doing a quad-tree traversal assignment. It involves storing, at each wall in the tree, a pointer to the nearest common ancestor. Some modifications were required to implement a sparse octree, where empty nodes are not stored at all, but all in all, it was not too messy to implement.

The first thing I noticed about the implementation is that it is still a bit buggy. In the rendering with 1600x1600 points, there are small holes in the plane where rays escape into the background. This turns out to be a problem caused by the discretization of space that the octree produces: some splats have their center in one node, but their radius makes them 'spill over' into nearby nodes. The current octree construction algorithm does not track this, due to which some splats are missed.

The next thing, which is the cool part, is that the code truly seems to run in log(objects) time. The run times for various scene sizes (512x512 render window, no supersampling):

Points      Render Time   Octree Creation Time   Memory   Naive Render Time
10x10       0.39          ---                    5        0.42
100x100     0.52          0.02                   6        0.58
200x200     0.54          0.04                   9        0.6
400x400     0.59          0.18                   21       0.63
800x800     0.65          0.82                   71       0.7
1600x1600   0.7           3.5                    271      0.76
5000x5000   0.85          37                     1700     ??

(Times in seconds, memory in megabytes)

The timings in the last column refer to the naive algorithm of traversing the tree from the root each time, without using the least-common-ancestor idea. The benefit turns out to be very small. Maybe further optimizations are in order...

Of course, there is the memory issue: it takes around 270 megabytes for a scene with 2.56 million splats in it. Octree generation also seems to take more time than the actual raytrace for such large scenes. (We will not go into the details of how much swapping I experienced with the 25-million-point scene. I am never running that again.)

What remains now is to add the usual raytracing frills: secondary rays, shading, supersampling, etc...

"As fast as GLSL"(TM)

Yes, it's been a looong time.
So, let me get started with the first item on the menu. I promised to write a "SIMD" version of the diffraction code. It turns out that CUDA GPUs do not do 4-wide vector operations (according to a reasonably reliable source). I myself have found no reference to a vectorized ALU in the GPU, and all references to SIMD on GPUs point to warps, which are, in some sense, single-instruction-multiple-data units. So I now assume that we can only do scalar operations on each shader unit.

We can nevertheless try to squeeze the requisite performance out of CUDA. The GLSL implementation of diffraction through a 256x256 diffraction grating runs in around 5 seconds. It turns out that if we do a single float4 texture read in each iteration of the kernel, and write one float4 at a time to the output buffer, we can save quite some time in the kernel.

The earlier implementations took, at best, 16.5 seconds to complete the process. The new implementation takes just 5.6 seconds, of which the actual system time is just 4.5 seconds.

Pitfalls:

It turns out that not all is well on the CUDA compiler front; I ran into multiple bugs while implementing this last optimization.
Firstly, I was unable to pass a particular float argument from the CPU code through the kernel and into 2 device functions. It simply kept mucking up every time. I finally gave up and hard-coded that value into the kernel. Later, I found that the bug was gone in CUDA 2.3. I am not really sure what the bug was, but it may have been related to the next issue.

The CUDA compiler seems to really mess up if you do not EXPLICITLY specify each floating-point constant as single precision (0.65f instead of 0.65). It once in a while throws a warning about 'double' not being supported, and you won't even know where, because the line number refers to the ptx file. Heed this warning and remove all doubles from your code (if you are using a G80-G92 part, that is). These innocuous warnings actually come with drastic side effects.
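To illustrate the kind of constant in question (attenuate is a made-up example, not from the diffraction code):

// On G80-G92 hardware an unsuffixed constant is a double, which the
// compiler silently demotes (with only that vague warning).
__device__ float attenuate_bad(float x)
{
  return x * 0.65;    // double-precision constant: asking for trouble
}

__device__ float attenuate_good(float x)
{
  return x * 0.65f;   // single-precision literal
}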

For example, I had defined a pure function with a few floating-point constants inside it. A single invocation of this function per kernel iteration would work fine, but 2 or 3 invocations would result in an abrupt return from the kernel, so no processing would be done at all.

Here is the kernel code (brought to you by the magic of vim and :TOhtml):




__global__ void gpu_emit_photons(float4* buffer, int inputRow, int inputCol, float pitch)
{
  // Assumed declared elsewhere: texRef1 (a texture of random offsets),
  // texRef2 (the input intensity texture), the device function
  // wdf_analytical(), and the constants screenSize, posMax,
  // screenDistance, BUCKET_SIZE and SIZE.
  const int tid_x = threadIdx.x+blockDim.x*blockIdx.x;//destination col
  const int tid_y = threadIdx.y+blockDim.y*blockIdx.y;//destination row

  float srcPosX, srcPosY, destPosX, destPosY, deltaPosX, deltaPosY;
  const float scrSize_2_inv = 2.0f/screenSize;
  const float lambdaRed = 0.72f, lambdaGreen = 0.53f, lambdaBlue = 0.47f;
  float theta, phi, uRed, vRed, uGreen, vGreen, uBlue, vBlue;

  int i,j;

  float4 intensity;
  float tempIntensityR = 0.0f, tempIntensityG = 0.0f, tempIntensityB = 0.0f;

  // Accumulate contributions from one BUCKET_SIZE x BUCKET_SIZE tile of
  // source samples onto this thread's destination pixel.
  for (i=inputRow; i<inputRow+BUCKET_SIZE; i++){
    for (j=inputCol; j<inputCol+BUCKET_SIZE; j++){
      // Per-sample random offsets, fetched from the random texture.
      float rand1= -tex2D(texRef1, i, j);
      float rand2= -tex2D(texRef1, i, j+1);

      // The single float4 texture read per iteration mentioned above.
      intensity = tex2D(texRef2, (float)j+rand1, (float)i+rand2);
      srcPosX=posMax*(1.0f- ((float)i+rand1)*scrSize_2_inv);
      srcPosY=posMax*(1.0f- ((float)j+rand2)*scrSize_2_inv);

      destPosX=posMax*(1.0f-(float)tid_x*scrSize_2_inv);
      destPosY=posMax*(1.0f-(float)tid_y*scrSize_2_inv);

      //deltaPos is the difference between x coordinates of src and dest
      deltaPosX=srcPosX-destPosX;
      deltaPosY=srcPosY-destPosY;

      // Angles from source sample to destination pixel.
      theta = atan2( deltaPosX,screenDistance);
      phi = atan2( deltaPosY,screenDistance);

      uRed = theta/lambdaRed; vRed = phi/lambdaRed;
      uGreen = theta/lambdaGreen; vGreen = phi/lambdaGreen;
      uBlue = theta/lambdaBlue; vBlue = phi/lambdaBlue;

      //Assume light is incident perpendicularly onto the optical element;
      //In the general case, 'u' becomes u-u_input and 'v' becomes v-v_input, where u_input and v_input are angles of incidence

      // One WDF evaluation per wavelength (see the note below).
      tempIntensityR += wdf_analytical(srcPosX,srcPosY,uRed,vRed,lambdaRed,pitch) * intensity.x;
      tempIntensityG += wdf_analytical(srcPosX,srcPosY,uGreen,vGreen,lambdaGreen,pitch) * intensity.y;
      tempIntensityB += wdf_analytical(srcPosX,srcPosY,uBlue,vBlue,lambdaBlue,pitch) * intensity.z;
    }
  }
  // The single float4 write per thread to the output buffer.
  buffer[tid_y + tid_x*SIZE] = buffer[tid_y + tid_x*SIZE] + make_float4(tempIntensityR,tempIntensityG,tempIntensityB,0);
}



In the above code, the function call of interest is wdf_analytical(...). This function evaluates the Wigner distribution value at the specified x, y, theta, phi coordinates; it is like a 4D lookup table. We need to invoke this function once for each wavelength of light that we want to handle (here: red, green and blue). It turns out that the simple double-to-float warning had some unforeseen ramifications, crashing my code when the function was called more than once. For a single invocation, it worked like a charm.

Well, now of course, everything is in order, and the code runs quite fast. The time for optimizations has almost come to an end. One further optimization I should try before closing the topic, however, is the Leapfrog Lagged Fibonacci Generator (LLFG), a random number generator that a bunch of clever guys at my lab have coded up in CUDA. They claim it is the world's fastest random number generator (faster than the Mersenne Twister on CUDA). It may give good performance benefits versus using a random texture. Only (a lot of) time will tell...