Rendering

Motivation

Now that we have obtained a 3D structure for the object and a dynamic texture, we would like to generate photo-realistic renderings of the object from new viewpoints. The process is conceptually simple but requires a vast amount of arithmetic. Since we needed the rendering to run in real time on consumer-grade PCs, we have developed efficient rendering methods using both hardware and software.

Theory & Implementation

For each new view of the object, we first need to obtain the appropriate coefficients for that view before generating the texture. These coefficients depend on the viewing pose of the object being rendered, so they are obtained by interpolating the coefficients of the known views (the views that were used to capture the object and obtain the texture) that are nearest neighbors to the current view.
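The write-up does not specify the exact interpolation scheme, so the following is only a minimal sketch, assuming an inverse-distance weighting over the nearest captured views and a hypothetical pose parameterization; the type and function names are illustrative, not the system's actual interface.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Hypothetical types and names -- not the system's actual interface.
    struct KnownView {
        std::vector<float> pose;    // captured viewing pose (parameterization assumed)
        std::vector<float> coeffs;  // texture coefficients recovered for this view
    };

    // Distance between two poses; plain Euclidean distance, purely for illustration.
    static float view_distance(const std::vector<float>& a, const std::vector<float>& b) {
        float d2 = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(d2);
    }

    // Inverse-distance weighted average of the coefficients of the K nearest views.
    std::vector<float> interpolate_coeffs(const std::vector<KnownView>& views,
                                          const std::vector<float>& pose, size_t K) {
        std::vector<std::pair<float, size_t>> ranked;                // (distance, view index)
        for (size_t i = 0; i < views.size(); ++i)
            ranked.push_back({view_distance(views[i].pose, pose), i});
        std::sort(ranked.begin(), ranked.end());

        std::vector<float> out(views[0].coeffs.size(), 0.0f);
        float wsum = 0.0f;
        for (size_t n = 0; n < K && n < ranked.size(); ++n) {
            float w = 1.0f / (ranked[n].first + 1e-6f);              // avoid divide-by-zero
            wsum += w;
            const std::vector<float>& c = views[ranked[n].second].coeffs;
            for (size_t j = 0; j < out.size(); ++j) out[j] += w * c[j];
        }
        for (float& c : out) c /= wsum;
        return out;
    }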

With the coefficients determined, we generate the dynamic texture by multiplying each eigenvector by its coefficient and summing the results. Finally, texture mapping is used to map the texture onto the 3D model.
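In symbols (our notation, not from the original write-up): if E_1, ..., E_k are the eigenvectors, c_1, ..., c_k the view-dependent coefficients, and E_0 the mean image discussed in the hardware section below, the generated texture is

    T = E_0 + c_1*E_1 + c_2*E_2 + ... + c_k*E_k

evaluated independently at every pixel and in every channel.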

Of these steps, the generation of the dynamic texture is by far the most expensive, so efficient methods for it are discussed in detail below for both hardware and software.

Hardware

To render the dynamic texture in hardware we use the texture-blending features available on most consumer graphics cards. Texture blending can be performed either in a single pass or in multiple passes. Older hardware only supports multi-pass blending, which is the simplest: some geometry is first drawn with one texture, then the blending function (add/subtract/...) and coefficients are set, and the entire geometry is drawn again with a second texture. The results are blended together according to the blending function and the coefficients. The required constant-coefficient blending modes can be accessed through the optional OpenGL imaging subset, which is supported by many cards.

Single-pass blending means that multiple textures are loaded simultaneously, along with the parameters describing how to combine them, and the geometry is rendered only once. The textures are combined and rendered at the same time, which saves the rendering pipeline from processing all of the geometric data more than once (camera transforms, clipping, etc.); this can be expensive if the geometry is detailed. Basic single-pass blending is available through the OpenGL multitexturing extension, but to access all the blending features of current graphics cards we have to use proprietary OpenGL extensions created separately by each vendor: ATI has its fragment shader extension, and NVIDIA has the register combiner and texture shader extensions. The number of textures that can be processed in a single pass is hardware dependent (an NVIDIA GeForce 3/4 can access 4 textures at once, an ATI Radeon 9700 can access 8, and the GeForce FX can access 16).

We will first discuss a general multi-pass implementation; since we may need to blend up to 100 textures, we will then show how this can be improved by taking advantage of the underlying hardware. The basic idea is to successively blend each eigenvector, scaled by its coefficient, into a texture. The rendering hardware is designed for textures containing only positive values, while the basis is a signed quantity. Thus, to blend the textures we add the positive quantities and then subtract the absolute values of the negative quantities. This two-step method produces the same results as if the hardware could handle signed quantities directly.
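A minimal sketch of the multi-pass blend, assuming the imaging-subset entry points (glBlendColor, glBlendEquation) are exposed by the driver and headers, and assuming the mean image has already been drawn; eigen_texture, coeff and draw_geometry are placeholder names, not the system's actual code.

    #include <GL/gl.h>
    #include <math.h>

    /* Placeholder declarations -- these stand in for the real renderer's state. */
    extern GLuint eigen_texture[];   /* one texture per eigenvector */
    extern float  coeff[];           /* interpolated coefficients for this view */
    extern void   draw_geometry(void);

    void blend_eigenvectors_multipass(int num_eigenvectors)
    {
        glEnable(GL_BLEND);
        for (int i = 0; i < num_eigenvectors; ++i) {
            float a = fabsf(coeff[i]);
            glBindTexture(GL_TEXTURE_2D, eigen_texture[i]);
            glBlendColor(a, a, a, 1.0f);                 /* per-pass coefficient */
            glBlendFunc(GL_CONSTANT_COLOR, GL_ONE);      /* scale texture by |c_i|, keep framebuffer */
            glBlendEquation(coeff[i] >= 0.0f
                            ? GL_FUNC_ADD                /* framebuffer + |c_i| * texture */
                            : GL_FUNC_REVERSE_SUBTRACT); /* framebuffer - |c_i| * texture */
            draw_geometry();                             /* the geometry is re-drawn every pass */
        }
    }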

The first improvement to this algorithm is to use single-pass texture blending in the inner loop so that as many textures as possible are handled in each pass. Since we use NVIDIA register combiners for this, the textures can also be stored in a signed format, reducing the texture memory required by 50%. On top of that, the register combiners can convert between color spaces while blending textures. This means that we can store the Y, U and V eigenvectors separately, spending more memory and rendering time on the more important Y channel and less on the U and V channels.

In the improved algorithm, on each pass we load 4 Y eigenvectors into one RGBA texture in each available texture unit (we are using GeForce 3s, so there are 4 units), load the 4 corresponding coefficients for each texture, and use the register combiners to multiply each eigenvector by its coefficient and sum the results. Before the sum from the current pass is added to the contents of the framebuffer, it is multiplied by a YUV-to-RGB conversion matrix. These changes let us process 16 eigenvectors per pass (with GeForce 3s), so the loop executes 1/16 as many times as in strict multi-pass blending.
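For reference, the YUV-to-RGB conversion is a fixed 3x3 matrix applied per pixel. The write-up does not give the exact constants, but the standard ITU-R BT.601 values (with U and V centered at zero) are:

    R = Y + 1.402 * V
    G = Y - 0.344 * U - 0.714 * V
    B = Y + 1.772 * U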

Some recent cards support floating-point textures and framebuffers, but most cards require the textures and framebuffer to be stored as bytes, so any intermediate result of the blending (addition or subtraction) is clamped to the range 0-255. This is problematic because, while the resulting image is guaranteed to be within this range, some of the intermediate results may fall outside it, producing pixel errors. The problem cannot be solved completely, but by first drawing the mean image and then alternately adding and subtracting, intermediate overflow is avoided in most cases.
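A small sketch of one way to realize this ordering (draw_mean_image and blend_pass are placeholder names; the real renderer's pass order may differ):

    #include <vector>

    void draw_mean_image();   // placeholder: draws the textured model with the mean image
    void blend_pass(int i);   // placeholder: one add or subtract pass for eigenvector i

    // Draw the mean first, then interleave positive and negative terms so the
    // intermediate framebuffer values stay close to the final image and rarely
    // clamp at 0 or 255.
    void draw_interleaved(const std::vector<float>& coeff)
    {
        draw_mean_image();
        std::vector<int> pos, neg;
        for (int i = 0; i < (int)coeff.size(); ++i)
            (coeff[i] >= 0.0f ? pos : neg).push_back(i);

        size_t p = 0, n = 0;
        while (p < pos.size() || n < neg.size()) {
            if (p < pos.size()) blend_pass(pos[p++]);   // adds |c_i| * E_i
            if (n < neg.size()) blend_pass(neg[n++]);   // subtracts |c_i| * E_i
        }
    }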

Software

Generating a texture from the basis images in software can be done naively by looping through each basis image, multiplying each pixel in the image by the corresponding coefficient and adding the result to an accumulation texture. (Note: in general this must be done on all channels, but for simplicity we discuss the optimization on one channel.) For k basis images of size m*n, this requires m*n*k multiplications and m*n*(k-1) additions. Although there is no way of reducing the inherent complexity of the problem, the process can be sped up using features available on almost all modern CPUs. Single Instruction Multiple Data (SIMD) instructions allow the CPU to perform one operation on multiple data items, and can be used to increase the performance of algorithms in many domains.

The MMX instruction set extension was added to the Intel Pentium processor in 1997 and has since been incorporated into subsequent Intel processors as well as AMD processors. It operates on packed-byte, packed-word, packed-doubleword and quadword operands held in eight 64-bit registers aliased onto the floating-point registers. The instruction set includes arithmetic, comparison and logical operations on these types, as well as instructions for packing and unpacking data into the desired format. Although other SIMD extensions exist, such as 3DNow! (AMD), SSE (Pentium 3) and SSE2 (Pentium 4), our implementation only requires the manipulation of bytes, so the MMX instruction set is sufficient (not to mention available on most CPUs).
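The naive loop looks roughly like the following (one channel, floats used for clarity even though the real renderer works on bytes):

    // Naive software blend for one channel: k multiplications and k-1 additions
    // per pixel, i.e. m*n*k and m*n*(k-1) in total.
    void blend_naive(const float* const* basis,  // basis[i]: i-th basis image, m*n pixels
                     const float* coeff,         // coeff[i]: coefficient for basis[i]
                     int k, int m, int n,
                     float* out)                  // accumulation texture, m*n pixels
    {
        const int num_pixels = m * n;
        // Initialise the accumulation texture with the first scaled basis image.
        for (int p = 0; p < num_pixels; ++p)
            out[p] = coeff[0] * basis[0][p];
        // Loop through the remaining basis images, scaling and accumulating.
        for (int i = 1; i < k; ++i)
            for (int p = 0; p < num_pixels; ++p)
                out[p] += coeff[i] * basis[i][p];
    }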

By making use of SIMD instructions, we can reduce the number of instructions needed to generate the dynamic textures, increasing the frame rate or allowing for more complex scenes. There are two natural ways in which the instruction set can be used to optimize the algorithm:
  1. Use a parallel multiply to perform 4 coefficient multiplications at once, then a parallel add to accumulate the 4 products. More specifically, initialize an accumulation texture to zeros. For each image, maintain a pointer to the current image pixel, load 4 consecutive pixels into an MMX register (say mm0), multiply them in parallel by the image's coefficient, and parallel-add the 4 products into the corresponding position in the accumulation texture. The pixel pointer is then advanced by 4.
  2. Use the multiply-and-add-packed-words (pmaddwd) instruction to perform the multiplication and addition of two pixels from two images in a single instruction. This algorithm is a little harder to understand conceptually, but the general idea is shown in the illustration below, where p(i,j) refers to the j-th pixel of image i (a code sketch follows the illustration). By using two of these operations we can add four pixels from two images to the accumulation texture, so the inner loop advances by four pixels and the outer loop by two images.
    Register 0 (coefficients): c(i)        c(i+1)       c(i)        c(i+1)
    Register 1 (pixels):       p(i,j)      p(i+1,j)     p(i,j+1)    p(i+1,j+1)
    Result (after pmaddwd):    c(i)*p(i,j) + c(i+1)*p(i+1,j)   |   c(i)*p(i,j+1) + c(i+1)*p(i+1,j+1)
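A minimal sketch of the second approach, written with the MMX intrinsics from mmintrin.h rather than inline assembly. Assumptions not stated above: the coefficients have been pre-scaled to signed 16-bit fixed point, the accumulation texture holds 32-bit sums that are rescaled and clamped to bytes after all images have been processed, and the buffers are suitably aligned.

    #include <mmintrin.h>

    // Blend one row segment of two basis images (i and i+1) into a 32-bit
    // accumulation texture using pmaddwd, four pixels per iteration.
    void blend_two_images_mmx(const unsigned char* img_a,   // basis image i
                              const unsigned char* img_b,   // basis image i+1
                              short coeff_a, short coeff_b, // fixed-point c(i), c(i+1)
                              int* accum,                   // 32-bit accumulation texture
                              int num_pixels)               // assumed a multiple of 4
    {
        const __m64 zero   = _mm_setzero_si64();
        // Register 0 from the illustration: (c(i), c(i+1), c(i), c(i+1)) as words.
        const __m64 coeffs = _mm_set_pi16(coeff_b, coeff_a, coeff_b, coeff_a);

        for (int j = 0; j < num_pixels; j += 4) {
            // Load 4 pixels from each image and interleave them byte-wise:
            // (a0, b0, a1, b1, a2, b2, a3, b3).
            __m64 a  = _mm_cvtsi32_si64(*(const int*)(img_a + j));
            __m64 b  = _mm_cvtsi32_si64(*(const int*)(img_b + j));
            __m64 ab = _mm_unpacklo_pi8(a, b);

            // Zero-extend to words: Register 1 = (a0, b0, a1, b1), then (a2, b2, a3, b3).
            __m64 lo = _mm_unpacklo_pi8(ab, zero);
            __m64 hi = _mm_unpackhi_pi8(ab, zero);

            // pmaddwd: each 32-bit result is c(i)*a + c(i+1)*b for one pixel.
            __m64 sum_lo = _mm_madd_pi16(lo, coeffs);
            __m64 sum_hi = _mm_madd_pi16(hi, coeffs);

            // Accumulate the four 32-bit sums for pixels j .. j+3.
            __m64* acc = (__m64*)(accum + j);
            acc[0] = _mm_add_pi32(acc[0], sum_lo);
            acc[1] = _mm_add_pi32(acc[1], sum_hi);
        }
        _mm_empty();  // leave MMX state so the FPU can be used again
    }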

Color conversion using MMX

Since the objects are currently stored in YUV format, a color conversion is also needed once the dynamic texture has been generated. This conversion is essentially a large number of small matrix multiplications which, like the texture blending, can be done efficiently using MMX. Intel has a very good example explaining exactly how this can be done.
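As a rough illustration of the per-pixel arithmetic being parallelized (this is not Intel's code; it uses the standard BT.601 constants rounded to 8.8 fixed point, matching the matrix given in the hardware section):

    // Scalar reference for the YUV -> RGB conversion that MMX parallelizes.
    static unsigned char clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (unsigned char)v); }

    void yuv_to_rgb(unsigned char y, unsigned char u, unsigned char v,
                    unsigned char* r, unsigned char* g, unsigned char* b)
    {
        int c = y, d = u - 128, e = v - 128;              // centre U and V around zero
        *r = clamp255(c + ((359 * e) >> 8));              // + 1.402 * V
        *g = clamp255(c - ((88 * d + 183 * e) >> 8));     // - 0.344 * U - 0.714 * V
        *b = clamp255(c + ((454 * d) >> 8));              // + 1.772 * U
    }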

Experiments

All of the experiments were performed with the rendering system available on the website, compiled using g++ at optimization level 3 (-O3).

House data set (frames per second; MMX 1 and MMX 2 are the two SIMD approaches described above)

    Method      Intel 2.4 GHz, 1024 MB, GeForce 4    AMD 2 @ 1.5 GHz, 512 MB, GeForce 3
    No MMX       17 fps                               28 fps
    MMX 1        85 fps                               37 fps
    MMX 2        70 fps                               47 fps
    Hardware    125 fps                               46 fps