Tracking & video

Motivation

The IEEE 1394 (FireWire) interface provides low-cost, high-bandwidth access to computer peripherals and is widely available in today's PC market. This bus allows us to receive reasonable-quality video from a video source in real time, which can then be used as input to the rest of the capture system. Not only do these video frames provide texture for our objects, they also provide the information necessary to generate the objects' structure. This structure is obtained from the images by establishing point correspondences between them. Rather than having a user establish these correspondences directly, visual tracking is a semi-automatic process that provides the point correspondences in real time by exploiting the temporal coherence of the input images, so the entire video never needs to be stored on disk.

Theory

Given the location and dimensions of a reference patch (in our case a user-defined rectangle) in the initial frame, which we will call the template, the visual tracking process seeks to find the location of this patch in the following frames. If we assume that the motion of the patch can only be a translation in the x or y direction, then one could imagine overlaying the reference template at the position reached by each possible one-pixel movement (left, right, up, down, and each of the four diagonal directions), determining which region of the current frame most closely resembles the reference patch, and choosing that as the new location of the patch. This is a simplified version of what the trackers actually do; below you will find both a more in-depth description and a brief mathematical formulation of the problem.
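To make this brute-force picture concrete, here is a minimal sketch in Python/NumPy (not the code we use; the array names frame and template and the top-left corner (x, y) are illustrative) that evaluates the sum of squared differences at the eight one-pixel shifts plus no movement and keeps the best one:

```python
import numpy as np

def ssd_neighbor_search(frame, template, x, y):
    """Test the eight one-pixel shifts of the patch (plus staying put) and
    return the top-left corner whose sub-image best matches the template
    in the sum-of-squared-differences (SSD) sense."""
    h, w = template.shape
    best_pos, best_ssd = (x, y), np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            patch = frame[y + dy:y + dy + h, x + dx:x + dx + w].astype(float)
            ssd = float(np.sum((patch - template) ** 2))
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (x + dx, y + dy)
    return best_pos
```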

Figure: the reference template and its partial derivative images with respect to x translation, y translation, rotation, scale, aspect ratio, and shear.

As above, assume that the possible motion is only translation (2 degrees of freedom). Again, we know the location and dimensions of the template in the initial frame and would like to know where the patch is located in the following frames. Instead of searching the possible locations in the current frame around the position from the previous frame (as the simplified version above does), the trackers use the information at the previous position in the current frame. To do this, the tracking procedure must compute the partial derivatives of the patch with respect to both directions of translation (the figure above shows these partial derivatives for all 6 degrees of freedom, with only x and y applicable to our current discussion). The tracker then takes the position from the last frame and subtracts the reference template from the image at that location. The result of this subtraction resembles some combination of the partial derivative images. By finding the linear combination of the derivatives that best represents this difference, we can determine the change in pose between frames. This can be formulated as a set of linear equations that is solved in the least-squares sense.

The visual tracking system uses a dynamic-system implementation of a sum-of-squared-differences (SSD) patch tracker, which computes the image-plane projections of a set of fiducial points (here vectors are written bold, matrices are capitalized, and a bold capital indicates an image flattened into a column vector). Given a small reference patch $\mathbf{T}$, the current image frame, and an estimate $\mathbf{x}_{t-1}$ of the patch's approximate location (its location in the last frame), the patch's new position is calculated by solving for a correction $\delta\mathbf{x}$ and updating the current state $\mathbf{x}_t$:

$$\mathbf{I}(\mathbf{x}_{t-1}) - \mathbf{T} \approx \mathbf{J}\,\delta\mathbf{x}, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} + \delta\mathbf{x},$$

where $\mathbf{I}(\mathbf{x})$ is the sub-image corresponding to $\mathbf{T}$ cut out from the current frame at coordinates $\mathbf{x}$ and stacked into a column vector, and $\mathbf{J}$ is the matrix whose columns are the partial derivative images shown above. The correction is the least-squares solution $\delta\mathbf{x} = (\mathbf{J}^\top\mathbf{J})^{-1}\mathbf{J}^\top\bigl(\mathbf{I}(\mathbf{x}_{t-1}) - \mathbf{T}\bigr)$.
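As a concrete illustration of this update, here is a minimal single-step sketch in Python/NumPy (not the XVision implementation). It restricts the state to pure translation and assumes the two derivative images Jx and Jy, the change in patch appearance per unit motion of the target, have been precomputed from the template; all names are illustrative.

```python
import numpy as np

def ssd_update(frame, template, Jx, Jy, x, y):
    """One linearized SSD correction for a translation-only tracker.

    Jx, Jy: precomputed derivative images (change in patch appearance per
            unit x / y motion of the target), same shape as the template.
    (x, y): integer top-left corner of the patch in the previous frame.
    """
    h, w = template.shape
    patch = frame[y:y + h, x:x + w].astype(float)
    residual = (patch - template).ravel()                  # I(x_{t-1}) - T
    J = np.column_stack((Jx.ravel(), Jy.ravel()))          # columns = derivative images
    delta, *_ = np.linalg.lstsq(J, residual, rcond=None)   # least-squares correction
    return x + delta[0], y + delta[1]                      # updated state x_t
```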

Implementation

We have implemented the video pipeline and tracking using public-domain software libraries for IEEE 1394 network and camera drivers together with the XVision real-time SSD trackers. Any IEEE 1394 compliant camera, from $100 webcams (e.g., Pyro, iBot) to professional high frame rate cameras, can be used. The user first selects several important features in the initial scene, which should be placed so as to capture the geometric variation in the scene. These features are then tracked from frame to frame as described above. Since some of the features may be occluded in any given frame, the interface allows trackers to be paused or deleted and new trackers to be added. The theoretical discussion above is somewhat limited in that it allows for at most one pixel of feature movement from one frame to the next. In practice the tracked region may move by more than one pixel, so to ensure that the trackers are not lost, they are updated two to three times per frame until convergence.
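A rough sketch of that per-frame loop, reusing the hypothetical ssd_update above (the iteration cap and convergence threshold are illustrative, not the values used in our system):

```python
def track_frame(frame, template, Jx, Jy, x, y, max_iters=3, tol=0.1):
    """Apply the SSD correction a few times within a single frame so that
    motions larger than one pixel can still be recovered."""
    for _ in range(max_iters):
        nx, ny = ssd_update(frame, template, Jx, Jy, int(round(x)), int(round(y)))
        converged = abs(nx - x) < tol and abs(ny - y) < tol
        x, y = nx, ny
        if converged:
            break
    return x, y
```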

The above screen capture shows an object being tracked, along with the user interface.

Experiments