Video Detection and Segmentation with Language Modeling


Introduction

There are two ways to model the output space of visual recognition tasks like object detection and segmentation. The conventional approach is to use a continuous-valued and fixed-size feature map. A newer alternative is to use a variable-length sequence of discrete tokens that are output by the network one at a time through autoregression. The token-based approach combines elements from both computer vision and natural language processing and is inspired by image and video captioning models that take one or more images as input to produce a variable-length sentence as output.

This kind of discrete token-based representation has two main advantages. Firstly, it allows us to succinctly model the output space in domains where the output size is highly variable. For example, the number of objects in a video can vary widely across videos, and the fixed-size representation used in conventional object detectors introduces a great deal of sparsity in the loss computation, which makes training difficult. Secondly, it eliminates the need to perform heuristics-based postprocessing on the raw network output to convert it into a form suitable for downstream processing, thereby allowing true end-to-end training. For example, usable bounding boxes can be constructed directly from the tokens output by a token-based detector without having to perform confidence thresholding and non-maximum suppression.

This project presents a novel way to perform token-based video object detection and semantic segmentation by extending the Pix2Seq framework from Google.

Pix2Seq

The Pix2Seq framework consists of a series of papers from Google DeepMind that tokenize various visual recognition tasks. The original paper dealt with object detection in static images and used an encoder-decoder transformer-based architecture with an autoregressive output decoder. It was then extended for instance segmentation, keypoint detection and image captioning in a multi-task variant of the original model. It was also adapted for dense video captioning by a different group at Google. This was followed by Bit Diffusion, which replaces the autoregressive transformer decoder with a diffusion module in order to perform dense prediction. Finally, this diffusion idea was extended to panoptic segmentation in images and videos. Although this last paper does include video segmentation, which overlaps with my work, its video component is very rudimentary and seems rather like an afterthought. It only supports pairs of consecutive frames and does not use any video-specific architectures. Also, it uses diffusion while my models use autoregression like the first two papers.
The second, multi-task paper also includes segmentation as one of its tasks, but it is performed on a per-instance basis where the bounding box coordinates of a single object are taken as input to generate a mask only for that object. My method, in contrast, generates a semantic segmentation mask for the entire image (or video) without any concept of object instances.

Static Object Detection

Pix2Seq performs object detection in images by representing each object by 5 tokens - 4 for the bounding box corner coordinates and 1 for the class. The tokens for all the objects in the image are output sequentially so that the total number of output tokens = 5 * number of objects. The end of all objects is marked by the special end of sentence (EOS) token. An example is shown in this animation from their blog post:

Pix2Seq Static Object Detection

As mentioned in the introduction, the main trick here is to output objects one at a time through autoregression, instead of all objects at once as in standard detectors, so that an arbitrarily varying number of boxes can be produced without any redundancy in the output size or in the corresponding loss function used for training.
It is remarkable that this system works without requiring any localization cues during training, so we do not need to sample the space of possible bounding boxes using methods like anchor boxes. This in turn makes it particularly suitable for extension to video object detection, since the space of possible objects in a video is exponentially larger than the space of possible boxes in an image, which makes sampling it impractical.
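
To make the tokenization concrete, here is a minimal Python sketch of how a box might be converted into the 5-token representation. The number of quantization bins, the token offsets and the coordinate order are illustrative assumptions rather than the exact values used in the Pix2Seq code:

    # Illustrative sketch of Pix2Seq-style detection tokenization.
    # NUM_BINS, the token offsets and the coordinate order are assumptions,
    # not the exact values used in the official Pix2Seq implementation.
    NUM_BINS = 1000              # number of discrete coordinate bins (assumed)
    CLASS_OFFSET = NUM_BINS      # class token ids start after coordinate tokens
    EOS_TOKEN = NUM_BINS + 100   # hypothetical end-of-sequence token id

    def quantize(v, img_size):
        # Map a continuous coordinate in [0, img_size] to a discrete bin.
        return min(NUM_BINS - 1, int(v / img_size * NUM_BINS))

    def boxes_to_tokens(boxes, classes, img_size):
        # Each object -> 5 tokens: 4 quantized corner coordinates plus 1 class token.
        tokens = []
        for (ymin, xmin, ymax, xmax), cls in zip(boxes, classes):
            tokens += [quantize(c, img_size) for c in (ymin, xmin, ymax, xmax)]
            tokens.append(CLASS_OFFSET + cls)
        tokens.append(EOS_TOKEN)  # marks the end of all objects
        return tokens

    # Two objects in a 640 x 640 image -> 2*5 + 1 = 11 tokens
    print(boxes_to_tokens([(32, 64, 320, 400), (100, 120, 200, 220)], [2, 5], 640))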

Video Object Detection

I adapted this system for video object detection by modifying the network architecture to take N images from a video as input instead of a single static image.
In theory, N could be large enough to cover the entire video but, in practice, it is limited by the available GPU memory. The maximum I have been able to get working with a 24 GB RTX 3090 is N=16 although it should be possible to get close to 50 with an 80 GB Tesla A100. The training time also increases significantly as N is increased so that would also constrain how large N can be made.

Temporal Windows

Since we cannot process the entire video at once, we divide it into N-frame temporal windows which are processed one at a time in a sliding window manner.
There are a couple of hyperparameters we can adjust to construct better temporal windows. The first is the temporal stride, which is the number of frames that separate two consecutive windows. We can use a stride < N to introduce redundancy during inference, which can help to deal with false negatives. For example, with N=3 and stride=1, we get frames 1, 2, 3 in the first temporal window, 2, 3, 4 in the second one, 3, 4, 5 in the third one and so on.
Temporal window stride
As a result, we have detection outputs for frame 2 from two different temporal windows, while the outputs for frame 3 (and each subsequent frame) come from three different windows.
The second hyperparameter is the frame gap between successive frames within the same temporal window. We can increase it as a form of data augmentation. For example, with N=6, stride=3 and frame gap=2, we get temporal windows:
Temporal frame gap
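
The sketch below shows how such temporal windows could be generated; the helper is hypothetical and only illustrates the effect of the stride and frame gap parameters:

    # Hypothetical helper illustrating temporal window construction.
    def temporal_windows(num_frames, n, stride, frame_gap):
        # Returns lists of 1-based frame indices, as in the examples above.
        windows = []
        span = (n - 1) * frame_gap   # distance from first to last frame in a window
        start = 1
        while start + span <= num_frames:
            windows.append([start + i * frame_gap for i in range(n)])
            start += stride
        return windows

    # N=3, stride=1, frame gap=1 -> [1, 2, 3], [2, 3, 4], [3, 4, 5], ...
    print(temporal_windows(6, n=3, stride=1, frame_gap=1))
    # N=6, stride=3, frame gap=2 -> [1, 3, 5, 7, 9, 11], [4, 6, 8, 10, 12, 14], ...
    print(temporal_windows(20, n=6, stride=3, frame_gap=2))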

Tokenization

Each object is represented by N*4+1 tokens - 4 tokens per frame for the bounding box corner coordinates and an extra token at the end for the class.
For example, we need 9 tokens per object for N=2:
N=2
This is another factor, in addition to GPU memory and training time, that imposes a limit on how large we can make N. With N=50, we would need just over 200 tokens to represent a single object and several thousand for tens of objects.
The Pix2Seq framework has a default limit of 512 on the maximum number of tokens produced by the network. It can be increased but, in my experience, this causes training time to increase too even if N is not changed.
It is possible that an object is not present in each of the N frames, so we have a special token NA to denote non-existence. Missing objects usually happen because an object either enters or leaves the scene in the middle of the temporal window, though it can also happen if the object is briefly occluded in the middle of the window.
NA token
While decoding the tokens generated by the network during inference, if any one of the 4 tokens corresponding to a frame is NA, we assume the object is not present in that frame.
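
The following sketch shows how a single object track might be tokenized and decoded with the NA convention; the token ids are assumptions and only the overall layout (N*4 coordinate tokens followed by one class token) reflects the scheme described above:

    # Illustrative per-object video tokenization with an NA token.
    # Token ids are assumptions; only the N*4+1 layout follows the scheme above.
    NUM_BINS = 1000
    NA_TOKEN = NUM_BINS          # hypothetical id for the "object absent" token
    CLASS_OFFSET = NUM_BINS + 1

    def track_to_tokens(boxes_per_frame, cls, img_size):
        # boxes_per_frame: list of length N with (ymin, xmin, ymax, xmax) or None.
        tokens = []
        for box in boxes_per_frame:
            if box is None:
                tokens += [NA_TOKEN] * 4   # object absent in this frame
            else:
                tokens += [min(NUM_BINS - 1, int(c / img_size * NUM_BINS))
                           for c in box]
        tokens.append(CLASS_OFFSET + cls)  # single class token at the end
        return tokens

    def frame_has_object(tokens, frame_idx):
        # A frame is treated as "object absent" if any of its 4 tokens is NA.
        return NA_TOKEN not in tokens[frame_idx * 4: frame_idx * 4 + 4]

    # N=2: object visible only in the first frame -> 2*4 + 1 = 9 tokens
    tokens = track_to_tokens([(32, 64, 320, 400), None], cls=1, img_size=640)
    print(tokens, frame_has_object(tokens, 0), frame_has_object(tokens, 1))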

Visualization

Following is an example of video detection tokenization with N=2 and frame gap=10 on a sequence from the UA-DETRAC dataset:



Following is another example from the same dataset with N=6 where we can see many instances of objects being present for only part of the temporal window:



Following is an example from the IPSC dataset with N=3 and frame gap=5, showing multiple classes as well as cell-fusion.



Semantic Segmentation

Run Length Encoding (RLE)

I have chosen to tokenize semantic segmentation masks using the run-length encoding (RLE) representation.
This is a lossless encoding method that flattens the segmentation mask into a vector and represents it as a sequence of runs.
A run is a contiguous sequence of non-zero values that can be represented by a pair of values - start, length - where start is the index of the first pixel in the run and length is the number of pixels in the run.
A binary segmentation mask can thus be represented by a sequence of these pairs:
Binary RLE
while a multi-class mask would need a sequence of triplets since each run would also have the class ID:
Multi-Class RLE
I also considered alternative representations, including polygons and quadtrees, but statistical tests showed that all the representations use roughly the same number of tokens, and RLE is much easier to implement, especially when generalizing to multi-class and video segmentation. RLE is also likely to be more robust to noisy tokens during mask reconstruction at inference since a single run, which forms the geometrical unit of RLE tokenization, has a much smaller impact on the overall mask quality than a polygon or quadtree.
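
Here is a small Python sketch of the RLE representation described above; it flattens a mask in row-major order and emits (start, length, class) triplets, which reduce to (start, length) pairs for binary masks:

    # Sketch of run-length encoding a mask into (start, length, class) runs.
    import numpy as np

    def rle_encode(mask):
        # mask: 2D integer array with 0 = background.
        flat = mask.flatten()               # row-major flattening
        runs, start, cls = [], None, None
        for i, v in enumerate(flat):
            if v != 0 and start is None:    # a new run begins
                start, cls = i, v
            elif start is not None and (v == 0 or v != cls):
                runs.append((start, i - start, int(cls)))   # close the current run
                start, cls = (i, v) if v != 0 else (None, None)
        if start is not None:               # close a run that reaches the last pixel
            runs.append((start, len(flat) - start, int(cls)))
        return runs

    mask = np.zeros((4, 6), dtype=int)
    mask[1, 1:4] = 1                        # a run of class 1
    mask[2, 2:5] = 2                        # a run of class 2
    print(rle_encode(mask))                 # [(7, 3, 1), (14, 3, 2)]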

Tokenization

I have determined experimentally that, like most transformer-based language models, Pix2Seq is very difficult to train from scratch on the relatively small datasets that I am working with. Therefore, we want to be able to reuse as much of the pre-trained weights as possible, which requires that the architecture be modified as little as possible. Specifically, there are a couple of architectural constraints that we need to consider - the size of the vocabulary and the maximum sequence length.

Out of the box, Pix2Seq has a maximum sequence length of 512 and a vocabulary size of 3K for the object detection model, though the multi-task version supports vocabulary sizes up to 32K. Choosing how to tokenize RLE involves a trade-off between these two constraints, where one can be decreased at the cost of increasing the other. Also, both the training time and GPU memory requirement increase when either goes up, but this increase is more pronounced for the maximum sequence length. Hence, my overall objective was to keep the RLE sequence as short as possible while allowing the number of required tokens to increase up to 32K. I have experimented with RLE sequence lengths up to 4096 and vocabulary sizes up to 28K and they seem to work.

Let us assume that we have an S x S segmentation mask. If we use the conventional representation of flattening the mask into a 1D vector, a single token suffices to represent the start of each run, but we then need S^2 different tokens for the start indices. Alternately, we can skip the flattening and use 2D coordinates to represent the start of each run, in which case only S start tokens are needed but each run now requires 3 tokens (row, column and length). As mentioned above, reducing the total number of tokens in the sequence seems to be more important than reducing the size of the vocabulary, so I have been working with the flattened version.

Further, we have the option to share the same set of tokens for both starts and lengths since the lengths can theoretically be as large as the starts when a single run covers the entire mask. However, very few runs extend across multiple rows in practice, so I decided to use separate tokens for lengths and imposed a limit of S on the maximum length that a run can have. Runs that exceed S can be split into multiple runs. For example, with S=80, an overlong run (300, 125) can be split into two runs: (300, 80) and (380, 45). In theory, this can increase the total number of runs (and therefore the maximum sequence length) dramatically but, as mentioned above, it is very rare for runs to extend across multiple rows so it does not matter much in practice.
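
The sketch below illustrates this scheme: runs longer than S are split and each run is then mapped to a start token plus a length token (the token id layout is an assumption):

    # Sketch of start/length tokenization with a maximum run length of S.
    S = 80                       # subsampled mask is S x S
    LEN_OFFSET = S * S           # length tokens placed after the S*S start tokens (assumed layout)

    def split_run(start, length, max_len=S):
        # Split an overlong run into pieces no longer than max_len.
        pieces = []
        while length > max_len:
            pieces.append((start, max_len))
            start, length = start + max_len, length - max_len
        pieces.append((start, length))
        return pieces

    def runs_to_tokens(runs):
        tokens = []
        for start, length in runs:
            for s, l in split_run(start, length):
                tokens += [s, LEN_OFFSET + (l - 1)]   # 2 tokens per run
        return tokens

    print(split_run(300, 125))                 # [(300, 80), (380, 45)]
    print(runs_to_tokens([(300, 125), (645, 10)]))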

Sliding Windows

I collected statistics on the length of the RLE sequence required to represent segmentation masks for full images over the entire IPSC dataset. This turned out to be well over 512 for many images even when they were resized down to 320 x 320, which itself reduced the resolution of several IPSC sequences by a factor of more than 10. I therefore decided to train the model on smaller patches extracted from those images in a sliding window manner instead of the full images directly. This is similar to how I handled over-large images in my river ice segmentation paper:





I found empirically that patch dimensions ranging from 1/4 to 1/8 of the full image are small enough to produce RLE sequences of suitable sizes as long as the masks themselves are subsampled to between 80 x 80 and 160 x 160. I am using the smallest version of the Pix2Seq architecture, which takes 640 x 640 images as input, and this gave the target size for these patches. In order for the 640 x 640 patches to be 1/4 of the size of the full image, I resized the images to 2560 x 2560, which is also large enough that we do not lose much resolution in most of the IPSC sequences.

I then subsampled the 640 x 640 masks down to either 80 x 80 or 160 x 160 and generated the RLE training data using these. I also applied all of the patch dataset augmentation techniques from the river ice segmentation paper including random strides smaller than S to have overlapping patches, and applying random rotation along with horizontal and vertical flipping to the patches thus generated.
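
For reference, a bare-bones version of the sliding-window patch extraction might look like the sketch below; the fixed stride here is purely illustrative, whereas the actual pipeline uses the random strides and augmentations mentioned above:

    # Bare-bones sliding-window patch extraction (fixed stride, for illustration only).
    import numpy as np

    def extract_patches(image, patch_size=640, stride=320):
        # Yields (y, x, patch); stride < patch_size produces overlapping patches.
        h, w = image.shape[:2]
        for y in range(0, h - patch_size + 1, stride):
            for x in range(0, w - patch_size + 1, stride):
                yield y, x, image[y:y + patch_size, x:x + patch_size]

    image = np.zeros((2560, 2560, 3), dtype=np.uint8)
    print(sum(1 for _ in extract_patches(image)))   # 49 patches at this stride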

Lengths-As-Class (LAC)

As mentioned before, multi-class segmentation masks require 3 tokens per run since we need an extra token for the class ID. This can unnecessarily increase the sequence length so we can go back to using 2 tokens per run by combining the length and class tokens into a single composite token that represents both.
This can be done by treating the lengths as classes too, whereby each unique combination of run length and class ID corresponds to a separate token, so that the total number of LAC tokens is the product of the maximum run length (i.e. S) and the number of classes. The first S LAC tokens then correspond to runs of class 1, the next S tokens correspond to class 2 and so on.
The IPSC dataset has 2 classes so that, with S=80, we get the number of starts tokens = 80*80 = 6400, number of LAC tokens = 80*2 = 160 and vocabulary size = 6400 + 160 = 6560.
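
A minimal sketch of this indexing, assuming the LAC tokens are laid out immediately after the start tokens:

    # Sketch of Lengths-As-Class (LAC) token indexing for S=80 and 2 classes.
    # The placement of LAC tokens after the start tokens is an assumption.
    S, NUM_CLASSES = 80, 2
    NUM_STARTS = S * S                     # 6400 start tokens
    NUM_LAC = S * NUM_CLASSES              # 160 LAC tokens
    VOCAB_SIZE = NUM_STARTS + NUM_LAC      # 6560

    def lac_token(length, class_id):
        # Combine run length (1..S) and class id (1..NUM_CLASSES) into one token:
        # the first S LAC tokens encode class 1, the next S encode class 2, etc.
        assert 1 <= length <= S and 1 <= class_id <= NUM_CLASSES
        return NUM_STARTS + (class_id - 1) * S + (length - 1)

    def decode_lac(token):
        idx = token - NUM_STARTS
        return idx % S + 1, idx // S + 1   # (length, class_id)

    print(VOCAB_SIZE, lac_token(5, 2), decode_lac(lac_token(5, 2)))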

Visualization

Following are some visualizations of the semantic segmentation tokenization process.
Binary Mask with Row-Major Flattening:


Binary mask with column-major flattening:


Multi-class mask with separate class tokens:


Multi-class mask with LAC tokens:

Video Segmentation

A straightforward extension of this idea to perform semantic segmentation on videos would involve flattening the N x S x S 3D mask into a 1D vector of size N*S^2 using either row-major or column-major ordering.
Row-major or C-ordering (3D-C) results in the runs for individual images in the video simply getting concatenated together, i.e. all the runs for the first image come together in a sequence, followed by the runs for the second image and so on. This does not account for spatiotemporal consistencies in the masks since the runs corresponding to the same physical object from different images are completely unrelated as shown in the following visualizations for N=2:


Column-major or Fortran ordering (3D-F) partially accounts for spatiotemporal mask consistency but has large numbers of very short runs due to small offsets in the mask between consecutive frames. This not only significantly increases the sequence length but the very short, often unit sized, runs are also likely to be difficult to train on.


Another problem with this straightforward 3D flattening is that the number of starts tokens increases linearly with N and becomes impractical for N > 5.

Time-As-Class (TAC)

These issues can be largely resolved by combining the temporal dimension with class IDs so that every possible combination of class IDs across the video frames is considered a separate class.
Following are the TAC classes for both binary and multi-class cases with N=2 and N=3, where bkg refers to background while ips and dif are the two classes of cells in the IPSC dataset:
TAC

Due to this combinatorial approach, the total number of TAC tokens becomes (C+1)^N - 1, where C is the number of foreground classes. Even though this increases exponentially with N, the small exponent bases of 2 and 3 for the binary and multi-class cases enable the vocabulary size to remain practical for up to N=12 and N=9 respectively. This is helped by the fact that the number of starts tokens now becomes independent of N since the 3D video mask with C classes has effectively been collapsed into a 2D mask with (C+1)^N - 1 classes.
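
The sketch below illustrates the combinatorial mapping; the exact ordering of the TAC classes in my implementation may differ, but the counts match the figures that follow:

    # Sketch of the Time-As-Class (TAC) mapping: a tuple of per-frame class ids
    # (0 = background) is collapsed into one id via base-(C+1) encoding, and the
    # all-background tuple remains background, giving (C+1)^N - 1 TAC classes.
    from itertools import product

    def tac_class(per_frame_classes, num_fg_classes):
        base = num_fg_classes + 1
        code = 0
        for c in per_frame_classes:   # interpret the tuple as a base-(C+1) number
            code = code * base + c
        return code                   # 0 = background, 1..(C+1)^N - 1 = TAC classes

    for n, c in [(2, 1), (3, 1), (2, 2), (3, 2)]:
        combos = [t for t in product(range(c + 1), repeat=n) if any(t)]
        codes = sorted(tac_class(t, c) for t in combos)
        print(f"N={n}, C={c}: {len(combos)} TAC classes, ids 1..{codes[-1]}")   # 3, 7, 8, 26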
Following are visualizations for TAC tokenization for both N=2 and N=3. The two masks on the top right are the TAC masks before and after subsampling, having respective resolutions of 640 x 640 and 80 x 80.
Binary mask with N=2 (TAC classes = 3)


Binary mask with N=3 (TAC classes = 7)


Multi-Class mask with N=2 (TAC classes = 8)


Multi-class mask with N=3 (TAC classes = 26)


Length-and-Time-As-Class (LTAC)

TAC and LAC tokenization techniques can be combined to represent each run with only 2 tokens for video masks too. However, the number of LTAC tokens = S*((C+1)^N - 1) becomes impractical for N > 8 and N > 5 for the binary and multi-class cases respectively. Following are visualizations of LTAC tokenization for both cases with N=2.



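As a quick sanity check on the vocabulary sizes involved, the LTAC token count from the formula above can be tabulated as follows (the 32K limit refers to the overall vocabulary cap mentioned earlier, which also has to accommodate the start tokens):

    # LTAC vocabulary-size arithmetic: S * ((C+1)^N - 1) tokens.
    def num_ltac_tokens(s, num_fg_classes, n):
        return s * ((num_fg_classes + 1) ** n - 1)

    S = 80
    print(num_ltac_tokens(S, 1, 8))   # binary, N=8      -> 20400 (still practical)
    print(num_ltac_tokens(S, 1, 9))   # binary, N=9      -> 40880 (exceeds 32K)
    print(num_ltac_tokens(S, 2, 5))   # multi-class, N=5 -> 19360
    print(num_ltac_tokens(S, 2, 6))   # multi-class, N=6 -> 58240 (exceeds 32K)
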
Architectures

Image

Pix2Seq supports two backbone architectures - ResNet-50 and ViT - each with three input sizes - 640 x 640, 1024 x 1024 and 1333 x 1333. I have only worked with the smallest version - 640 x 640 ResNet-50 - so as to maximize N with the 24 GB of GPU memory available to me. The network itself has a transformer-based encoder-decoder architecture where the encoder employs only self-attention over image features while the autoregressive decoder has both self-attention over sequence features and cross-attention between sequence and image features.

Encoder

The encoder takes the 640 x 640 RGB image as input and applies multi-headed self-attention to convert it into 400 x 256 features. This diagram summarizes the encoder architecture:

Image Encoder

Here, B is the batch size and the Flatten Feat module refers to the spatial flattening of the 20 x 20 feature maps from the ResNet-50 backbone into 1D feature vectors of size 400. The Self MHA x 6 module refers to the multi-headed attention module that applies self-attention to the image features and is repeated 6 times.

Decoder

The decoder takes the 400 x 256 image features from the encoder along with the sequence tokens. It then applies multi-headed self-attention to the sequence embedding features followed by cross-attention between the sequence and image features. This self+cross MHA module is also repeated 6 times like the encoder.
This diagram summarizes the decoder architecture:
Image Decoder
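
The shape bookkeeping through the encoder and decoder can be summarized with the small numpy sketch below; it replaces the learned attention layers with a single unlearned cross-attention step purely to trace the tensor shapes:

    # Shape-only sketch of the encoder-decoder (learned layers replaced by
    # random projections; for tracing tensor shapes, not a working model).
    import numpy as np

    B = 2                                              # batch size
    backbone_feat = np.random.randn(B, 20, 20, 2048)   # ResNet-50 output for 640 x 640 input
    flat_feat = backbone_feat.reshape(B, 400, 2048)    # Flatten Feat: 20*20 -> 400
    proj = np.random.randn(2048, 256) * 0.01           # projection to 256 dims
    img_feat = flat_feat @ proj                        # B x 400 x 256, refined by Self MHA x 6
    seq_feat = np.random.randn(B, 500, 256)            # embedded sequence tokens
    # Decoder cross-attention: sequence features (queries) attend to image features.
    attn = np.einsum('bqc,bkc->bqk', seq_feat, img_feat) / np.sqrt(256)
    attn = np.exp(attn - attn.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = np.einsum('bqk,bkc->bqc', attn, img_feat)    # B x 500 x 256
    print(img_feat.shape, out.shape)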

Video

As mentioned before, I wanted to make as few changes to this baseline architecture as possible in order to be able to use most of the pre-trained weights. I have come up with three ways to adapt it for processing videos that differ in the stage of the pipeline at which the features corresponding to individual video frames are fused together.

Early Fusion

This method replaces the ResNet-50 backbone with a video-specific backbone such as the Video Swin Transformer or 3D-ResNet so that the feature fusion happens within the backbone itself. I have only experimented with the Video Swin Transformer so far.
Early-Fusion
This method only changes the number of backbone feature maps from 2048 to 768 while the rest of the pipeline remains unchanged. This means that the only part of the remaining pipeline for which we cannot use pretrained weights is the projection MLP that projects the flattened backbone features to 400 x 256. Of course, we also lose the pretrained weights for the backbone itself, but the Video Swin Transformer implementation I am using comes with its own weights, which seem to work well enough.

Middle Fusion

Here, we first flatten the temporal dimension along the batch dimension to replace B with B*N images and feature maps while leaving the rest of the encoder pipeline unchanged, so that all of the operations are performed on each one of the video frames independently.
Middle-Fusion
Once we have the B*N x 400 x 256 output from the self-MHA module, we unflatten the temporal dimension to separate out the 400 x 256 features for each one of the N video frames. We then apply pairwise compositional cross-attention to the video frames. This means that we apply cross-attention first between the features of images 1 and 2, then between the output of this operation and image 3 features, then between that output and image 4 features and so on.
Middle-Fusion Cross-MHA
This method allows us to use all the pretrained weights but we have to learn the compositional cross-MHA weights from scratch. This can sometimes lead to a bit of instability during training, especially for larger values of N, though early results do indicate that this method has the best performance among the three fusion methods.
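
A single-head numpy sketch of the pairwise compositional fusion is given below; the real module uses learned multi-head attention, so this only conveys the order in which frames are composed:

    # Single-head sketch of pairwise compositional cross-attention over frames.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attend(q_feat, kv_feat):
        # q_feat, kv_feat: (400, 256); queries attend to keys/values of kv_feat.
        attn = softmax(q_feat @ kv_feat.T / np.sqrt(q_feat.shape[-1]))
        return attn @ kv_feat

    def compositional_fusion(frame_feats):
        # Fuse frame 1 with frame 2, that result with frame 3, and so on.
        fused = frame_feats[0]
        for feat in frame_feats[1:]:
            fused = cross_attend(fused, feat)
        return fused                                   # (400, 256)

    frames = [np.random.randn(400, 256) for _ in range(4)]   # N = 4
    print(compositional_fusion(frames).shape)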

Late Fusion

This is identical to middle fusion as far as generating the B*N x 400 x 2048 backbone features, after which the temporal dimension is unflattened to separate out frame-specific features. A 3-D position embedding is then added to these to encode information about the position of each frame within the video. The features for all the frames in each video are then concatenated together to create a feature map of size B x 400N x 2048 which is processed normally for the remainder of the encoder pipeline, except that we now have 400*N features instead of 400.

Late-Fusion Encoder

Each of these 400*N image features is then cross-attended with each of the 500 sequence features in the decoder so that every output token is directly able to attend to every single frame.

Late-Fusion Decoder
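
The late-fusion reshaping can be sketched as follows; the position embedding here is a random stand-in for the learned 3-D embedding:

    # Sketch of late fusion: unflatten the temporal dimension, add a position
    # embedding (random stand-in for the learned 3-D embedding), and concatenate
    # frame features along the token axis.
    import numpy as np

    B, N = 2, 4
    backbone_feat = np.random.randn(B * N, 400, 2048)    # per-frame backbone features
    frame_feat = backbone_feat.reshape(B, N, 400, 2048)  # unflatten the temporal dimension
    pos_embed = np.random.randn(N, 400, 2048) * 0.02     # stand-in 3-D position embedding
    frame_feat = frame_feat + pos_embed                  # broadcast over the batch
    fused = frame_feat.reshape(B, N * 400, 2048)         # B x 400N x 2048
    print(fused.shape)   # every output token can now cross-attend to every frame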

Results

I am still working on debugging parts of the inference and evaluation pipeline so the final quantitative results are not available yet but early qualitative results are very promising and suggest that both video detection and segmentation methods work well enough to be comparable to the state of the art.
Both qualitative and quantitative results will be added soon.