Video Detection and Segmentation with Language Modeling


Introduction

There are two ways to model the output space of visual recognition tasks like object detection and segmentation. The conventional approach is to use a continuous-valued and fixed-size feature map. A newer alternative is to use a variable-length sequence of discrete tokens that are output by the network one at a time through autoregression. The token-based approach combines elements from both computer vision and natural language processing and is inspired by image and video captioning models that take one or more images as input to produce a variable-length sentence as output.

This kind of discrete token-based representation has two main advantages. Firstly, it allows us to succinctly model the output space in domains where the output size is highly variable. For example, the number of objects in a video can vary widely across videos, and the fixed-size representation used in conventional object detectors introduces a great deal of sparsity in the loss computation, which makes training difficult. Secondly, it eliminates the need to perform heuristics-based postprocessing on the raw network output to convert it into a form suitable for downstream processing, thereby allowing true end-to-end training. For example, usable bounding boxes can be constructed directly from the tokens output by a token-based detector without having to perform confidence thresholding and non-maximum suppression.

This project presents a novel way to perform token-based video object detection and semantic segmentation by extending the Pix2Seq framework from Google.

Pix2Seq

The Pix2Seq framework consists of a series of papers from Google DeepMind that tokenize various visual recognition tasks. The original paper dealt with object detection in static images and used an encoder-decoder transformer-based architecture with an autoregressive output decoder. It was then extended for instance segmentation, keypoint detection and image captioning in a multi-task variant of the original model. It was also adapted for dense video captioning by a different group at Google. This was followed by Bit Diffusion, which replaces the autoregressive transformer decoder with a diffusion module in order to perform dense prediction. Finally, this diffusion idea was extended to panoptic segmentation in images and videos. Although this last paper does include video segmentation, which overlaps with my work, its video component is very rudimentary and seems rather like an afterthought. It only supports pairs of consecutive frames and does not use any video-specific architectures. Also, it uses diffusion while my models use autoregression like the first two papers.
The second, multi-task paper also includes segmentation as one of its tasks, but it is performed on a per-instance basis where the bounding box coordinates of a single object are taken as input to generate a mask only for that object. My method, in contrast, generates a semantic segmentation mask for the entire image (or video) without any concept of object instances.

Static Object Detection

Pix2Seq performs object detection in images by representing each object by 5 tokens - 4 for the bounding box corner coordinates and 1 for the class. The tokens for all the objects in the image are output sequentially so that the total number of output tokens = 5 * number of objects. The end of all objects is marked by the special end of sentence (EOS) token. An example is shown in this animation from their blog post:

Pix2Seq Static Object Detection

As mentioned in the introduction, the main trick here is to output objects one at a time through autoregression, instead of all objects at once as in standard detectors, so that an arbitrarily varying number of boxes can be produced without any redundancy in the output size or in the corresponding loss function used for training.
It is remarkable that this system works without requiring any localization cues during training, so we do not need to sample the space of possible bounding boxes using methods like anchor boxes. This in turn makes it particularly suitable for extension to video object detection, since the space of possible objects in a video is exponentially larger than the space of possible boxes in an image, which makes sampling it impractical.
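
To make the tokenization concrete, here is a minimal Python sketch of how a box might be converted into the 5-token representation. The number of quantization bins, the token offsets and the coordinate order are illustrative assumptions rather than the exact values used in the Pix2Seq code:

    # Illustrative sketch of Pix2Seq-style detection tokenization.
    # NUM_BINS, the token offsets and the coordinate order are assumptions,
    # not the exact values used in the official Pix2Seq implementation.
    NUM_BINS = 1000              # number of discrete coordinate bins (assumed)
    CLASS_OFFSET = NUM_BINS      # class token ids start after coordinate tokens
    EOS_TOKEN = NUM_BINS + 100   # hypothetical end-of-sequence token id

    def quantize(v, img_size):
        # Map a continuous coordinate in [0, img_size] to a discrete bin.
        return min(NUM_BINS - 1, int(v / img_size * NUM_BINS))

    def boxes_to_tokens(boxes, classes, img_size):
        # Each object -> 5 tokens: 4 quantized corner coordinates plus 1 class token.
        tokens = []
        for (ymin, xmin, ymax, xmax), cls in zip(boxes, classes):
            tokens += [quantize(c, img_size) for c in (ymin, xmin, ymax, xmax)]
            tokens.append(CLASS_OFFSET + cls)
        tokens.append(EOS_TOKEN)  # marks the end of all objects
        return tokens

    # Two objects in a 640 x 640 image -> 2*5 + 1 = 11 tokens
    print(boxes_to_tokens([(32, 64, 320, 400), (100, 120, 200, 220)], [2, 5], 640))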

Video Object Detection

I adapted this system for video object detection by modifying the network architecture to take N images from a video as input instead of a single static image.
In theory, N could be large enough to cover the entire video but, in practice, it is limited by the available GPU memory. The maximum I have been able to get working with a 24 GB RTX 3090 is N=16 although it should be possible to get close to 50 with an 80 GB Tesla A100. The training time also increases significantly as N is increased so that would also constrain how large N can be made.

Temporal Windows

Since we cannot process the entire video at once, we divide it into N-frame temporal windows which are processed one at a time in a sliding window manner.
There are a couple of hyperparameters we can adjust to construct better temporal windows. The first is the temporal stride, which is the number of frames that separate two consecutive windows. We can use a stride < N to introduce redundancy during inference, which can help to deal with false negatives. For example, with N=3 and stride=1, we get frames 1, 2, 3 in the first temporal window, 2, 3, 4 in the second one, 3, 4, 5 in the third one and so on.
Temporal window stride
As a result, we have detection outputs for frame 2 from two different temporal windows, while the outputs for frame 3 (and each subsequent frame) come from three different windows.
The second hyperparameter is the frame gap between successive frames within the same temporal window. We can increase it as a form of data augmentation. For example, with N=6, stride=3 and frame gap=2, we get temporal windows:
Temporal frame gap
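
The sketch below shows how such temporal windows could be generated; the helper is hypothetical and only illustrates the effect of the stride and frame gap parameters:

    # Hypothetical helper illustrating temporal window construction.
    def temporal_windows(num_frames, n, stride, frame_gap):
        # Returns lists of 1-based frame indices, as in the examples above.
        windows = []
        span = (n - 1) * frame_gap   # distance from first to last frame in a window
        start = 1
        while start + span <= num_frames:
            windows.append([start + i * frame_gap for i in range(n)])
            start += stride
        return windows

    # N=3, stride=1, frame gap=1 -> [1, 2, 3], [2, 3, 4], [3, 4, 5], ...
    print(temporal_windows(6, n=3, stride=1, frame_gap=1))
    # N=6, stride=3, frame gap=2 -> [1, 3, 5, 7, 9, 11], [4, 6, 8, 10, 12, 14], ...
    print(temporal_windows(20, n=6, stride=3, frame_gap=2))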

Tokenization

Each object is represented by N*4+1 tokens - 4 tokens per frame for the bounding box corner coordinates and an extra token at the end for the class.
For example, we need 9 tokens per object for N=2:
N=2
This is another factor, in addition to GPU memory and training time, that imposes a limit on how large we can make N. With N=50, we would need just over 200 tokens to represent a single object and several thousand for tens of objects.
The Pix2Seq framework has a default limit of 512 on the maximum number of tokens produced by the network. It can be increased but, in my experience, this causes training time to increase too even if N is not changed.
It is possible that an object is not present in each of the N frames, so we have a special token NA to denote non-existence. Missing objects usually happen because an object either enters or leaves the scene in the middle of the temporal window, though it can also happen if the object is briefly occluded in the middle of the window.
NA token
While decoding the tokens generated by the network during inference, if any one of the 4 tokens corresponding to a frame is NA, we assume the object is not present in that frame.
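
The following sketch shows how a single object track might be tokenized and decoded with the NA convention; the token ids are assumptions and only the overall layout (N*4 coordinate tokens followed by one class token) reflects the scheme described above:

    # Illustrative per-object video tokenization with an NA token.
    # Token ids are assumptions; only the N*4+1 layout follows the scheme above.
    NUM_BINS = 1000
    NA_TOKEN = NUM_BINS          # hypothetical id for the "object absent" token
    CLASS_OFFSET = NUM_BINS + 1

    def track_to_tokens(boxes_per_frame, cls, img_size):
        # boxes_per_frame: list of length N with (ymin, xmin, ymax, xmax) or None.
        tokens = []
        for box in boxes_per_frame:
            if box is None:
                tokens += [NA_TOKEN] * 4   # object absent in this frame
            else:
                tokens += [min(NUM_BINS - 1, int(c / img_size * NUM_BINS))
                           for c in box]
        tokens.append(CLASS_OFFSET + cls)  # single class token at the end
        return tokens

    def frame_has_object(tokens, frame_idx):
        # A frame is treated as "object absent" if any of its 4 tokens is NA.
        return NA_TOKEN not in tokens[frame_idx * 4: frame_idx * 4 + 4]

    # N=2: object visible only in the first frame -> 2*4 + 1 = 9 tokens
    tokens = track_to_tokens([(32, 64, 320, 400), None], cls=1, img_size=640)
    print(tokens, frame_has_object(tokens, 0), frame_has_object(tokens, 1))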

Visualization

Following is an example of video detection tokenization with N=2 and frame gap=10 on a sequence from the UA-DETRAC dataset:



Following is another example from the same dataset with N=6 where we can see many instances of objects being present for only part of the temporal window:



Following is an example from the IPSC dataset with N=3 and frame gap=5, showing multiple classes as well as cell-fusion.



Semantic Segmentation

Run Length Encoding (RLE)

I have chosen to tokenize semantic segmentation masks using the run-length encoding (RLE) representation.
This is a lossless encoding method that flattens the segmentation mask into a vector and represents it as a sequence of runs.
A run is a contiguous sequence of non-zero values that can be represented by a pair of values - start, length - where start is the index of the first pixel in the run and length is the number of pixels in the run.
A binary segmentation mask can thus be represented by a sequence of these pairs:
Binary RLE
while a multi-class mask would need a sequence of triplets since each run would also have the class ID:
Multi-Class RLE
I also considered alternative representations, including polygons and quadtrees, but statistical tests showed that all the representations use roughly the same number of tokens, and RLE is much easier to implement, especially when generalizing to multi-class and video segmentation. RLE is also likely to be more robust to noisy tokens during mask reconstruction at inference since a single run, which forms the geometrical unit of RLE tokenization, has a much smaller impact on the overall mask quality than a polygon or quadtree.
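
Here is a small Python sketch of the RLE representation described above; it flattens a mask in row-major order and emits (start, length, class) triplets, which reduce to (start, length) pairs for binary masks:

    # Sketch of run-length encoding a mask into (start, length, class) runs.
    import numpy as np

    def rle_encode(mask):
        # mask: 2D integer array with 0 = background.
        flat = mask.flatten()               # row-major flattening
        runs, start, cls = [], None, None
        for i, v in enumerate(flat):
            if v != 0 and start is None:    # a new run begins
                start, cls = i, v
            elif start is not None and (v == 0 or v != cls):
                runs.append((start, i - start, int(cls)))   # close the current run
                start, cls = (i, v) if v != 0 else (None, None)
        if start is not None:               # close a run that reaches the last pixel
            runs.append((start, len(flat) - start, int(cls)))
        return runs

    mask = np.zeros((4, 6), dtype=int)
    mask[1, 1:4] = 1                        # a run of class 1
    mask[2, 2:5] = 2                        # a run of class 2
    print(rle_encode(mask))                 # [(7, 3, 1), (14, 3, 2)]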

Tokenization

I have determined experimentally that, like most transformer-based language models, Pix2Seq is very difficult to train from scratch on the relatively small datasets that I am working with. Therefore, we want to be able to reuse as much of the pre-trained weights as possible, which requires that the architecture be modified as little as possible. Specifically, there are a couple of architectural constraints that we need to consider - the size of the vocabulary and the maximum sequence length.

Out of the box, Pix2Seq has a maximum sequence length of 512 and a vocabulary size of 3K for the object detection model, though the multi-task version supports vocabulary sizes up to 32K. Choosing how to tokenize RLE involves a trade-off between these two constraints, where one can be decreased at the cost of increasing the other. Also, both the training time and GPU memory requirement increase when either goes up, but this increase is more pronounced for the maximum sequence length. Hence, my overall objective was to keep the RLE sequence as short as possible while allowing the number of required tokens to increase up to 32K. I have experimented with RLE sequence lengths up to 4096 and vocabulary sizes up to 28K and they seem to work.

Let us assume that we have an S x S segmentation mask. If we use the conventional representation of flattening the mask into a 1D vector, a single token suffices to represent the start of each run, but we then need S^2 different tokens for the start indices. Alternately, we can skip the flattening and use 2D coordinates to represent the start of each run, in which case only S start tokens are needed but each run now requires 3 tokens (row, column and length). As mentioned above, reducing the total number of tokens in the sequence seems to be more important than reducing the size of the vocabulary, so I have been working with the flattened version.

Further, we have the option to share the same set of tokens for both starts and lengths since the lengths can theoretically be as large as the starts when a single run covers the entire mask. However, very few runs extend across multiple rows in practice, so I decided to use separate tokens for lengths and imposed a limit of S on the maximum length that a run can have. Runs that exceed S can be split into multiple runs. For example, with S=80, an overlong run (300, 125) can be split into two runs: (300, 80) and (380, 45). In theory, this can increase the total number of runs (and therefore the maximum sequence length) dramatically but, as mentioned above, it is very rare for runs to extend across multiple rows so it does not matter much in practice.
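
The sketch below illustrates this scheme: runs longer than S are split and each run is then mapped to a start token plus a length token (the token id layout is an assumption):

    # Sketch of start/length tokenization with a maximum run length of S.
    S = 80                       # subsampled mask is S x S
    LEN_OFFSET = S * S           # length tokens placed after the S*S start tokens (assumed layout)

    def split_run(start, length, max_len=S):
        # Split an overlong run into pieces no longer than max_len.
        pieces = []
        while length > max_len:
            pieces.append((start, max_len))
            start, length = start + max_len, length - max_len
        pieces.append((start, length))
        return pieces

    def runs_to_tokens(runs):
        tokens = []
        for start, length in runs:
            for s, l in split_run(start, length):
                tokens += [s, LEN_OFFSET + (l - 1)]   # 2 tokens per run
        return tokens

    print(split_run(300, 125))                 # [(300, 80), (380, 45)]
    print(runs_to_tokens([(300, 125), (645, 10)]))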

Sliding Windows

I collected statistics on the length of the RLE sequence required to represent segmentation masks for full images over the entire IPSC dataset. This turned out to be well over 512 for many images even when they were resized down to 320 x 320, which itself reduced the resolution of several IPSC sequences by a factor of more than 10. I therefore decided to train the model on smaller patches extracted from those images in a sliding window manner instead of the full images directly. This is similar to how I handled over-large images in my river ice segmentation paper:





I found empirically that patch dimensions ranging from 1/4 to 1/8 of the full image are small enough to produce RLE sequences of suitable sizes as long as the masks themselves are subsampled to between 80 x 80 and 160 x 160. I am using the smallest version of the Pix2Seq architecture, which takes 640 x 640 images as input, and this gave the target size for these patches. In order for the 640 x 640 patches to be 1/4 of the size of the full image, I resized the images to 2560 x 2560, which is also large enough that we do not lose much resolution in most of the IPSC sequences.

I then subsampled the 640 x 640 masks down to either 80 x 80 or 160 x 160 and generated the RLE training data using these. I also applied all of the patch dataset augmentation techniques from the river ice segmentation paper including random strides smaller than S to have overlapping patches, and applying random rotation along with horizontal and vertical flipping to the patches thus generated.
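
For reference, a bare-bones version of the sliding-window patch extraction might look like the sketch below; the fixed stride here is purely illustrative, whereas the actual pipeline uses the random strides and augmentations mentioned above:

    # Bare-bones sliding-window patch extraction (fixed stride, for illustration only).
    import numpy as np

    def extract_patches(image, patch_size=640, stride=320):
        # Yields (y, x, patch); stride < patch_size produces overlapping patches.
        h, w = image.shape[:2]
        for y in range(0, h - patch_size + 1, stride):
            for x in range(0, w - patch_size + 1, stride):
                yield y, x, image[y:y + patch_size, x:x + patch_size]

    image = np.zeros((2560, 2560, 3), dtype=np.uint8)
    print(sum(1 for _ in extract_patches(image)))   # 49 patches at this stride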

Lengths-As-Class (LAC)

As mentioned before, multi-class segmentation masks require 3 tokens per run since we need an extra token for the class ID. This can unnecessarily increase the sequence length so we can go back to using 2 tokens per run by combining the length and class tokens into a single composite token that represents both.
This can be done by treating the lengths as classes too, whereby each unique combination of run length and class ID corresponds to a separate token, so that the total number of LAC tokens is the product of the maximum run length (i.e. S) and the number of classes. The first S LAC tokens then correspond to runs of class 1, the next S tokens correspond to class 2 and so on.
The IPSC dataset has 2 classes so that, with S=80, we get the number of starts tokens = 80*80 = 6400, number of LAC tokens = 80*2 = 160 and vocabulary size = 6400 + 160 = 6560.
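
A minimal sketch of this indexing, assuming the LAC tokens are laid out immediately after the start tokens:

    # Sketch of Lengths-As-Class (LAC) token indexing for S=80 and 2 classes.
    # The placement of LAC tokens after the start tokens is an assumption.
    S, NUM_CLASSES = 80, 2
    NUM_STARTS = S * S                     # 6400 start tokens
    NUM_LAC = S * NUM_CLASSES              # 160 LAC tokens
    VOCAB_SIZE = NUM_STARTS + NUM_LAC      # 6560

    def lac_token(length, class_id):
        # Combine run length (1..S) and class id (1..NUM_CLASSES) into one token:
        # the first S LAC tokens encode class 1, the next S encode class 2, etc.
        assert 1 <= length <= S and 1 <= class_id <= NUM_CLASSES
        return NUM_STARTS + (class_id - 1) * S + (length - 1)

    def decode_lac(token):
        idx = token - NUM_STARTS
        return idx % S + 1, idx // S + 1   # (length, class_id)

    print(VOCAB_SIZE, lac_token(5, 2), decode_lac(lac_token(5, 2)))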

Visualization

Following are some visualizations of the semantic segmentation tokenization process.
Binary Mask with Row-Major Flattening:


Binary mask with column-major flattening:


Multi-class mask with separate class tokens:


Multi-class mask with LAC tokens:

Video Segmentation

A straightforward extension of this idea to perform semantic segmentation on videos would involve flattening the N x S x S 3D mask into a 1D vector of size N*S^2 using either row-major or column-major ordering.
Row-major or C-ordering (3D-C) results in the runs for individual images in the video simply getting concatenated together, i.e. all the runs for the first image come together in a sequence, followed by the runs for the second image and so on. This does not account for spatiotemporal consistencies in the masks since the runs corresponding to the same physical object from different images are completely unrelated as shown in the following visualizations for N=2:


Column-major or Fortran ordering (3D-F) partially accounts for spatiotemporal mask consistency but has large numbers of very short runs due to small offsets in the mask between consecutive frames. This not only significantly increases the sequence length but the very short, often unit sized, runs are also likely to be difficult to train on.


Another problem with this straightforward 3D flattening is that the number of starts tokens increases linearly with N and becomes impractical for N > 5.

Time-As-Class (TAC)

These issues can be largely resolved by combining the temporal dimension with class IDs so that every possible combination of class IDs across the video frames is considered a separate class.
Following are the TAC classes for both binary and multi-class cases with N=2 and N=3, where bkg refers to background while ips and dif are the two classes of cells in the IPSC dataset:
TAC

Due to this combinatorial approach, the total number of TAC tokens becomes (C+1)^N - 1, where C is the number of foreground classes. Even though this increases exponentially with N, the small exponent bases of 2 and 3 for the binary and multi-class cases enable the vocabulary size to remain practical for up to N=12 and N=9 respectively. This is helped by the fact that the number of starts tokens now becomes independent of N since the 3D video mask with C classes has effectively been collapsed into a 2D mask with (C+1)^N - 1 classes.
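
The sketch below illustrates the combinatorial mapping; the exact ordering of the TAC classes in my implementation may differ, but the counts match the figures that follow:

    # Sketch of the Time-As-Class (TAC) mapping: a tuple of per-frame class ids
    # (0 = background) is collapsed into one id via base-(C+1) encoding, and the
    # all-background tuple remains background, giving (C+1)^N - 1 TAC classes.
    from itertools import product

    def tac_class(per_frame_classes, num_fg_classes):
        base = num_fg_classes + 1
        code = 0
        for c in per_frame_classes:   # interpret the tuple as a base-(C+1) number
            code = code * base + c
        return code                   # 0 = background, 1..(C+1)^N - 1 = TAC classes

    for n, c in [(2, 1), (3, 1), (2, 2), (3, 2)]:
        combos = [t for t in product(range(c + 1), repeat=n) if any(t)]
        codes = sorted(tac_class(t, c) for t in combos)
        print(f"N={n}, C={c}: {len(combos)} TAC classes, ids 1..{codes[-1]}")   # 3, 7, 8, 26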
Following are visualizations for TAC tokenization for both N=2 and N=3. The two masks on the top right are the TAC masks before and after subsampling, having respective resolutions of 640 x 640 and 80 x 80.
Binary mask with N=2 (TAC classes = 3)


Binary mask with N=3 (TAC classes = 7)


Multi-Class mask with N=2 (TAC classes = 8)


Multi-class mask with N=3 (TAC classes = 26)


Length-and-Time-As-Class (LTAC)

TAC and LAC tokenization techniques can be combined to represent each run with only 2 tokens for video masks too. However, the number of LTAC tokens = S*((C+1)^N - 1) becomes impractical for N > 8 and N > 5 for the binary and multi-class cases respectively. Following are visualizations of LTAC tokenization for both cases with N=2.



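As a quick sanity check on the vocabulary sizes involved, the LTAC token count from the formula above can be tabulated as follows (the 32K limit refers to the overall vocabulary cap mentioned earlier, which also has to accommodate the start tokens):

    # LTAC vocabulary-size arithmetic: S * ((C+1)^N - 1) tokens.
    def num_ltac_tokens(s, num_fg_classes, n):
        return s * ((num_fg_classes + 1) ** n - 1)

    S = 80
    print(num_ltac_tokens(S, 1, 8))   # binary, N=8      -> 20400 (still practical)
    print(num_ltac_tokens(S, 1, 9))   # binary, N=9      -> 40880 (exceeds 32K)
    print(num_ltac_tokens(S, 2, 5))   # multi-class, N=5 -> 19360
    print(num_ltac_tokens(S, 2, 6))   # multi-class, N=6 -> 58240 (exceeds 32K)
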
Architectures

Image

Pix2Seq supports two backbone architectures - ResNet-50 and ViT - each with three input sizes - 640 x 640, 1024 x 1024 and 1333 x 1333. I have only worked with the smallest version - 640 x 640 ResNet-50 - so as to maximize N with the 24 GB of GPU memory available to me. The network itself has a transformer-based encoder-decoder architecture where the encoder employs only self-attention over image features while the autoregressive decoder has both self-attention over sequence features and cross-attention between sequence and image features.

Encoder

The encoder takes the 640 x 640 RGB image as input and applies multi-headed self-attention to convert it into 400 x 256 features. This diagram summarizes the encoder architecture:

Image Encoder

Here, B is the batch size and the Flatten Feat module refers to the spatial flattening of the 20 x 20 feature maps from the ResNet-50 backbone into 1D feature vectors of size 400. The Self MHA x 6 module refers to the multi-headed attention module that applies self-attention to the image features and is repeated 6 times.

Decoder

The decoder takes the 400 x 256 image features from the encoder along with the sequence tokens. It then applies multi-headed self-attention to the sequence embedding features followed by cross-attention between the sequence and image features. This self+cross MHA module is also repeated 6 times like the encoder.
This diagram summarizes the decoder architecture:
Image Decoder
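
The shape bookkeeping through the encoder and decoder can be summarized with the small numpy sketch below; it replaces the learned attention layers with a single unlearned cross-attention step purely to trace the tensor shapes:

    # Shape-only sketch of the encoder-decoder (learned layers replaced by
    # random projections; for tracing tensor shapes, not a working model).
    import numpy as np

    B = 2                                              # batch size
    backbone_feat = np.random.randn(B, 20, 20, 2048)   # ResNet-50 output for 640 x 640 input
    flat_feat = backbone_feat.reshape(B, 400, 2048)    # Flatten Feat: 20*20 -> 400
    proj = np.random.randn(2048, 256) * 0.01           # projection to 256 dims
    img_feat = flat_feat @ proj                        # B x 400 x 256, refined by Self MHA x 6
    seq_feat = np.random.randn(B, 500, 256)            # embedded sequence tokens
    # Decoder cross-attention: sequence features (queries) attend to image features.
    attn = np.einsum('bqc,bkc->bqk', seq_feat, img_feat) / np.sqrt(256)
    attn = np.exp(attn - attn.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = np.einsum('bqk,bkc->bqc', attn, img_feat)    # B x 500 x 256
    print(img_feat.shape, out.shape)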

Video

As mentioned before, I wanted to make as few changes to this baseline architecture as possible in order to be able to use most of the pre-trained weights. I have come up with three ways to adapt it for processing videos that differ in the stage of the pipeline at which the features corresponding to individual video frames are fused together.

Early Fusion

This method replaces the ResNet-50 backbone with a video-specific backbone such as the Video Swin Transformer or 3D-ResNet so that the feature fusion happens within the backbone itself. I have only experimented with the Video Swin Transformer so far.
Early-Fusion
This method only changes the number of backbone feature maps from 2048 to 768 while the rest of the pipeline remains unchanged. This means that the only part of the remaining pipeline for which we cannot use pretrained weights is the projection MLP that projects the flattened backbone features to 400 x 256. Of course, we also lose the pretrained weights for the backbone itself, but the Video Swin Transformer implementation I am using comes with its own weights, which seem to work well enough.

Middle Fusion

Here, we first flatten the temporal dimension along the batch dimension to replace B with B*N images and feature maps while leaving the rest of the encoder pipeline unchanged, so that all of the operations are performed on each one of the video frames independently.
Middle-Fusion
Once we have the B*N x 400 x 256 output from the self-MHA module, we unflatten the temporal dimension to separate out the 400 x 256 features for each one of the N video frames. We then apply pairwise compositional cross-attention to the video frames. This means that we apply cross-attention first between the features of images 1 and 2, then between the output of this operation and image 3 features, then between that output and image 4 features and so on.
Middle-Fusion Cross-MHA
This method allows us to use all the pretrained weights but we have to learn the compositional cross-MHA weights from scratch. This can sometimes lead to a bit of instability during training, especially for larger values of N, though early results do indicate that this method has the best performance among the three fusion methods.
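
A single-head numpy sketch of the pairwise compositional fusion is given below; the real module uses learned multi-head attention, so this only conveys the order in which frames are composed:

    # Single-head sketch of pairwise compositional cross-attention over frames.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attend(q_feat, kv_feat):
        # q_feat, kv_feat: (400, 256); queries attend to keys/values of kv_feat.
        attn = softmax(q_feat @ kv_feat.T / np.sqrt(q_feat.shape[-1]))
        return attn @ kv_feat

    def compositional_fusion(frame_feats):
        # Fuse frame 1 with frame 2, that result with frame 3, and so on.
        fused = frame_feats[0]
        for feat in frame_feats[1:]:
            fused = cross_attend(fused, feat)
        return fused                                   # (400, 256)

    frames = [np.random.randn(400, 256) for _ in range(4)]   # N = 4
    print(compositional_fusion(frames).shape)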

Late Fusion

This is identical to middle fusion as far as generating the B*N x 400 x 2048 backbone features, after which the temporal dimension is unflattened to separate out frame-specific features. A 3-D position embedding is then added to these to encode information about the position of each frame within the video. The features for all the frames in each video are then concatenated together to create a feature map of size B x 400N x 2048 which is processed normally for the remainder of the encoder pipeline, except that we now have 400*N features instead of 400.

Late-Fusion Encoder

Each of these 400*N image features is then cross-attended with each of the 500 sequence features in the decoder so that every output token is directly able to attend to every single frame.

Late-Fusion Decoder
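
The late-fusion reshaping can be sketched as follows; the position embedding here is a random stand-in for the learned 3-D embedding:

    # Sketch of late fusion: unflatten the temporal dimension, add a position
    # embedding (random stand-in for the learned 3-D embedding), and concatenate
    # frame features along the token axis.
    import numpy as np

    B, N = 2, 4
    backbone_feat = np.random.randn(B * N, 400, 2048)    # per-frame backbone features
    frame_feat = backbone_feat.reshape(B, N, 400, 2048)  # unflatten the temporal dimension
    pos_embed = np.random.randn(N, 400, 2048) * 0.02     # stand-in 3-D position embedding
    frame_feat = frame_feat + pos_embed                  # broadcast over the batch
    fused = frame_feat.reshape(B, N * 400, 2048)         # B x 400N x 2048
    print(fused.shape)   # every output token can now cross-attend to every frame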

Results

I am still working on debugging parts of the inference and evaluation pipeline so the final quantitative results are not available yet but early qualitative results are very promising and suggest that both video detection and segmentation methods work well enough to be comparable to the state of the art.
Both qualitative and quantitative results will be added soon.