Motion and Appearance Based Moving Object Detection Network for Autonomous Driving
For autonomous driving, moving objects like vehicles and pedestrians are of critical importance as they primarily influence the maneuvering and braking of the car. Typically, they are detected by motion segmentation of dense optical flow, augmented by a CNN-based object detector for capturing semantics. In this paper, our aim is to jointly model motion and appearance cues in a single convolutional network. We propose a novel two-stream architecture for joint learning of object detection and motion segmentation. We designed three different flavors of our network to establish a systematic comparison. It is shown that joint training of the tasks significantly improves accuracy compared to training them independently. Although motion segmentation has relatively less data than vehicle detection, the shared fusion encoder benefits from the joint training to learn a generalized representation.
We created our own publicly available annotation on KITTI (KITTI MoSeg, short for KITTI Motion Segmentation) by extending KITTI raw sequences labelled with detections to obtain static/moving annotations on the vehicles. We compared against MPNet as a baseline, which is the current state of the art for CNN-based motion detection. It is shown that the proposed two-stream architecture improves the mAP score by 21.5% on KITTI MOD. We also evaluated our algorithm on the non-automotive DAVIS dataset and obtained accuracy close to the state-of-the-art performance. The proposed network runs at 8 fps on a Titan X GPU using a basic VGG16 encoder.
We suggest a pipeline to automatically generate static/moving classification for objects. The procedure uses odometry information and annotated 3D bounding boxes for vehicles. The odometry information, which includes GPS/IMU readings, is used to compute the velocity of the moving camera. The 3D bounding boxes of the annotated vehicles are projected to 2D images and tagged with their corresponding 3D centroids. The 2D bounding boxes are associated between consecutive frames using intersection over union. The vehicle velocities are then estimated from the associated 3D centroids. The computed velocity vector per bounding box is compared to the odometry ground truth to determine the static/moving classification of vehicles. Objects that are consistently identified as moving across multiple frames are kept. In this dataset, the focus is on vehicles in the car, truck, and van object categories. Finally, a dilated convolution network trained on Cityscapes is used to obtain weakly annotated motion masks.
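The association and classification steps described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset. The function names, the greedy matching strategy, the IoU and speed thresholds, and the assumption that centroids are expressed in the camera frame of each image (so that ego velocity is added back and camera rotation is ignored) are all illustrative choices that the description above does not specify.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two 2D boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(prev_boxes, curr_boxes, min_iou=0.5):
    """Greedily pair each current box with the unused previous box of highest IoU.

    Returns a list of (prev_index, curr_index) pairs. min_iou is an assumed threshold.
    """
    matches, used = [], set()
    for j, curr in enumerate(curr_boxes):
        best_i, best_score = -1, min_iou
        for i, prev in enumerate(prev_boxes):
            if i in used:
                continue
            score = iou(prev, curr)
            if score >= best_score:
                best_i, best_score = i, score
        if best_i >= 0:
            used.add(best_i)
            matches.append((best_i, j))
    return matches

def is_moving(prev_centroid, curr_centroid, ego_velocity, dt, speed_thresh=1.0):
    """Classify one associated vehicle as moving (True) or static (False).

    Assumes centroids are in the camera frame, so a static object appears to
    move with the negative ego velocity. speed_thresh (m/s) is an assumed value.
    """
    relative_velocity = (np.asarray(curr_centroid, float) - np.asarray(prev_centroid, float)) / dt
    object_velocity = relative_velocity + np.asarray(ego_velocity, float)
    return bool(np.linalg.norm(object_velocity) > speed_thresh)

def consistently_moving(flags, min_frames=3):
    """Keep an object only if it was flagged as moving on enough consecutive frames."""
    return len(flags) >= min_frames and all(flags)
```

For example, with the camera driving forward at 10 m/s and a frame interval of 0.1 s, a parked car whose centroid gets 1 m closer between frames yields an estimated object velocity of zero and is labelled static, while a car that keeps a constant distance to the camera is labelled moving.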
Please refer to this paper for more details on the method and experimental results on both DAVIS and KITTI MoSeg.
We provide the KITTI MoSeg annotation that was used in this work. Note that the provided weakly annotated segmentation masks are not the ones used in the paper. However, the bounding box ground truth and its static/moving classification provided here are the ones used during training and evaluation. This is why the paper mainly refers to the data as KITTI MOD: the annotation was originally intended for detection and was later augmented to provide better segmentation masks.
Please cite these papers when this dataset is used: