Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception

CoRL 2022

New York University
Massachusetts Institute of Technology

The scene is completed through multi-robot collaboration.

The completed scene can be directly used by any perception task without fine-tuning.

Teaser panels: individual reconstruction, scene completion, object detection, semantic segmentation.

The results for object detection and semantic segmentation are obtained by directly feeding the completion output to single-agent perception models without any fine-tuning. For object detection, red boxes are predicted bounding boxes, and green boxes are the ground truth.





Task-specific vs. task-agnostic collaboration. As shown in (a), the task-specific paradigm learns a separate model with a separate loss for each task. In the task-agnostic paradigm shown in (b), the model learns to directly reconstruct the multi-robot scene from each robot's message; the reconstruction is independent of, yet still usable by, all downstream tasks.
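To make the task-agnostic paradigm concrete, the minimal sketch below shows a completed scene being reused by frozen single-agent heads without fine-tuning, as in the teaser. The module names, feature sizes, and convolutional stand-ins are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

C, H, W = 32, 128, 128                                 # assumed BEV feature size

completion_model = nn.Conv2d(2 * C, C, 3, padding=1)   # stand-in for the completion model
detector = nn.Conv2d(C, 8, 1)                          # stand-in for a pretrained detection head
segmenter = nn.Conv2d(C, 5, 1)                         # stand-in for a pretrained segmentation head
for head in (detector, segmenter):
    head.requires_grad_(False)                         # frozen: no task-specific fine-tuning

@torch.no_grad()
def task_agnostic_inference(ego_bev, received_bev):
    # Complete the scene once from the ego view plus the received message,
    completed = completion_model(torch.cat([ego_bev, received_bev], dim=1))
    # then reuse the same completed scene for every downstream task.
    return detector(completed), segmenter(completed)

boxes, seg = task_agnostic_inference(torch.randn(1, C, H, W),
                                     torch.randn(1, C, H, W))

Because the downstream heads stay frozen, supporting a new task only requires plugging in another pretrained single-agent model.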

Abstract

Collaborative perception learns how to share information among multiple robots so that they perceive the environment better than each could individually. Past research on this has been task-specific, such as detection or segmentation, which leads to different information being shared for different tasks and hinders the large-scale deployment of collaborative perception. We propose the first task-agnostic collaborative perception paradigm that learns a single collaboration module in a self-supervised manner for different downstream tasks. This is done by a novel task termed multi-robot scene completion, in which each robot learns to effectively share information for reconstructing a complete scene viewed by all robots. Moreover, we propose a spatiotemporal autoencoder (STAR) that amortizes the communication cost over time through spatial sub-sampling and temporal mixing. Extensive experiments validate our method's effectiveness on scene completion and collaborative perception in autonomous driving scenarios.
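The bandwidth-saving idea in the abstract, spatial sub-sampling at the sender and temporal mixing at the receiver, can be sketched roughly as below. The token shapes, the random sampling rule, and the function names are assumptions made for illustration; the actual STAR model is an autoencoder whose architecture and sampling strategy differ in detail.

import torch

def subsample_tokens(bev_feat, keep_ratio=0.25):
    """Sender side: keep only a subset of BEV tokens to transmit each frame."""
    B, C, H, W = bev_feat.shape
    tokens = bev_feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
    n_keep = max(1, int(keep_ratio * H * W))
    idx = torch.randperm(H * W)[:n_keep]               # random spatial subset
    return tokens[:, idx], idx                         # small message + token ids

def temporal_mix(message_buffer):
    """Receiver side: pool tokens received over several past frames, so each
    per-frame message stays small while spatial coverage accumulates over time."""
    tokens = torch.cat([t for t, _ in message_buffer], dim=1)
    ids = torch.cat([i for _, i in message_buffer], dim=0)
    return tokens, ids                                 # mixed spatio-temporal tokens

# Example: four frames, each transmitting only 25% of the BEV tokens.
buffer = [subsample_tokens(torch.randn(1, 32, 32, 32)) for _ in range(4)]
mixed_tokens, mixed_ids = temporal_mix(buffer)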

Contributions

  • We propose the first task-agnostic collaborative perception framework, based on multi-robot scene completion, decoupling collaboration learning from downstream tasks.
  • We propose asynchronous training and synchronous inference with a shared autoencoder to solve the proposed task, eliminating the need for synchronous data for collaboration learning.
  • We develop a novel spatiotemporal autoencoder (STAR) that reconstructs scenes based on temporally mixed information. It amortizes the spatial communication volume over time to improve the performance-bandwidth trade-off.
  • We conduct extensive experiments to verify our method's effectiveness for scene completion and downstream perception in autonomous driving scenarios.

Method



Asynchronous training and synchronous inference. Asynchronous training (top right) does not require communication between robots, whereas synchronous training (bottom right) requires communication and is optimized with respect to each specific task loss. Synchronous inference is illustrated on the left: the sender transmits encoded representations to the receiver, which uses a mixture of spatio-temporal tokens to complete the scene observation.
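A minimal sketch of the two phases in this figure is given below, assuming a simple convolutional encoder-decoder, an MSE reconstruction loss, and additive message fusion; these names and choices are placeholders rather than the paper's actual architecture or objective.

import torch
import torch.nn as nn

C = 32
encoder = nn.Conv2d(C, C, 3, padding=1)   # shared across robots
decoder = nn.Conv2d(C, C, 3, padding=1)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def asynchronous_training_step(own_bev):
    """Each robot trains the shared autoencoder to reconstruct its own
    observation; no inter-robot communication or synchronized data is needed."""
    recon = decoder(encoder(own_bev))
    loss = nn.functional.mse_loss(recon, own_bev)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def synchronous_inference(ego_bev, received_codes):
    """At inference, the receiver fuses its own encoding with the encoded
    representations transmitted by other robots, then decodes once."""
    fused = encoder(ego_bev)
    for code in received_codes:        # messages already encoded by the senders
        fused = fused + code           # simple additive fusion as a stand-in
    return decoder(fused)

loss = asynchronous_training_step(torch.randn(1, C, 64, 64))
scene = synchronous_inference(torch.randn(1, C, 64, 64),
                              [encoder(torch.randn(1, C, 64, 64))])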


Qualitative Results

     


Visualization of our scene completion results on different scenes of V2X-Sim [1]. The three rows, from top to bottom, show: the ground-truth scene observation made by the multiple robots present in the same scene at the same time; the completed scene output by our model given each robot's individual observation and corresponding pose; and the residual (the difference between the ground truth and the prediction), i.e., the part the model fails to predict.


Visualization of our object detection results on different scenes of V2X-Sim [1]. The ground truth and the completed scene observation are shown in the first two rows, and the detection results in the bottom row. Red boxes are predicted bounding boxes, and green boxes are the ground truth.


Visualization of our semantic segmentation results on different scenes of V2X-Sim [1]. In the bottom row, different colors represent different semantic labels.

BibTeX

@inproceedings{li2022multi,
  title={Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception},
  author={Li, Yiming and Zhang, Juexiao and Ma, Dekun and Wang, Yue and Feng, Chen},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022}
}