Individual reconstruction
Scene completion
Object detection
Semantic segmentation
The results for object detection and semantic segmentation are obtained by directly feeding the completion output to single-agent perception models without any fine-tuning. For object detection, the red boxes are predicted bounding boxes, and the green boxes are the ground truth.
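To make the no-fine-tuning protocol above concrete, here is a minimal, hypothetical sketch: the completed BEV map is handed unchanged to a frozen single-agent head. The toy Conv2d head and channel counts are placeholders for illustration, not the detectors or segmenters used in the paper.

```python
import torch
import torch.nn as nn

completed_bev = torch.randn(1, 13, 64, 64)        # stand-in for the completion output

detector_head = nn.Conv2d(13, 7, kernel_size=1)    # toy single-agent perception head
detector_head.eval()
for p in detector_head.parameters():               # frozen: no fine-tuning on completed scenes
    p.requires_grad_(False)

with torch.no_grad():
    predictions = detector_head(completed_bev)     # (1, 7, 64, 64) per-cell predictions
```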
Collaborative perception learns how to share information among multiple robots so that they can perceive the environment better than any robot could individually. Past research on this topic has been task-specific, such as detection or segmentation. Yet this means different information is shared for different tasks, hindering the large-scale deployment of collaborative perception. We propose the first task-agnostic collaborative perception paradigm that learns a single collaboration module in a self-supervised manner for different downstream tasks. This is done through a novel task termed multi-robot scene completion, in which each robot learns to effectively share information for reconstructing the complete scene viewed by all robots. Moreover, we propose a spatiotemporal autoencoder (STAR) that amortizes the communication cost over time through spatial sub-sampling and temporal mixing. Extensive experiments validate our method's effectiveness on scene completion and collaborative perception in autonomous driving scenarios.
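As a rough illustration of the idea behind STAR (spatial sub-sampling plus temporal mixing in a self-supervised autoencoder), the PyTorch sketch below tokenizes each BEV frame, keeps only a random subset of tokens per frame, and lets a transformer mix tokens across frames to reconstruct the full scene. This is not the released implementation; the class, module, and parameter names (STARSketch, keep_ratio, the token counts, etc.) are our own placeholders.

```python
import torch
import torch.nn as nn

class STARSketch(nn.Module):
    """Toy sketch: spatially sub-sample BEV tokens per frame, then mix tokens
    across frames with a transformer to reconstruct the complete scene."""

    def __init__(self, in_ch=13, dim=256, patch=8, keep_ratio=0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Patchify the BEV map into tokens (assumed encoder).
        self.encode = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # Transformer mixes the sparse tokens gathered over space and time.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.mix = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_pixels = nn.Linear(dim, patch * patch * in_ch)

    def tokenize(self, bev):
        # (B, C, H, W) -> (B, N, dim) patch tokens.
        return self.encode(bev).flatten(2).transpose(1, 2)

    def subsample(self, tokens):
        # Randomly keep a fraction of the tokens; this spatial sub-sampling
        # is what bounds the per-frame communication cost.
        n = tokens.shape[1]
        keep = max(1, int(n * self.keep_ratio))
        idx = torch.randperm(n, device=tokens.device)[:keep]
        return tokens[:, idx], idx, n

    def forward(self, bev_frames):
        # bev_frames: list of (B, C, H, W) BEV maps, oldest first.
        kept, last_idx, n_total = [], None, None
        for bev in bev_frames:
            tok, idx, n_total = self.subsample(self.tokenize(bev))
            kept.append(tok)
            last_idx = idx
        # Pad the current frame back to full length with mask tokens, then
        # let the transformer mix tokens from all frames (temporal mixing).
        b = kept[-1].shape[0]
        full = self.mask_token.expand(b, n_total, -1).clone()
        full[:, last_idx] = kept[-1]
        mixed = self.mix(torch.cat(kept[:-1] + [full], dim=1))
        current = mixed[:, -n_total:]          # tokens of the current frame
        return self.to_pixels(current)         # (B, N, patch*patch*in_ch)
```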
Asynchronous training and synchronous inference. As shown in the top right, asynchronous training requires no communication between robots, while synchronous training, shown in the bottom right, requires communication and is optimized with respect to each task-specific loss. Synchronous inference is illustrated on the left: the sender transmits encoded representations to the receiver, and the receiver uses a mixture of spatio-temporal tokens to complete the scene observation.
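To make the asynchronous-training point concrete, here is a hedged, toy training loop built on the STARSketch above: each robot independently optimizes the shared completion model against a complete-scene reconstruction target from the dataset, so no robot-to-robot communication is needed during training. The loader, tensor shapes, and loss are illustrative placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

model = STARSketch(in_ch=13, dim=256, patch=8)         # sketch from above
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy data: (per-robot history of BEV frames, complete-scene target patches).
loader = [
    ([torch.randn(1, 13, 64, 64) for _ in range(3)],
     torch.randn(1, (64 // 8) ** 2, 8 * 8 * 13))
    for _ in range(2)
]

for frames, complete_scene in loader:
    pred = model(frames)                        # sparse tokens in, full scene out
    loss = F.mse_loss(pred, complete_scene)     # self-supervised reconstruction loss
    optim.zero_grad()
    loss.backward()
    optim.step()
```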
 
Visualization of our scene completion results on different scenes of V2X-Sim [1]. The three rows, from top to bottom, show: the ground-truth scene observation made by the multiple robots present in the same scene at the same time; the completed scene output by our model based on each robot's individual observation and their corresponding poses; and the residual, i.e., the difference between the ground truth and the prediction, which is the part the model fails to predict.
Visualization of our object detection results on different scenes of V2X-Sim [1]. The ground truth and the completed scene observation are shown in the first two rows, and the detection results in the bottom row. The red boxes are predicted bounding boxes, and the green boxes are the ground truth.
Visualization of our semantic segmentation results on different scenes of V2X-Sim [1]. On the bottom row, different colors represent different semantic labels.
@inproceedings{li2022multi,
  title={Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception},
  author={Li, Yiming and Zhang, Juexiao and Ma, Dekun and Wang, Yue and Feng, Chen},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022}
}