Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception

CoRL 2022

1New York University
2Massachusetts Institute of Technology

The scene is completed through multi-robot collaboration.

The completed scene can be directly used in any perception task without fine-tuning.

Individual reconstruction

Scene completion

Object detection

Semantic segmentation

The results for object detection and semantic segmentation are obtained by directly feeding the completion output to single-agent perception models without any fine-tuning. For object detection, the red boxes are predicted bounding boxes, and the green boxes are the ground truth.

Task-specific vs. task-agnostic collaboration. As shown in (a), the task-specific paradigm learns a different model with a different loss for each task. In the task-agnostic paradigm shown in (b), the model instead learns to directly reconstruct the multi-robot scene from each robot's message; the reconstruction is independent of, yet still usable by, all downstream tasks.
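The task-agnostic interface described in (b) can be sketched as a toy example. All functions and shapes below are hypothetical stand-ins, not the paper's models: the point is only that one learned completion module feeds several frozen single-agent task heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pretrained SINGLE-agent perception heads.
# In the task-agnostic paradigm they stay frozen; only the completion
# (collaboration) module is learned.
def detect(scene):
    return int((scene > 1.5).sum())          # toy "detector": count strong cells

def segment(scene):
    return (scene > 0.0).astype(int)         # toy "segmenter": binary labels

def complete_scene(individual_views):
    """Toy collaboration module: fuse each robot's partial view into one
    completed scene (here, a simple element-wise max over the views)."""
    return np.max(np.stack(individual_views), axis=0)

views = [rng.standard_normal((4, 4)) for _ in range(3)]   # 3 robots' views
scene = complete_scene(views)

# One completed scene feeds ALL downstream tasks without per-task retraining:
n_objects = detect(scene)
seg_map = segment(scene)
print(n_objects, seg_map.shape)
```

The design choice this illustrates: collaboration is learned once against a reconstruction objective, so adding a new downstream task requires no new collaboration model.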


Collaborative perception learns how to share information among multiple robots so that they perceive the environment better than any of them could individually. Past research on this topic has been task-specific, e.g., detection or segmentation. This leads to different information being shared for different tasks, hindering the large-scale deployment of collaborative perception. We propose the first task-agnostic collaborative perception paradigm that learns a single collaboration module in a self-supervised manner for different downstream tasks. This is done via a novel task termed multi-robot scene completion, in which each robot learns to effectively share information for reconstructing the complete scene observed by all robots. Moreover, we propose a spatiotemporal autoencoder (STAR) that amortizes the communication cost over time through spatial sub-sampling and temporal mixing. Extensive experiments validate our method's effectiveness on scene completion and collaborative perception in autonomous driving scenarios.


  • We propose a new task-agnostic collaborative perception framework based on multi-robot scene completion, decoupling collaboration learning from downstream tasks.
  • We propose asynchronous training and synchronous inference with a shared autoencoder to solve the proposed task, eliminating the need for synchronous data for collaboration learning.
  • We develop a novel spatiotemporal autoencoder (STAR) that reconstructs scenes based on temporally mixed information. It amortizes the spatial communication volume over time to improve the performance-bandwidth trade-off.
  • We conduct extensive experiments to verify our method's effectiveness for scene completion and downstream perception in autonomous driving scenarios.
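The bandwidth-amortization idea behind STAR can be illustrated with a minimal sketch: each frame, a sender transmits only a random subset of its spatial feature tokens, and the receiver mixes tokens across recent frames. The grid sizes, keep ratio, and the keep-latest mixing rule below are illustrative assumptions, not the paper's learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 8, 8, 16        # BEV feature grid (hypothetical sizes)
keep_ratio = 0.25         # fraction of spatial tokens sent per frame
T = 4                     # frames over which bandwidth is amortized

def subsample_tokens(feat, t, keep_ratio, rng):
    """Flatten an H x W x C feature map into tokens, keep a random subset,
    and tag each kept token with its spatial index and timestamp."""
    tokens = feat.reshape(-1, feat.shape[-1])            # (H*W, C)
    n_keep = int(len(tokens) * keep_ratio)
    idx = rng.choice(len(tokens), size=n_keep, replace=False)
    return [(int(i), t, tokens[i]) for i in idx]         # (position, time, feature)

# Sender: over T frames, transmit only a quarter of the tokens each frame.
message_buffer = []
for t in range(T):
    feat_t = rng.standard_normal((H, W, C))
    message_buffer.extend(subsample_tokens(feat_t, t, keep_ratio, rng))

# Receiver: mix tokens across time; here, simply keep the most recent
# token seen at each spatial position (a stand-in for learned mixing).
mixed = {}
for pos, t, tok in message_buffer:
    if pos not in mixed or t > mixed[pos][0]:
        mixed[pos] = (t, tok)

coverage = len(mixed) / (H * W)
print(f"per-frame bandwidth: {keep_ratio:.0%} of tokens, "
      f"spatial coverage after {T} frames: {coverage:.0%}")
```

Even though each frame sends only 25% of the tokens, random sub-sampling accumulates much higher spatial coverage over a few frames, which is the performance-bandwidth trade-off the temporal mixing exploits.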

  Method

    Asynchronous training and synchronous inference. As shown in the top right, asynchronous training requires no communication between robots, whereas synchronous training, shown in the bottom right, requires communication and is optimized with respect to each task-specific loss. Synchronous inference is illustrated on the left: the sender transmits encoded representations to the receiver, which uses a mixture of spatiotemporal tokens to complete its scene observation.
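The asynchronous-training / synchronous-inference split can be sketched with a toy shared linear autoencoder (the dimensions, optimizer, and linear architecture are all illustrative assumptions, not the paper's STAR model). Because every robot trains the same encoder/decoder weights on its own logs, training needs no two robots online simultaneously; communication happens only at inference.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 32, 8              # feature dim, message (bottleneck) dim -- hypothetical

# Shared autoencoder weights used by every robot.
W_enc = rng.standard_normal((D, d)) * 0.1
W_dec = rng.standard_normal((d, D)) * 0.1

def encode(x):
    return x @ W_enc      # robot observation -> compact message

def decode(z):
    return z @ W_dec      # message -> reconstructed feature

# Asynchronous training: each robot independently fits reconstruction on
# its own logged features; no inter-robot communication is needed.
for robot_log in [rng.standard_normal((100, D)) for _ in range(3)]:
    for _ in range(200):                       # plain gradient descent on MSE
        z = encode(robot_log)
        err = decode(z) - robot_log
        g_dec = z.T @ err / len(robot_log)
        g_enc = robot_log.T @ (err @ W_dec.T) / len(robot_log)
        W_dec -= 0.05 * g_dec
        W_enc -= 0.05 * g_enc

# Synchronous inference: the sender transmits its encoded message; the
# receiver decodes it with the shared decoder to complete the scene.
sender_obs = rng.standard_normal((1, D))
message = encode(sender_obs)                   # only d floats cross the network
reconstruction = decode(message)
print("compression:", D // d, "x")
```

The key property shown: the message is d-dimensional rather than D-dimensional, and the receiver can decode it only because both sides hold the same trained weights.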

    Qualitative Results


    Visualization of our scene completion results on different scenes of V2X-Sim [1]. The three rows, from top to bottom, show: the ground-truth scene observation made by the multiple robots present in the same scene at the same time; the completed scene output by our model, based on each robot's individual observation and its corresponding pose; and the residual (difference) between the ground truth and the prediction, i.e., the part the model fails to predict.

    Visualization of our object detection results on different scenes of V2X-Sim [1]. The ground truth and the completed scene observation are shown in the first two rows, and the detection results in the bottom row. The red boxes are predicted bounding boxes, and the green boxes are the ground truth.

    Visualization of our semantic segmentation results on different scenes of V2X-Sim [1]. In the bottom row, different colors represent different semantic labels.


    @inproceedings{li2022multi,
          title={Multi-Robot Scene Completion: Towards Task-Agnostic Collaborative Perception},
          author={Li, Yiming and Zhang, Juexiao and Ma, Dekun and Wang, Yue and Feng, Chen},
          booktitle={6th Annual Conference on Robot Learning},
          year={2022}
    }