ActFormer: Scalable Collaborative Perception via Active Queries

1New York University
2Tsinghua University, work done during an internship at NYU

Abstract

Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view. Many prior works receive messages broadcast from all collaborators, leading to a scalability challenge when dealing with a large number of robots and sensors. In this work, we aim to address scalable camera-based collaborative perception with a Transformer-based architecture. Our key idea is to enable a single robot to intelligently discern the relevance of the collaborators and their associated cameras according to a learned spatial prior. This proactive understanding of the visual features' relevance does not require the transmission of the features themselves, enhancing both communication and computation efficiency. Specifically, we present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs. Each BEV query can actively select relevant cameras for information aggregation based on pose information, instead of interacting with all cameras indiscriminately. Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries, showcasing the effectiveness of ActFormer in multi-agent collaborative 3D object detection.

Contribution

  • We conceptualize a scalable and efficient collaborative perception framework that can actively and intelligently identify the most relevant sensory measurements based on spatial knowledge, without relying on the sensory measurements themselves.
  • We ground the concept of scalable collaborative perception with a Transformer, i.e., ActFormer, which uses a group of 3D-to-2D BEV queries to actively and efficiently aggregate features from multi-robot multi-camera input, relying only on pose information.
  • We conduct comprehensive experiments on the task of collaborative object detection to verify the effectiveness and efficiency of our ActFormer.

Method



Our motivation stems from the idea that how vehicles collaboratively perceive should be closely tied to their relative poses. Different camera poses result in different viewpoints, each capturing unique information. However, conventional collaborative methods often treat all viewpoints equally, overlooking the fact that these camera perspectives offer different insights into the environment: some unique, some overlapping, and some redundant. Consequently, the ego vehicle may not fully capitalize on the diverse perspectives available, and such indiscriminate collaboration incurs excessive communication and computation. In fact, communication may be unnecessary when some partners share very similar observations.
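
To make this pose-driven relevance concrete, the sketch below (our illustration, not the authors' released code) shows one simple way a BEV query could select relevant cameras from pose information alone: a query's 3D reference point is projected into each collaborator camera using its extrinsics and intrinsics, and only cameras whose frustum actually contains the point are kept for feature aggregation. The function name `select_relevant_cameras` and the default image size are illustrative assumptions.

```python
# Minimal sketch, assuming world-to-camera extrinsics and pinhole intrinsics
# are shared as pose metadata (no image features are exchanged at this stage).
import numpy as np

def select_relevant_cameras(query_xyz, cam_extrinsics, cam_intrinsics,
                            image_hw=(900, 1600)):
    """Return a (Q, N) boolean mask: query q attends to camera n only if its
    reference point projects inside camera n's image plane.

    query_xyz:      (Q, 3) BEV query reference points in the shared world frame
    cam_extrinsics: (N, 4, 4) world-to-camera rigid transforms (all robots' cameras)
    cam_intrinsics: (N, 3, 3) pinhole intrinsic matrices
    image_hw:       (H, W) image resolution, assumed identical across cameras
    """
    Q, N = query_xyz.shape[0], cam_extrinsics.shape[0]
    H, W = image_hw
    pts_h = np.concatenate([query_xyz, np.ones((Q, 1))], axis=1)  # (Q, 4) homogeneous
    mask = np.zeros((Q, N), dtype=bool)
    for n in range(N):
        cam_pts = (cam_extrinsics[n] @ pts_h.T).T[:, :3]          # points in camera frame
        in_front = cam_pts[:, 2] > 0.1                            # keep points ahead of the camera
        uvw = (cam_intrinsics[n] @ cam_pts.T).T                   # project to the image plane
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-5, None)        # pixel coordinates
        in_image = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
                   (uv[:, 1] >= 0) & (uv[:, 1] < H)
        mask[:, n] = in_front & in_image
    return mask
```

Under this scheme, cross-attention (and the corresponding feature transmission) is restricted to the camera views flagged by the mask, so both communication and computation scale with the number of genuinely relevant views rather than with the total number of collaborators and cameras.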