Abstract

In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr

Overview

  • Introduces MOT-DETR for 3D object detection and tracking in agro-food robotics.

  • Addresses challenges like occlusions and sensor noise in agricultural settings.

  • Employs transformers alongside convolutional networks to fuse 2D and 3D data.

  • Demonstrates improved tracking performance compared to existing methods.

  • Shows promise for robust robotic applications despite sensor inaccuracies.

Introduction

The agro-food industry relies heavily on robotics to face labor shortages and meet production demands. Precise 3D detection and localization of objects are essential for robotic systems in complex agricultural environments. However, occlusions and sensor noise can pose significant challenges.

Background and Contributions

Traditional multi-object tracking (MOT) techniques, including two-stage and recurrent methods, have made significant strides but often fall short in agricultural scenarios with low frame rates and heavy occlusion. Popular algorithms such as SORT and DeepSORT struggle with the obstructed viewpoints and large perspective changes inherent to robot motion. This paper introduces MOT-DETR, a method that jointly detects and tracks objects over time using a combination of convolutional networks and transformers. It is designed for 3D environments, enabling robots to build accurate scene representations even under occlusion.

The paper's contributions can be summarized as follows:

  • MOT-DETR: a novel deep learning method employing transformers for efficient MOT.
  • A strategy for integrating 3D data to enhance MOT in environments with complicated occlusions.
  • Comparisons between MOT-DETR and existing state-of-the-art tracking methods.
  • Testing the robustness of MOT-DETR under varied noise levels in camera pose estimations.

Approach and Architecture

The proposed MOT-DETR processes both 2D images and 3D point clouds. It leverages the self- and cross-attention mechanisms of transformers to fuse features from color images and point clouds for improved object differentiation. The network predicts 2D bounding boxes, classifies objects, and produces re-identification (re-ID) features for associating detections across views. A key distinction of MOT-DETR is that it performs detection and tracking in a single shot, which simplifies training and operation compared to recurrent methods. A minimal sketch of this two-branch design is given below.
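The summary describes the architecture only at a high level, so the following PyTorch sketch illustrates the kind of two-branch, DETR-style fusion it outlines. The backbone choice (ResNet-50), the PointNet-style point encoder, the feature dimensions, and the number of object queries are all illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torchvision


class FusionTrackerSketch(nn.Module):
    """Sketch of a DETR-style detector/tracker fusing 2D image features
    with 3D point-cloud features via a transformer. Module choices and
    dimensions are illustrative assumptions, not MOT-DETR's exact design."""

    def __init__(self, d_model=256, num_queries=50, num_classes=2, reid_dim=128):
        super().__init__()
        # 2D branch: CNN backbone over the color image (ResNet-50 is an assumption).
        backbone = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # 3D branch: a small PointNet-like encoder over the point cloud (assumption).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, d_model),
        )
        # Transformer fuses both modalities: object queries attend to the
        # concatenated 2D + 3D tokens through cross-attention.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.queries = nn.Embedding(num_queries, d_model)
        # Per-query heads: 2D box, class, and re-ID embedding.
        self.box_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.reid_head = nn.Linear(d_model, reid_dim)

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3)
        img_feat = self.img_proj(self.cnn(image))            # (B, d, h, w)
        img_tokens = img_feat.flatten(2).transpose(1, 2)     # (B, h*w, d)
        pts_tokens = self.point_mlp(points)                  # (B, N, d)
        tokens = torch.cat([img_tokens, pts_tokens], dim=1)  # fused token set
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.transformer(tokens, q)                    # (B, num_queries, d)
        return {
            "boxes": self.box_head(out).sigmoid(),
            "logits": self.cls_head(out),
            "reid": nn.functional.normalize(self.reid_head(out), dim=-1),
        }
```

Across viewpoints, tracks could then be associated by matching the per-query re-ID embeddings, for example with Hungarian matching (scipy.optimize.linear_sum_assignment) on cosine distances, which is what makes the single-shot design usable for tracking.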

Experiments and Results

MOT-DETR's performance is evaluated on both real and synthetic scenarios. Synthetic 3D models of tomato plants are generated to provide training and testing data, yielding a dataset large enough to train the deep neural network effectively. The method also remains robust when exposed to noise in camera pose estimates, suggesting suitability for real-world robotic applications with inherent sensor inaccuracies; one plausible way to simulate such noise is sketched below. Compared to state-of-the-art methods, MOT-DETR demonstrates superior tracking performance, especially in sequences with long-term occlusions and significant viewpoint changes.
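The summary does not specify the noise model used in the robustness experiments. The sketch below shows one plausible way such a test could be set up: perturb the camera-to-world transform with Gaussian noise before mapping sensed points into the world frame. The function names and noise magnitudes are illustrative assumptions.

```python
import numpy as np


def perturb_camera_pose(T_cam_world, trans_std=0.005, rot_std_deg=0.5, rng=None):
    """Apply Gaussian noise to a 4x4 camera-to-world pose matrix.
    Noise magnitudes are illustrative, not the paper's exact settings."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = T_cam_world.copy()
    # Translation noise (meters).
    noisy[:3, 3] += rng.normal(0.0, trans_std, size=3)
    # Small-angle rotation noise, composed from per-axis perturbations.
    ax, ay, az = np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    noisy[:3, :3] = Rz @ Ry @ Rx @ noisy[:3, :3]
    return noisy


def transform_points(T_cam_world, points_cam):
    """Map an (N, 3) point cloud from camera to world coordinates."""
    homog = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (T_cam_world @ homog.T).T[:, :3]
```

Under this setup, increasing trans_std and rot_std_deg distorts the registered point clouds, which is the kind of degradation the robustness experiments measure tracking performance against.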

Conclusion

By integrating 3D data with a transformer architecture, MOT-DETR marks a significant step forward in how robots can detect, track, and interact with objects in agro-food environments. It paves the way for improved automation and efficiency in settings where visual occlusion and sensor noise are prevalent, and its robustness to camera pose noise suggests utility in robot-operated systems beyond agriculture.
