Abstract

In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr

Overview

  • Introduces MOT-DETR for 3D object detection and tracking in agro-food robotics.

  • Addresses challenges like occlusions and sensor noise in agricultural settings.

  • Employs transformers alongside convolutional networks to fuse 2D and 3D data.

  • Demonstrates improved tracking performance compared to existing methods.

  • Shows promise for robust robotic applications despite sensor inaccuracies.

Introduction

The agro-food industry relies heavily on robotics to face labor shortages and meet production demands. Precise 3D detection and localization of objects are essential for robotic systems in complex agricultural environments. However, occlusions and sensor noise can pose significant challenges.

Background and Contributions

Traditional multi-object tracking (MOT) techniques, including two-stage and recurrent methods, have made significant strides but often fall short in agricultural scenarios with low frame rates and heavy occlusion. Popular algorithms such as SORT and DeepSORT struggle with the obstructed viewpoints and large perspective changes inherent to robot motion. This paper introduces MOT-DETR, a method that jointly detects and tracks objects over time using a combination of convolutional networks and transformers. It is designed for 3D environments, enabling robots to build accurate scene representations even under occlusion.

The paper's contributions can be summarized as follows:

  • MOT-DETR: a novel deep learning method employing transformers for efficient MOT.
  • A strategy for integrating 3D data to enhance MOT in environments with complicated occlusions.
  • Comparisons between MOT-DETR and existing state-of-the-art tracking methods.
  • Testing the robustness of MOT-DETR under varied noise levels in camera pose estimations.

Approach and Architecture

The proposed MOT-DETR processes both 2D images and 3D point clouds. It leverages the self- and cross-attention mechanisms of transformers to fuse features from color images and point clouds for improved object differentiation. The network predicts 2D bounding boxes, classifies objects, and produces re-identification (re-ID) features for associating detections across views. A key distinction of MOT-DETR is that it performs detection and tracking in a single shot, which simplifies training and operation compared to recurrent methods. A minimal sketch of this two-branch design is given below.
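The summary describes the architecture only at a high level, so the following PyTorch sketch illustrates the kind of two-branch, DETR-style fusion it outlines. The backbone choice (ResNet-50), the PointNet-style point encoder, the feature dimensions, and the number of object queries are all illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torchvision


class FusionTrackerSketch(nn.Module):
    """Sketch of a DETR-style detector/tracker fusing 2D image features
    with 3D point-cloud features via a transformer. Module choices and
    dimensions are illustrative assumptions, not MOT-DETR's exact design."""

    def __init__(self, d_model=256, num_queries=50, num_classes=2, reid_dim=128):
        super().__init__()
        # 2D branch: CNN backbone over the color image (ResNet-50 is an assumption).
        backbone = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # 3D branch: a small PointNet-like encoder over the point cloud (assumption).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, d_model),
        )
        # Transformer fuses both modalities: object queries attend to the
        # concatenated 2D + 3D tokens through cross-attention.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.queries = nn.Embedding(num_queries, d_model)
        # Per-query heads: 2D box, class, and re-ID embedding.
        self.box_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.reid_head = nn.Linear(d_model, reid_dim)

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3)
        img_feat = self.img_proj(self.cnn(image))            # (B, d, h, w)
        img_tokens = img_feat.flatten(2).transpose(1, 2)     # (B, h*w, d)
        pts_tokens = self.point_mlp(points)                  # (B, N, d)
        tokens = torch.cat([img_tokens, pts_tokens], dim=1)  # fused token set
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.transformer(tokens, q)                    # (B, num_queries, d)
        return {
            "boxes": self.box_head(out).sigmoid(),
            "logits": self.cls_head(out),
            "reid": nn.functional.normalize(self.reid_head(out), dim=-1),
        }
```

Across viewpoints, tracks could then be associated by matching the per-query re-ID embeddings, for example with Hungarian matching (scipy.optimize.linear_sum_assignment) on cosine distances, which is what makes the single-shot design usable for tracking.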

Experiments and Results

MOT-DETR's performance is evaluated on both real and synthetic scenarios. Synthetic 3D models of tomato plants are generated to provide training and testing data, yielding a dataset large enough to train the deep neural network effectively. The method also remains robust when exposed to noise in camera pose estimates, suggesting suitability for real-world robotic applications with inherent sensor inaccuracies; one plausible way to simulate such noise is sketched below. Compared to state-of-the-art methods, MOT-DETR demonstrates superior tracking performance, especially in sequences with long-term occlusions and significant viewpoint changes.
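The summary does not specify the noise model used in the robustness experiments. The sketch below shows one plausible way such a test could be set up: perturb the camera-to-world transform with Gaussian noise before mapping sensed points into the world frame. The function names and noise magnitudes are illustrative assumptions.

```python
import numpy as np


def perturb_camera_pose(T_cam_world, trans_std=0.005, rot_std_deg=0.5, rng=None):
    """Apply Gaussian noise to a 4x4 camera-to-world pose matrix.
    Noise magnitudes are illustrative, not the paper's exact settings."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = T_cam_world.copy()
    # Translation noise (meters).
    noisy[:3, 3] += rng.normal(0.0, trans_std, size=3)
    # Small-angle rotation noise, composed from per-axis perturbations.
    ax, ay, az = np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    noisy[:3, :3] = Rz @ Ry @ Rx @ noisy[:3, :3]
    return noisy


def transform_points(T_cam_world, points_cam):
    """Map an (N, 3) point cloud from camera to world coordinates."""
    homog = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (T_cam_world @ homog.T).T[:, :3]
```

Under this setup, increasing trans_std and rot_std_deg distorts the registered point clouds, which is the kind of degradation the robustness experiments measure tracking performance against.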

Conclusion

By integrating 3D data with a transformer architecture, MOT-DETR marks a significant step forward in how robots can detect, track, and interact with objects in agro-food environments. It paves the way for improved automation and efficiency in settings where visual occlusion and sensor noise are prevalent, and its robustness to camera pose noise suggests utility in robot-operated systems beyond agriculture.
