MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots (2311.15674v3)

Published 27 Nov 2023 in cs.RO

Abstract: In the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we prove that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show how our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr


Summary

  • The paper presents a transformer-based single-shot method that fuses 2D images and 3D point clouds to overcome occlusions and sensor noise.
  • It employs self- and cross-attention mechanisms for effective object detection, classification, and re-identification in agricultural environments.
  • Experimental results show superior tracking performance compared to traditional methods, enhancing robotic operation in complex agro-food settings.

Introduction

The agro-food industry increasingly relies on robotics to address labor shortages and meet production demands. Precise 3D detection and localization of objects is essential for robotic systems operating in complex agricultural environments, yet occlusions and sensor noise pose significant challenges.

Background and Contributions

Traditional multi-object tracking (MOT) techniques, including two-stage and recurrent methods, have made significant strides but often fall short in agricultural scenarios with low frame rates and heavy occlusion. Popular algorithms such as SORT and DeepSORT struggle when viewpoints are obstructed and perspectives change drastically between frames, as is inherent to robot motion. This paper introduces MOT-DETR, a transformer-based method that jointly detects and tracks objects over time using convolutional networks and transformers. The approach is designed for 3D environments, enabling robots to construct accurate scene representations even under occlusion.

The paper's contributions can be summarized as follows:

  • MOT-DETR: a single-shot deep learning method that employs transformers for efficient MOT.
  • A strategy for integrating 3D data to improve MOT in environments with severe occlusions.
  • A comparison of MOT-DETR against existing state-of-the-art tracking methods.
  • An evaluation of MOT-DETR's robustness under varying levels of camera pose noise.

Approach and Architecture

The proposed MOT-DETR processes both 2D images and 3D point clouds. It leverages the self- and cross-attention mechanisms of transformers to fuse color-image and point-cloud features for improved object differentiation. The network predicts 2D bounding boxes, classifies objects, and produces re-identification (re-ID) features used to associate objects across viewpoints. A key distinction of MOT-DETR is that it performs detection and tracking in a single shot, which simplifies training and operation compared to recurrent methods.
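As a rough illustration of this fusion scheme, the sketch below combines CNN features from an RGB image with point-cloud tokens in a single DETR-style transformer that emits boxes, class logits, and re-ID embeddings per object query. It is a minimal sketch under stated assumptions, not the authors' implementation: the backbone choice, token dimensions, point-cloud encoder, and head layouts are illustrative guesses.

```python
# Hypothetical sketch of 2D/3D fusion with a DETR-style transformer.
# Not the MOT-DETR implementation; all module choices are assumptions.
import torch
import torch.nn as nn
import torchvision


class FusionDetTrack(nn.Module):
    def __init__(self, d_model=256, num_queries=50, num_classes=2, reid_dim=128):
        super().__init__()
        # 2D branch: ResNet feature map projected to the shared token width
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        self.img_proj = nn.Conv2d(512, d_model, kernel_size=1)
        # 3D branch: simple per-point MLP over (x, y, z) coordinates
        self.pc_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, d_model))
        # Encoder self-attends over fused tokens; decoder cross-attends from object queries
        self.transformer = nn.Transformer(d_model=d_model, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.queries = nn.Embedding(num_queries, d_model)
        # Output heads: box regression, classification, re-ID embedding
        self.box_head = nn.Linear(d_model, 4)                # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.reid_head = nn.Linear(d_model, reid_dim)

    def forward(self, image, points):
        # image: (B, 3, H, W); points: (B, N, 3)
        img_tokens = self.img_proj(self.cnn(image)).flatten(2).transpose(1, 2)  # (B, HW, d)
        pc_tokens = self.pc_mlp(points)                                         # (B, N, d)
        memory = torch.cat([img_tokens, pc_tokens], dim=1)  # fused 2D + 3D token sequence
        tgt = self.queries.weight.unsqueeze(0).expand(image.shape[0], -1, -1)
        hs = self.transformer(memory, tgt)                   # (B, num_queries, d)
        return {
            "boxes": self.box_head(hs).sigmoid(),
            "logits": self.cls_head(hs),
            "reid": nn.functional.normalize(self.reid_head(hs), dim=-1),
        }


model = FusionDetTrack()
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1024, 3))
print({k: v.shape for k, v in out.items()})
```

In a pipeline of this shape, the per-query re-ID embeddings would be matched across viewpoints (for example by cosine similarity) to maintain object identities over time, which is the role the paper assigns to its re-identification features.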

Experiments and Results

MOT-DETR's performance is evaluated on real and synthetic scenarios. Synthetic 3D models of tomato plants are generated to provide training and testing data, yielding a large dataset for training the deep neural network. The method also remains resilient to noise in camera pose estimates, suggesting suitability for real-world robotic applications with inherent sensor inaccuracies. Compared with state-of-the-art methods, MOT-DETR achieves superior tracking performance, especially in sequences with long-term occlusions and large viewpoint changes.
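The pose-noise robustness test can be pictured with a small experiment like the one below, which registers the same camera-frame point cloud under a clean and a perturbed camera pose. The noise model (Gaussian translation offsets and small Euler-angle rotations) and its magnitudes are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical sketch: perturb a camera pose and observe how the registered
# point cloud shifts. Noise model and magnitudes are illustrative assumptions.
import numpy as np


def noisy_pose(pose, trans_std=0.01, rot_std_deg=1.0, rng=np.random.default_rng(0)):
    """Perturb a 4x4 camera-to-world pose with Gaussian translation and rotation noise."""
    noisy = pose.copy()
    noisy[:3, 3] += rng.normal(0.0, trans_std, size=3)          # translation noise (meters)
    angles = np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))   # small Euler perturbations
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    noisy[:3, :3] = Rz @ Ry @ Rx @ noisy[:3, :3]
    return noisy


def points_to_world(points_cam, pose):
    """Transform an (N, 3) camera-frame point cloud to the world frame with a 4x4 pose."""
    homog = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    return (pose @ homog.T).T[:, :3]


# Usage: the same camera-frame cloud registered with a clean vs. a noisy pose
pose = np.eye(4)
cloud_cam = np.random.rand(2048, 3)
clean = points_to_world(cloud_cam, pose)
perturbed = points_to_world(cloud_cam, noisy_pose(pose))
print(np.abs(clean - perturbed).mean())
```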

Conclusion

By integrating 3D data with a transformer architecture, MOT-DETR marks a significant step forward in how robots can perceive, track, and interact with their environments in the agro-food industry. It paves the way for improved automation and efficiency in settings where visual occlusion and sensor noise are prevalent. The model's robustness to camera pose noise also suggests its potential utility in robot-operated systems beyond agriculture.
