A Dataset and Framework for Learning State-invariant Object Representations (2404.06470v2)

Published 9 Apr 2024 in cs.CV, cs.IR, and cs.LG

Abstract: We add one more invariance - state invariance - to the invariances more commonly used for learning object representations for recognition and retrieval. By state invariance, we mean robustness with respect to changes in the structural form of an object, such as when an umbrella is folded or when an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. We believe this dataset will facilitate research in fine-grained recognition and retrieval of 3D objects that are capable of state changes. The goal of such research is to train models that learn discriminative object embeddings that remain invariant to state changes while also staying invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge is that instances of different objects (both within and across categories) under various state changes may share similar visual characteristics and therefore lie close to one another in the learned embedding space, making it more difficult to discriminate between them. To address this, we propose a curriculum learning strategy that, during training, progressively selects object pairs with smaller inter-object distances in the learned embedding space, thereby gradually sampling harder-to-distinguish examples of visually similar objects both within and across categories. An ablation on the role of curriculum learning indicates an improvement of 7.9% in object recognition accuracy and 9.2% in retrieval mAP over the state-of-the-art on our new dataset, as well as on three other challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D.
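
The curriculum strategy described in the abstract is, at its core, a distance-aware pair-sampling policy: as training progresses, pairs of distinct objects whose current embeddings lie closer together are sampled more and more often. The sketch below shows one plausible form of such a schedule in NumPy; the function name select_curriculum_pairs, the linear percentile schedule, and all parameter values are illustrative assumptions for exposition, not the paper's actual implementation.

    import numpy as np

    def select_curriculum_pairs(embeddings, object_ids, progress, num_pairs=128, rng=None):
        """Sample pairs of *different* objects for metric learning, progressively
        favouring harder pairs (smaller embedding distance) as progress in [0, 1]
        increases. Illustrative sketch only; not the paper's implementation."""
        rng = np.random.default_rng() if rng is None else rng

        # Pairwise Euclidean distances between all current embeddings.
        diff = embeddings[:, None, :] - embeddings[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)

        # Candidate pairs: upper triangle, restricted to distinct object identities.
        i, j = np.triu_indices(len(embeddings), k=1)
        keep = object_ids[i] != object_ids[j]
        i, j, d = i[keep], j[keep], dist[i, j][keep]

        # Assumed linear curriculum: at progress = 0 every pair is eligible;
        # at progress = 1 only the closest (hardest) 10% of pairs remain.
        eligible_frac = 1.0 - 0.9 * progress
        cutoff = np.quantile(d, eligible_frac)
        eligible = np.flatnonzero(d <= cutoff)

        chosen = rng.choice(eligible, size=min(num_pairs, len(eligible)), replace=False)
        return list(zip(i[chosen], j[chosen]))

    # Example: 200 views of 50 objects embedded in a 64-d space.
    emb = np.random.default_rng(0).normal(size=(200, 64))
    ids = np.random.default_rng(1).integers(0, 50, size=200)
    mid_training_pairs = select_curriculum_pairs(emb, ids, progress=0.5)

In practice a sampler of this kind would be re-run periodically (for example, every epoch) on embeddings produced by the current model, so that the notion of a "hard" pair tracks the evolving embedding space.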

