
Sketch-based Video Object Localization (2304.00450v3)

Published 2 Apr 2023 in cs.CV

Abstract: We introduce Sketch-based Video Object Localization (SVOL), a new task aimed at localizing spatio-temporal object boxes in a video queried by an input sketch. We first outline the challenges of the SVOL task and build the Sketch-Video Attention Network (SVANet) with the following design principles: (i) to consider the temporal information of the video and bridge the domain gap between sketch and video; (ii) to accurately identify and localize multiple objects simultaneously; (iii) to handle various styles of sketches; (iv) to be classification-free. In particular, SVANet is equipped with a Cross-modal Transformer that models the interaction between learnable object tokens, the query sketch, and the video through attention operations, and it is trained with a per-frame set matching strategy that enables frame-wise prediction while utilizing global video context. We evaluate SVANet on a newly curated SVOL dataset. By design, SVANet successfully learns the mapping between query sketches and video objects, achieving state-of-the-art results on the SVOL benchmark. We further confirm the effectiveness of SVANet via extensive ablation studies and visualizations. Lastly, we demonstrate its transfer capability on unseen datasets and novel categories, suggesting its high scalability in real-world applications.
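The abstract mentions a per-frame set matching strategy without spelling it out. As a rough illustration only, the sketch below pairs predicted boxes with ground-truth boxes independently in each frame by minimizing a total cost. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`l1_cost`, `match_frame`, `match_video`), the `(cx, cy, w, h)` box format, the plain L1 cost, and the brute-force search, which stands in for an efficient assignment solver and is only practical for a handful of boxes. It also assumes each frame has at least as many predictions as ground-truth boxes.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two boxes given as (cx, cy, w, h). Assumed cost, not the paper's."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_frame(pred_boxes, gt_boxes):
    """Match ground-truth boxes to predictions within a single frame.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction paired with ground-truth box j. Brute force over all
    injective assignments, so only suitable for small sets.
    """
    best_assignment, best_cost = (), float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[i], gt_boxes[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


def match_video(pred_per_frame, gt_per_frame):
    """Run the matching independently for every frame of a video."""
    return [match_frame(p, g) for p, g in zip(pred_per_frame, gt_per_frame)]


# Toy two-frame video: frame 0 has two target objects, frame 1 has none.
preds = [
    [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)],
    [(0.2, 0.2, 0.1, 0.1), (0.6, 0.6, 0.2, 0.2)],
]
gts = [
    [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)],
    [],
]
matches = match_video(preds, gts)
print(matches[0][0])  # indices of the predictions paired with each ground-truth box in frame 0
```

Matching each frame separately means every frame receives its own assignment of predictions to targets, and a frame with no target object simply yields an empty assignment.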

