
Dual DETRs for Multi-Label Temporal Action Detection (2404.00653v1)

Published 31 Mar 2024 in cs.CV

Abstract: Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity, therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content prior from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
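The joint query initialization described above can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the function name, tensor layout, and the use of a simple top-score selection in place of the paper's one-to-one matching are all assumptions made for clarity.

```python
# Hedged sketch of a DualDETR-style joint query initialization.
# Assumption: proposals are (center, length, score) triples in [0, 1];
# the paper's actual matching cost and tensor shapes may differ.
import numpy as np

def init_dual_queries(proposals, feats, num_queries):
    """Select top-scoring encoder proposals and initialize aligned
    instance-level and boundary-level queries from them.

    proposals: (N, 3) array of (center, length, score)
    feats:     (N, D) array of proposal content features
    Returns:
      instance_pos: (Q, 2) (center, length) position priors
      boundary_pos: (Q, 2) (start, end) position priors
      content:      (Q, D) shared content prior for both branches
    """
    # One proposal per query slot: here simply the Q highest-scoring
    # proposals stand in for the paper's one-to-one matching step.
    order = np.argsort(-proposals[:, 2])[:num_queries]
    matched = proposals[order]

    # Instance-level queries keep the (center, length) parameterization.
    instance_pos = matched[:, :2]

    # Boundary-level queries reparameterize the same proposal as
    # explicit (start, end) boundaries.
    start = matched[:, 0] - matched[:, 1] / 2
    end = matched[:, 0] + matched[:, 1] / 2
    boundary_pos = np.stack([start, end], axis=1)

    # Both branches share the matched proposal's content feature, so the
    # aligned queries can refine the same proposal with complementary cues.
    content = feats[order]
    return instance_pos, boundary_pos, content
```

Because both query sets are initialized from the same matched proposal, the two decoding branches refine a common hypothesis from complementary views (instance center vs. boundaries), which is the alignment the abstract describes.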
