Efficient Pre-training for Localized Instruction Generation of Videos (2311.15964v4)

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation in procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released our code and dataset.
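The two-stage curation idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the actual similarity model and threshold used by Sieve-&-Swap are not given in the abstract, so a simple token-overlap (Jaccard) similarity and an arbitrary threshold stand in here.

```python
# Hedged sketch of the Sieve-&-Swap idea: Sieve drops transcript
# segments that match no human-written instruction well; Swap replaces
# each surviving segment with its most similar human-written instruction.
# The similarity function and threshold are placeholder assumptions.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two sentences (placeholder metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def sieve_and_swap(transcript_segments, recipe_instructions, sieve_threshold=0.2):
    """Return a curated list of human-written instructions, one per
    transcript segment that survives the Sieve stage."""
    curated = []
    for seg in transcript_segments:
        # Find the most similar human-written instruction for this segment.
        best = max(recipe_instructions, key=lambda r: jaccard(seg, r))
        if jaccard(seg, best) >= sieve_threshold:  # Sieve: filter irrelevant text
            curated.append(best)                   # Swap: use the written instruction
    return curated

transcripts = [
    "now we chop the onions finely",
    "don't forget to like and subscribe",  # irrelevant chatter, should be sieved out
]
recipes = ["Finely chop the onions.", "Simmer the sauce for ten minutes."]
print(sieve_and_swap(transcripts, recipes))  # → ['Finely chop the onions.']
```

In practice a learned sentence-embedding similarity would replace the Jaccard placeholder, but the control flow — filter by relevance, then substitute the retrieved written instruction — is the essence of the technique as described.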

