LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing (2402.10294v1)

Published 15 Feb 2024 in cs.HC, cs.AI, cs.CL, and cs.MM

Abstract: Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of LLMs into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.

Authors (6)
  1. Bryan Wang (25 papers)
  2. Yuliang Li (36 papers)
  3. Zhaoyang Lv (24 papers)
  4. Haijun Xia (24 papers)
  5. Yan Xu (258 papers)
  6. Raj Sodhi (1 paper)
Citations (15)

Summary

  • The paper introduces LAVE, an LLM-powered agent that simplifies video editing with natural language directives and intelligent clip sequencing.
  • It implements a language-augmented framework that generates semantic titles and automated narratives to improve video content comprehension.
  • User studies show that LAVE effectively lowers the video editing learning curve while preserving creative control through dual interaction modalities.

LAVE: Leveraging LLMs for Enhanced Video Editing Experiences

Introduction to LAVE

Video editing is a dynamic and essential part of modern digital communication, yet it poses notable challenges, particularly for novices: the complexity and skill required to navigate professional editing software can deter would-be creators. To lower these barriers, the paper integrates LLMs into the video editing workflow. This design vision is embodied in LAVE, a system that combines LLM-powered agent assistance with language-augmented editing features to simplify and enhance the editing process.

System Design and Key Features

LAVE's architecture is designed around the goal of harnessing natural language to streamline video editing. It achieves this through several innovative components:

  • Language-Augmented Video Gallery: Automatically generated textual narrations provide semantic titles and summaries for the user's footage, facilitating an intuitive grasp of the video content without the need to manually scrub through clips.
  • Video Editing Timeline: Offers both manual editing capabilities and LLM-based planning and execution features, catering to diverse user preferences and maintaining the creative intent of the video editor.
  • Video Editing Agent: A conversational agent that assists users throughout the editing process. It understands free-form language commands and, given a user's editing objectives, plans and executes the relevant editing actions (a minimal sketch of such a plan-and-execute loop follows this list).
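
The paper does not release LAVE's implementation, but its agent follows the plan-then-execute pattern enabled by LLM function calling, which the system's references include. The sketch below is a minimal illustration of that loop, assuming the OpenAI Python SDK (v1+); the tool names (retrieve_clips, add_to_timeline), schemas, and prompts are hypothetical, not the authors' code.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical editing tools exposed to the agent; names and schemas are illustrative.
TOOLS = [
    {"type": "function", "function": {
        "name": "retrieve_clips",
        "description": "Find clips in the gallery whose descriptions match a query.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "add_to_timeline",
        "description": "Append the given clip IDs to the editing timeline, in order.",
        "parameters": {"type": "object",
                       "properties": {"clip_ids": {"type": "array",
                                                   "items": {"type": "string"}}},
                       "required": ["clip_ids"]}}},
]

def run_agent(user_goal: str, execute) -> str:
    """Plan-then-execute loop: the LLM proposes tool calls; `execute` applies them."""
    messages = [
        {"role": "system", "content":
            "You are a video-editing assistant. Plan the steps needed to meet the "
            "user's editing goal, then call the available tools to carry them out."},
        {"role": "user", "content": user_goal},
    ]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o",  # model choice is illustrative
            messages=messages,
            tools=TOOLS,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:          # no more actions: return the agent's reply
            return msg.content
        messages.append(msg)            # keep the proposed calls in the history
        for call in msg.tool_calls:     # execute each action, report the result back
            result = execute(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
```

Since LAVE lets users review and manually refine agent actions, a faithful implementation would insert a user-confirmation step between receiving the proposed tool calls and invoking `execute`.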

Implementation Insights

At the core of LAVE is an LLM-powered computational pipeline that automates tasks such as brainstorming, semantic-based video retrieval, and clip sequencing. Notably, the system uses visual language models (VLMs) to generate language descriptions of the video content; these descriptions form the linguistic foundation that lets the LLM reason about the footage and assist in editing.
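
Concretely, this pipeline can be approximated in two stages: a VLM captions sampled frames and an LLM condenses those captions into a clip-level description, which is then embedded for semantic retrieval (ChromaDB, cited by the paper, is one such vector store). The sketch below works under those assumptions; the prompts, model choices, and function names are illustrative, not LAVE's actual code.

```python
import chromadb
from openai import OpenAI

client = OpenAI()

def describe_clip(frame_captions: list[str]) -> str:
    """Condense per-frame VLM captions (e.g., from BLIP-2) into one description."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # model choice is illustrative
        messages=[{"role": "user", "content":
            "Summarize these frame captions as a one-sentence video description:\n"
            + "\n".join(frame_captions)}],
    )
    return resp.choices[0].message.content

# Index clip descriptions in a vector store for semantic retrieval.
gallery = chromadb.Client().create_collection("clip_gallery")

def index_clip(clip_id: str, description: str) -> None:
    gallery.add(ids=[clip_id], documents=[description])  # embedded automatically

def retrieve_clips(query: str, k: int = 5) -> list[str]:
    """Return IDs of the k clips most semantically similar to the query."""
    return gallery.query(query_texts=[query], n_results=k)["ids"][0]
```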

Evaluation and User Experiences

A user study with eight participants of varying editing expertise yielded positive feedback on LAVE's effectiveness. Users appreciated the flexibility of the dual interaction modalities, agent assistance and direct UI manipulation, and highlighted LAVE's role in fostering creativity and a sense of co-creation with AI. The study also underlined the importance of adaptive agent support, given the diversity of user needs and preferences across editing tasks.

Research Implications and Future Directions

LAVE's development and user study offer several insights for the future of agent-assisted content editing. Chief among them is the potential of natural language to substantially lower the barriers to complex creative tasks such as video editing. The study also identifies adaptive agent support and the preservation of user agency in the creative process as critical design considerations. Looking forward, the integration of more capable LLMs and VLMs presents promising opportunities to further streamline the video editing experience.

Conclusion

LAVE’s exploration into LLM-powered video editing represents a significant step toward democratizing video creation. By aligning the linguistic capabilities of LLMs with the visual narrative of video content, LAVE not only simplifies the editing process but also opens up new avenues for creative expression. As the technology evolves, the integration of AI in creative processes promises to unlock unprecedented opportunities for content creators, making the act of creation more accessible and enjoyable for all.
