LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing (2402.10294v1)
Abstract: Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of LLMs into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.
Authors: Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, Raj Sodhi