Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality (2405.13034v2)

Published 16 May 2024 in cs.CL, cs.AI, and cs.HC

Abstract: Autonomous AI agents have emerged as promising paradigms for automatically understanding language-based environments, particularly with the rapid development of LLMs. However, fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates an LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling agents to decide their actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Finally, we benchmark several prevailing open-source LLMs, assessing their performance with and without fine-tuning on the proposed dataset. We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both the AI and HCI communities.
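The cerebral language agent described above couples an LLM with memory, planning, and tool dispatch. A minimal sketch of that decide-act-remember loop is shown below; the paper does not publish code, so `fake_llm`, the tool names, and the `observation -> decision -> result` memory format are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an LLM-driven agent loop with memory, planning,
# and tool dispatch (hypothetical names; not the paper's actual code).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CerebralAgent:
    llm: Callable[[str], str]                        # language model (stubbed below)
    tools: dict[str, Callable[[str], str]]           # XR / vision-language tools
    memory: list[str] = field(default_factory=list)  # past experiences

    def step(self, observation: str) -> str:
        # Plan: condition the LLM on accumulated memory plus the new observation.
        prompt = "\n".join(self.memory + [observation])
        decision = self.llm(prompt)                  # e.g. "xr_display: highlight next brick"
        tool_name, _, arg = decision.partition(": ")
        # Act: dispatch to the chosen tool; fall back to echoing the argument.
        result = self.tools.get(tool_name, lambda a: a)(arg)
        # Remember: store the experience so later decisions can use it.
        self.memory.append(f"{observation} -> {decision} -> {result}")
        return result

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call.
    return "xr_display: highlight next brick"

agent = CerebralAgent(
    llm=fake_llm,
    tools={"xr_display": lambda arg: f"XR shows: {arg}"},
)
print(agent.step("User asked: which brick is next?"))  # XR shows: highlight next brick
```

In the full workflow, `fake_llm` would be replaced by a call to the serving LLM and the tool registry would route to the XR runtime and the vision-language agent; the memory list is what lets the agent "decide actions based on past experiences."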

