Large Language Models as Zero-shot Dialogue State Tracker through Function Calling (2402.10466v4)
Abstract: Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogue (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains unsatisfactory. In this work, we propose FnCTOD, a novel approach for solving DST with LLMs through function calling. This method improves zero-shot DST, allowing adaptation to diverse domains without extensive data collection or model tuning. Our experimental results demonstrate that the approach achieves exceptional performance with both modestly sized open-source and proprietary LLMs: with in-context prompting, it enables various 7B and 13B parameter models to surpass the previous state of the art (SOTA) achieved by ChatGPT, and it improves ChatGPT's performance, beating the previous SOTA by 5.6% average joint goal accuracy (JGA); the individual results of GPT-3.5 and GPT-4 are boosted by 4.8% and 14%, respectively. We also show that by fine-tuning on a small collection of diverse task-oriented dialogues, we can equip a modestly sized model, specifically a 13B parameter LLaMA2-Chat model, with function-calling capabilities and DST performance comparable to ChatGPT while maintaining its chat capabilities. We have made the code publicly available at https://github.com/facebookresearch/FnCTOD
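The core idea of treating DST as function calling can be illustrated with a minimal sketch: each task domain is exposed to the LLM as a "function" whose parameters are the domain's slots, and the model's emitted call is parsed back into a dialogue state. The schema, tag format, and helper names below (`find_restaurant`, `parse_function_call`, `<fn_call>` delimiters) are illustrative assumptions, not the paper's actual prompt or implementation.

```python
import json

# Hypothetical domain schema: slots of a restaurant-search domain
# rendered as a JSON-Schema-style function specification.
RESTAURANT_FN = {
    "name": "find_restaurant",
    "description": "Search for a restaurant matching the user's constraints.",
    "parameters": {
        "type": "object",
        "properties": {
            "area": {"type": "string", "description": "part of town"},
            "food": {"type": "string", "description": "cuisine type"},
            "pricerange": {"type": "string",
                           "enum": ["cheap", "moderate", "expensive"]},
        },
    },
}

def schema_prompt(fn_spec):
    """Render the function spec into the system prompt shown to the LLM."""
    return "You can call the following function:\n" + json.dumps(fn_spec, indent=2)

def parse_function_call(model_output):
    """Extract slot-value pairs (the dialogue state) from a model output
    containing a call such as:
        <fn_call> {"name": ..., "arguments": {...}} </fn_call>
    """
    start = model_output.index("<fn_call>") + len("<fn_call>")
    end = model_output.index("</fn_call>")
    call = json.loads(model_output[start:end])
    # Dialogue state = domain-qualified slots with non-empty values.
    return {f'{call["name"]}-{slot}': value
            for slot, value in call["arguments"].items() if value}

# Example turn: "I'd like a cheap Italian place in the centre."
output = ('<fn_call> {"name": "find_restaurant", "arguments": '
          '{"area": "centre", "food": "italian", "pricerange": "cheap"}} '
          '</fn_call>')
state = parse_function_call(output)
# state == {"find_restaurant-area": "centre",
#           "find_restaurant-food": "italian",
#           "find_restaurant-pricerange": "cheap"}
```

Because the domain is specified entirely through the function schema in the prompt, swapping in a new domain requires only a new schema, which is what enables the zero-shot adaptation the abstract describes.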