Abstract

We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems, which facilitates systematic evaluations and comparisons between ToD systems built by fine-tuning Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, the toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both the local utterance level and the global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel at producing diverse and likeable responses. However, we also identify significant challenges for LLMs in adhering to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems, and that it will lower the currently still high entry barriers in the field.

Overview

  • Introduces DIALIGHT, a new toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.

  • Combines automatic and human evaluation; automatic metrics include Joint Goal Accuracy, BLEU, and METEOR.

  • Supports multilingual evaluation in languages such as Arabic, French, and Turkish, in addition to English.

  • Reveals that while PLM-based systems are more accurate and coherent, LLM-based systems produce more diverse and likeable responses but struggle to follow task-specific instructions and to generate outputs in multiple languages.

  • Invites contributions from the research community to this open-source toolkit and highlights the potential for future advancements in conversational AI.

Introduction to DIALIGHT

The development and evaluation of Task-Oriented Dialogue (ToD) systems are crucial for creating efficient and user-friendly AI-driven conversational agents. In light of this, the authors introduce DIALIGHT, a novel toolkit that streamlines the process of building and benchmarking multilingual ToD systems. The toolkit is engineered to facilitate comparisons between systems that fine-tune Pretrained Language Models (PLMs) and those that leverage the more recent zero-shot and in-context learning capabilities of LLMs.
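To make the contrast concrete, the sketch below shows how an LLM-based system might frame one ToD subtask, dialogue state tracking, as an in-context learning problem, whereas a PLM-based system would instead be fine-tuned on annotated dialogues. This is an illustrative assumption, not DIALIGHT's actual API; the function name and example format are invented.

```python
# Illustrative only: `build_icl_prompt` and the example format are invented,
# not part of DIALIGHT.

FEW_SHOT_EXAMPLES = [
    {"user": "I need a cheap hotel in the city centre.",
     "state": {"hotel-pricerange": "cheap", "hotel-area": "centre"}},
    {"user": "Book a table for two at an Italian restaurant.",
     "state": {"restaurant-food": "italian", "restaurant-people": "2"}},
]

def build_icl_prompt(user_utterance: str) -> str:
    """Assemble a few-shot prompt for dialogue state tracking."""
    lines = ["Extract the dialogue state as slot-value pairs."]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"User: {ex['user']}")
        lines.append(f"State: {ex['state']}")
    lines.append(f"User: {user_utterance}")
    lines.append("State:")  # the LLM completes this line
    return "\n".join(lines)

print(build_icl_prompt("Find me a Turkish restaurant in the north."))
```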

DIALIGHT Features and Capabilities

One of the most notable features of DIALIGHT is its dual-focused evaluation methodology, which combines automatic and human evaluation. The automatic evaluation covers a variety of metrics, including Joint Goal Accuracy, BLEU, and METEOR, among others. Human evaluation is supported by a secure and intuitive web interface that allows assessments at both the utterance and dialogue levels, enabling both granular and holistic analysis of ToD systems.
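Among these metrics, Joint Goal Accuracy is the strictest: a turn counts as correct only if the entire predicted dialogue state matches the gold annotation exactly. The following is a minimal sketch of how it is typically computed, not DIALIGHT's own implementation:

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose predicted dialogue state exactly matches the
    gold state. Each state is a dict of slot-value pairs, e.g.
    {"hotel-area": "centre", "hotel-pricerange": "cheap"}.
    """
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)

# One of two turns matches exactly, so JGA = 0.5.
preds = [{"hotel-area": "centre"}, {"hotel-area": "north", "hotel-stars": "4"}]
golds = [{"hotel-area": "centre"}, {"hotel-area": "north"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```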

Crucially, the toolkit supports multilingual development, enabling evaluation of systems in languages such as Arabic, French, and Turkish, in addition to English. This is a significant step toward addressing the performance disparities observed in non-English ToD systems. The toolkit also uses a microservice-based backend, which enhances efficiency and scalability, making it a robust resource for researchers.
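As a rough illustration of the microservice pattern, each pipeline component can be exposed as an independent HTTP service so that model servers scale separately from the evaluation interface. The sketch below is a hypothetical example, not DIALIGHT's actual backend; the endpoint and field names are assumptions:

```python
# Hypothetical sketch of a microservice-style ToD backend (invented names,
# not DIALIGHT's actual API). Requires: pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TurnRequest(BaseModel):
    dialogue_id: str
    user_utterance: str
    language: str  # e.g. "ar", "fr", "tr", "en"

class TurnResponse(BaseModel):
    system_response: str
    dialogue_state: dict

@app.post("/respond", response_model=TurnResponse)
def respond(turn: TurnRequest) -> TurnResponse:
    # A real deployment would call a separately hosted model service here;
    # a stub keeps the example self-contained and runnable.
    return TurnResponse(
        system_response=f"[{turn.language}] echo: {turn.user_utterance}",
        dialogue_state={},
    )

# Run (assuming this file is saved as service.py):
#   uvicorn service:app --port 8000
```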

Comparative Analysis of ToD Systems

The toolkit has already been used to carry out systematic evaluations. The findings suggest that while ToD systems built by fine-tuning PLMs generally display higher accuracy and coherence, LLM-based systems excel at generating more diverse and likeable responses. However, LLMs present their own challenges, particularly in faithfully following task-specific instructions and producing outputs across multiple languages.

Looking Forward

The introduction of DIALIGHT is poised to lower entry barriers in the field and provide valuable insights into the development of ToD systems. While DIALIGHT allows for in-depth comparative research and could pave the way for improvements in multilingual ToD systems, the gaps identified in current research highlight the need for future studies that refine the use of LLMs, especially in tasks that require strict adherence to guidelines in diverse linguistic contexts.

The toolkit is an open-source resource, and the creators encourage adaptation and contributions from the broader research community to extend its capabilities and applications. With the groundwork laid by DIALIGHT, the field of conversational AI is positioned for exciting developments, particularly in building systems that can interact effectively across numerous languages and cultures.
