
Abstract

LLMs represent an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused on technical frameworks for implementing LLM-driven CRS, rather than on end-user evaluations or strategic implications for firms, particularly from the perspective of the small and medium-sized enterprises (SMEs) that make up the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting and its subsequent performance in the field, using both objective system metrics and subjective user evaluations. We additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues that challenge business viability. Notably, with a median cost of $0.04 per interaction and a median latency of 5.7 s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS in SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) pipeline. Our results further indicate that relying solely on approaches such as prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfactory quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly given the trade-offs in the current technical landscape.

Overview

  • EventChat is a ChatGPT-driven conversational recommender system (CRS) developed for small to medium-sized enterprises (SMEs), focusing on leveraging prompt-based learning to make it accessible without extensive training datasets.

  • The system integrates a stage-based approach with action detection, search, recommendation, and targeted inquiries, balancing the quality of responses with cost-efficiency, and uses a turn-based dialog system.

  • Evaluation results show that while 85.5% of users rated recommendation accuracy favorably, the system's costs, latency, and limitations in prompt-based learning present significant challenges for SME adoption, highlighting the need for optimized implementations.

Implementation and Evaluation of EventChat: A ChatGPT-Driven Conversational Recommender System in an SME Context

The paper "EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context" by Kunstmann et al. presents a comprehensive analysis of the design and evaluation of a ChatGPT-driven conversational recommender system (CRS) tailored for small to medium-sized enterprises (SMEs). The investigation is twofold: detailing the system architecture and design decisions, and assessing performance via both objective metrics and subjective user evaluations. This approach critically examines the viability and strategic value of deploying a Large Language Model (LLM)-driven CRS in a resource-constrained business context.

System Design Choices and Architecture

The system design of EventChat centers on ChatGPT, using prompt-based learning to avoid the need for extensive training datasets and thereby keep the system accessible for SMEs. Three pivotal design decisions capture the strategic and technical trade-offs:

  1. Use of ChatGPT as the underlying LLM: This choice capitalized on ChatGPT's advanced capabilities in NLP without necessitating fine-tuning, which would be infeasible for a small enterprise due to cost and data constraints.
  2. Prompt-based learning: This approach facilitated straightforward integration but revealed limitations in the quality of user interactions, observed through issues like hallucinations and context misinterpretation.
  3. Attribute-based question-answering CRS: This configuration reduced the need for anthropomorphic interaction, focusing on efficiency and cost reduction over expansive engagement strategies.
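The core idea behind the second decision can be illustrated with a minimal sketch: instead of fine-tuning, task instructions and a handful of labeled examples are packed into the prompt itself, so no model weights are updated and no training dataset is required. The intent labels, example utterances, and function name below are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of prompt-based learning for action detection.
# Few-shot examples are embedded directly in the prompt; the LLM is
# never fine-tuned, which keeps the approach feasible for an SME.

FEW_SHOT_EXAMPLES = [
    ("Any jazz concerts this weekend?", "search_events"),
    ("Tell me more about the second one.", "event_details"),
]

def build_action_detection_prompt(user_message: str) -> str:
    """Assemble a few-shot classification prompt for the underlying LLM."""
    lines = [
        "Classify the user's intent as one of: "
        "search_events, event_details, other.",
        "",
    ]
    for utterance, label in FEW_SHOT_EXAMPLES:
        lines.append(f"User: {utterance}")
        lines.append(f"Intent: {label}")
        lines.append("")
    lines.append(f"User: {user_message}")
    lines.append("Intent:")
    return "\n".join(lines)
```

The prompt string would then be sent to the LLM, whose single-token completion serves as the detected action. The trade-off the paper observes follows directly from this design: everything the model needs must fit in the prompt, which inflates token usage and leaves quality dependent on the base model's zero/few-shot abilities.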

The system architecture integrates a stage-based approach encompassing action detection, search, recommendation, and targeted inquiries. The backend utilizes a turn-based dialog system, calling on a variety of external resources. A critical trade-off was noted in the architecture and prompt design between response quality and cost-efficiency.
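The stage-based, turn-based flow described above can be sketched as a dispatcher that routes each turn through action detection and then either the search-and-recommend path or a targeted follow-up inquiry. The stage names and function signatures here are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a stage-based, turn-based dialog loop.
# Stage functions are injected so each stage can call its own
# external resource (an LLM, an event database, etc.).
from dataclasses import dataclass, field

@dataclass
class DialogState:
    """State carried across turns of the turn-based dialog."""
    history: list = field(default_factory=list)
    candidates: list = field(default_factory=list)

def handle_turn(state, user_message, detect_action, search, recommend, inquire):
    """One dialog turn: detect the action, then either search and
    recommend events or ask a targeted follow-up question."""
    state.history.append(("user", user_message))
    action = detect_action(user_message, state.history)
    if action == "search_events":
        state.candidates = search(user_message)
        reply = recommend(state.candidates, user_message)
    else:
        reply = inquire(user_message, state.history)
    state.history.append(("assistant", reply))
    return reply
```

Separating the stages this way makes the cost/quality trade-off explicit: each stage is a distinct call to an external resource, so cheaper models or non-LLM logic can be substituted per stage.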

Evaluation Methodology

The evaluation employed dual metrics: subjective user feedback, collected using a revised ResQue model tailored for LLM-driven CRS, and objective performance metrics like latency and token utilization. The ResQue model incorporated newly pertinent factors such as Input Processing Performance and Consistency to effectively gauge the user experience in this interactive context.

Key Evaluation Steps:

  • Survey Design: A streamlined survey based on the ResQue model, emphasizing simplicity to reduce participant burden and ensure usability across various devices.
  • Objective Metrics: Measurements of system latency, computational cost (token usage), and systematic logging of interaction data to analyze efficiency and detect performance issues.
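The objective-metrics step above amounts to instrumenting each LLM call so that latency and token usage are logged per turn and summarized per interaction. The wrapper below is a minimal sketch of such instrumentation; the function names and the per-token price are illustrative assumptions, not the paper's values.

```python
# Sketch: log latency and token usage for each LLM call, then
# summarize with medians (robust to occasional slow outlier calls).
import statistics
import time

def timed_llm_call(llm_call, prompt, log):
    """Wrap an LLM call and record its latency and token usage."""
    start = time.perf_counter()
    reply, tokens_used = llm_call(prompt)  # llm_call returns (text, tokens)
    log.append({"latency_s": time.perf_counter() - start,
                "tokens": tokens_used})
    return reply

def summarize(log, usd_per_token):
    """Median latency and median cost across logged calls."""
    latencies = [entry["latency_s"] for entry in log]
    costs = [entry["tokens"] * usd_per_token for entry in log]
    return {"median_latency_s": statistics.median(latencies),
            "median_cost_usd": statistics.median(costs)}
```

Reporting medians rather than means, as the paper does for cost and latency, keeps the headline figures from being distorted by a few pathologically long or expensive turns.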

Results and Findings

Subjective User Evaluations

A notable 85.5% of users rated the recommendation accuracy favorably. However, the path analysis within the structural equation model (SEM) illuminated critical predictors of user satisfaction and system utility. Consistency and input processing performance emerged as significant factors influencing user beliefs about system usefulness and control, subsequently affecting user confidence and future use intent.

Objective Performance Metrics

The median interaction cost of $0.04 per message and median latency of 5.7 seconds underscore significant challenges for SMEs aiming to implement an LLM-driven CRS. The ranking phase, particularly token consumption during candidate reduction, was the primary cost driver. These performance metrics highlight pivotal areas requiring optimization for practical business applications.
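Why the ranking phase dominates cost can be seen with back-of-the-envelope arithmetic: every retrieved candidate forwarded to the LLM ranker adds its description to the prompt, so prompt tokens, and hence cost, scale linearly with the candidate count. All figures below (candidate counts, tokens per candidate, price per 1k tokens) are illustrative assumptions, not the paper's numbers.

```python
# Illustrative cost model for an LLM-as-ranker step in a RAG pipeline:
# prompt tokens grow linearly with the number of candidates sent for
# ranking, so pruning candidates before the LLM call cuts cost directly.

def ranking_cost_usd(n_candidates, tokens_per_candidate,
                     overhead_tokens, usd_per_1k_tokens):
    """Approximate prompt cost of one LLM ranking call."""
    prompt_tokens = overhead_tokens + n_candidates * tokens_per_candidate
    return prompt_tokens / 1000 * usd_per_1k_tokens

# e.g. 20 candidates x 150 tokens each + 300 overhead tokens,
# at an assumed $0.01 per 1k prompt tokens:
full = ranking_cost_usd(20, 150, 300, 0.01)    # ~ $0.033 per call
pruned = ranking_cost_usd(5, 150, 300, 0.01)   # ~ $0.0105 per call
```

Under these assumptions, pruning the candidate set from 20 to 5 before the ranking call cuts that stage's cost by roughly two-thirds, which is the lever the paper's candidate-reduction discussion points at.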

Discussion

Architectural Implications: The stage-based approach mitigated excessive costs but did not fully resolve latency issues, emphasizing the need for balancing architectural complexity with practical constraints in SME contexts.

Technological Feasibility: Prompt-based learning, while reducing entry barriers, exhibited significant quality issues, affirming the need to weigh alternative approaches such as fine-tuning or smaller, specialized LLMs, albeit with added complexity and cost.

Managerial Insights: For SMEs, the adoption of LLM-driven CRS necessitates a comprehensive cost-benefit analysis, considering operational costs against expected user benefits and strategic advantages. The inherent trade-offs between implementation simplicity and performance underscore critical decision points for technology adoption in resource-constrained settings.

Theoretical Contributions

The adapted ResQue model effectively encapsulated dimensions pertinent to LLM-driven CRS, validating the inclusion of conversational quality metrics like Consistency and Input Processing Performance. This aggregate approach, combining subjective and objective assessments, provides a robust methodological framework for future CRS evaluation in real-world applications, enhancing replicability and cross-context comparison.

Conclusion

EventChat demonstrates the feasibility and challenges of employing ChatGPT-driven CRS in SMEs. Despite encountering critical trade-offs between cost, latency, and system quality, user evaluations indicate positive reception, suggesting potential for broader deployment with optimized implementations. Future research directions include exploring fine-tuned models, strategic adoption frameworks, and extending the ResQue model to diverse contexts, ensuring robust evaluation practices in this evolving technological landscape.
