
Towards Probing Contact Center Large Language Models (2312.15922v1)

Published 26 Dec 2023 in cs.CL

Abstract: Fine-tuning LLMs with domain-specific instructions has emerged as an effective method to enhance their domain-specific understanding. Yet, there is limited work that examines the core characteristics acquired during this process. In this study, we benchmark the fundamental characteristics learned by contact-center (CC) specific instruction fine-tuned LLMs with out-of-the-box (OOB) LLMs via probing tasks encompassing conversational, channel, and automatic speech recognition (ASR) properties. We explore different LLM architectures (Flan-T5 and Llama), sizes (3B, 7B, 11B, 13B), and fine-tuning paradigms (full fine-tuning vs PEFT). Our findings reveal remarkable effectiveness of CC-LLMs on the in-domain downstream tasks, with improvement in response acceptability by over 48% compared to OOB-LLMs. Additionally, we compare the performance of OOB-LLMs and CC-LLMs on the widely used SentEval dataset, and assess their capabilities in terms of surface, syntactic, and semantic information through probing tasks. Intriguingly, we note a relatively consistent performance of probing classifiers on the set of probing tasks. Our observations indicate that CC-LLMs, while outperforming their out-of-the-box counterparts, exhibit a tendency to rely less on encoding surface, syntactic, and semantic properties, highlighting the intricate interplay between domain-specific adaptation and probing task performance opening up opportunities to explore behavior of fine-tuned LLMs in specialized contexts.


Summary

  • The paper demonstrates that instruction fine-tuning LLMs on contact-center-specific data, whether via full fine-tuning or LoRA, significantly enhances in-domain task performance.
  • Experiments show an improvement of over 48% in response acceptability, along with evidence that core linguistic abilities are largely preserved after fine-tuning.
  • A comparative analysis across model architectures, sizes, and fine-tuning paradigms highlights shifts in how surface, syntactic, and semantic properties are encoded, guiding future domain adaptation.

Towards Probing Contact Center LLMs

Introduction

The evolution of LLMs has reached a stage where fine-tuning on domain-specific data has become a focal point of investigation. The paper "Towards Probing Contact Center LLMs" addresses the unique challenges and properties associated with fine-tuning LLMs for the contact center (CC) industry. This industry, characterized by domain-specific jargon, conversational dynamics, and etiquette, presents a fertile ground for deploying specialized LLMs to enhance customer interactions.

Methodology

The authors evaluate the impact of instruction fine-tuning LLMs on data specific to contact center interactions. Models such as Flan-T5 and Llama, with parameter counts ranging from 3B to 13B, undergo fine-tuning via both traditional full fine-tuning and parameter-efficient techniques such as Low-Rank Adaptation (LoRA). The paper then applies a set of probing tasks to assess the LLMs' grasp of conversational, channel, and automatic speech recognition (ASR) properties.
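
The paper does not release training code, so the following is only a minimal sketch of the two fine-tuning paradigms it compares, full fine-tuning and LoRA, using the Hugging Face `peft` library; the base model name and LoRA hyperparameters below are assumptions for illustration.

```python
# Hypothetical sketch of the two fine-tuning paradigms compared in the paper:
# full fine-tuning vs. parameter-efficient fine-tuning (PEFT) with LoRA via the
# Hugging Face `peft` library. Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any 3B-13B base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)

# PEFT: wrap the base model with low-rank adapters so that only a small
# fraction of weights is updated on the contact-center instruction data.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for adapter outputs
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()   # typically well under 1% of all parameters

# Full fine-tuning would instead update every parameter of `model` in a
# standard supervised instruction-tuning loop over the CC transcripts.
```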

Key Findings

The results underscore a substantial improvement in task performance of CC-specific LLMs (CC-LLMs) over their out-of-the-box (OOB) counterparts on downstream CC tasks. This enhancement is quantitatively reflected in an improvement of over 48% in response acceptability (Figure 1).

Figure 1: Benchmarking quality of responses generated by CC LLMs versus OOB LLMs on downstream tasks in the contact-center domain.

Probing assessments using the SentEval suite reveal a persistent encoding of linguistic properties in CC-LLMs, dispelling concerns that domain fine-tuning might dilute a model's linguistic abilities. However, CC-LLMs tend to rely less on encoding surface, syntactic, and semantic attributes, suggesting a shift in the learned properties due to domain specificity.
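
For readers unfamiliar with SentEval, the sketch below shows how its probing tasks are typically run against frozen model representations; the `embed_sentences` function is a placeholder for pooling hidden states from an OOB or CC fine-tuned model, and the layer and pooling choices used in the paper may differ.

```python
# Minimal sketch of running SentEval probing tasks against frozen LLM
# representations. `embed_sentences` is a placeholder; the paper's actual
# layer and pooling choices may differ.
import numpy as np
import senteval

def embed_sentences(sentences):
    # Placeholder: return one fixed-size vector per sentence, e.g. the mean
    # of a chosen hidden layer of the (frozen) OOB or CC fine-tuned model.
    return np.random.randn(len(sentences), 4096).astype(np.float32)

def prepare(params, samples):
    return  # no task-specific preprocessing needed in this sketch

def batcher(params, batch):
    sentences = [" ".join(tokens) for tokens in batch]
    return embed_sentences(sentences)

params = {
    "task_path": "SentEval/data",  # assumption: local path to the SentEval data
    "usepytorch": True,
    "kfold": 5,
    "classifier": {"nhid": 50, "optim": "adam", "batch_size": 64,
                   "tenacity": 3, "epoch_size": 2},
}

# The ten SentEval probing tasks span surface, syntactic, and semantic properties.
probing_tasks = ["Length", "WordContent", "Depth", "TopConstituents",
                 "BigramShift", "Tense", "SubjNumber", "ObjNumber",
                 "OddManOut", "CoordinationInversion"]
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(probing_tasks)  # per-task classifier accuracy
```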

Experiment Design

The paper designs experiments to dissect the contributions of model architectures, sizes, and fine-tuning paradigms to the performance of the LLMs. The authors juxtapose OOB LLMs and fine-tuned CC-LLMs across a variety of probing tasks, ensuring a comprehensive performance evaluation using metrics such as the Macro F1 score.
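
Macro F1 averages per-class F1 scores with equal weight, which matters when probing labels are imbalanced; a quick illustrative computation (with invented channel-style labels) looks like this:

```python
# Illustrative computation of the Macro F1 score used to compare probing
# classifiers; the labels below are invented for the example. Macro averaging
# weights every class equally, which matters when probing labels are imbalanced.
from sklearn.metrics import f1_score

y_true = ["agent", "customer", "agent", "agent", "customer", "customer"]
y_pred = ["agent", "customer", "agent", "customer", "customer", "customer"]
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
```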

Implementation Details

The implementation leverages extensive datasets of ASR transcripts from diverse sectors, paired with natural language instructions. Multi-Layer Perceptron (MLP) probing classifiers are trained on hidden-state representations extracted from each model to evaluate what linguistic knowledge is embedded. The resource-intensive model training and probing are run on AWS infrastructure.
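
As a rough sketch of this probing pipeline, the snippet below extracts mean-pooled hidden states from a frozen model and trains a small MLP probe on a channel-style label; the model checkpoint, layer choice, pooling, and toy data are assumptions rather than the authors' exact setup.

```python
# Rough sketch of the probing pipeline: mean-pool hidden states from a frozen
# model and train a small MLP probe. Checkpoint, layer, pooling, and the toy
# channel-labelled data are assumptions, not the authors' exact setup.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

model_name = "google/flan-t5-xl"  # assumption: one of the studied model families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def sentence_representation(text, layer=-1):
    """Mean-pool the hidden states of a chosen encoder layer for one utterance."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model.encoder(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[layer]          # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()   # (dim,)

# Toy probing data labelled by channel (agent vs. customer), for illustration only.
train_texts = ["thank you for calling, how may i help you", "my internet keeps dropping"]
train_labels = ["agent", "customer"]
test_texts = ["let me pull up your account", "it started yesterday evening"]
test_labels = ["agent", "customer"]

X_train = [sentence_representation(t) for t in train_texts]
X_test = [sentence_representation(t) for t in test_texts]

probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
probe.fit(X_train, train_labels)
print("Macro F1:", f1_score(test_labels, probe.predict(X_test), average="macro"))
```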

Discussion

The paper opens up perspectives on the strategic design of domain-specific LLMs for the CC sector. Despite the linguistic capacity retained after fine-tuning, the inclination of CC-LLMs towards encoding unique CC properties points to avenues for further refining domain-centric adaptation. The comparative analysis between encoder-decoder (Flan-T5) and decoder-only (Llama) architectures provides foundational insights into architecture-specific performance on conversational tasks.

Conclusion

Probing contact center LLMs uncovers not only improvements in task-aligned performance but also insight into the linguistic representations embedded within such tailored models. The research shows that while fine-tuning enhances CC-LLM performance, understanding the subtleties of the learned properties remains crucial for future advances. As AI continues to integrate more deeply with specialized industries, these insights will guide future explorations into the scalable application of LLMs in specialized domains, including but not limited to the contact center domain.
