Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Published 26 Jul 2023 in cs.CL | (2307.14385v4)

Abstract: Advances in LLMs have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific LLM. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.


Authors (9)

Citations (41)


Summary

The paper demonstrates that instruction finetuning significantly boosts balanced accuracy, enabling smaller models to outperform larger ones on mental health prediction tasks.
The study evaluates zero-shot prompting, few-shot prompting, and instruction finetuning across multiple datasets; few-shot prompting yields an average 4.1% improvement in balanced accuracy over zero-shot, and instruction finetuning yields average gains of up to 23.4%.
The research offers actionable guidelines for deploying LLMs in mental health contexts while addressing ethical concerns, data efficiency, and model generalizability.

The paper "Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data" (2307.14385) explores the capability of LLMs for mental health prediction tasks using online text data and investigates methods to improve their performance in this domain. Given the significant global burden of mental health issues and the potential of natural language analysis to provide insights, the authors aim to evaluate and enhance general-purpose LLMs for mental-health-related applications, moving beyond traditional task-specific NLP models.

The research addresses the question of how to improve LLMs' capability for mental health tasks. The authors evaluate several LLMs: the open-source models Alpaca (7B), Alpaca-LoRA (7B), FLAN-T5 (11B), and LLaMA2 (70B), and the closed-source models GPT-3.5 (175B) and GPT-4 (1700B). They conduct experiments using three primary approaches: zero-shot prompting, few-shot prompting, and instruction finetuning.
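As a quick check on the size ratios quoted in the abstract ("25 and 15 times bigger", "250 and 150 times bigger"), the parameter counts listed above give the following. This assumes the ratios were derived from these counts and then rounded, which the summary does not state explicitly:

$$
\frac{175\text{B}}{7\text{B}} = 25, \qquad
\frac{175\text{B}}{11\text{B}} \approx 15.9, \qquad
\frac{1700\text{B}}{7\text{B}} \approx 242.9, \qquad
\frac{1700\text{B}}{11\text{B}} \approx 154.5
$$

Under that assumption, the GPT-3.5 figures compare against Alpaca (7B) and FLAN-T5 (11B) respectively, and the GPT-4 figures are the same comparison rounded to 250 and 150.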

Methods:

Zero-shot Prompting: This approach evaluates the LLMs' inherent ability to perform mental health tasks without domain-specific training data, relying solely on carefully crafted prompts. The prompt template includes the user's text data, a specification part ($Prompt_{Part1-S}$), a question part ($Prompt_{Part2-Q}$), and output constraints. Different strategies for $Prompt_{Part1-S}$ are explored: Basic (no enhancement), Context Enhancement (mentioning social media context), Mental Health Enhancement (asking the model to act as a psychologist), and a combination of Context and Mental Health Enhancement. $Prompt_{Part2-Q}$ is tailored to specific tasks, covering mental state prediction (stress, depression) and critical risk action prediction (suicide), for both binary and multi-class classification.
Few-shot Prompting: This method adds a few examples of input-label pairs to the prompt, allowing the model to perform in-context learning without updating its parameters. The few-shot prompt consists of multiple sample zero-shot prompts with their corresponding labels followed by the actual input prompt. The number of examples is limited by the model's maximum input token length.
Instruction Finetuning: This approach involves updating the model's parameters by training it on mental health datasets using instruction-formatted data.
- Single-dataset Finetuning: Standard finetuning on a single mental health dataset to evaluate performance on that dataset and generalization to others.
- Multi-dataset Finetuning: Leveraging instruction finetuning to train the LLM on multiple mental health datasets covering diverse tasks simultaneously. This method uses the same prompt format as zero-shot/few-shot for training examples. The authors specifically train Mental-Alpaca and Mental-FLAN-T5 using this method on a combined dataset of six tasks from four different sources.
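
The prompt structure described above (user text, specification part, question part, output constraint, with few-shot prompts built by stacking labeled zero-shot prompts) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `Post:` and `Answer:` markers, and the wording of every strategy string are assumptions made for the example.

```python
# Illustrative sketch only: the wording of every string below is an assumption,
# not the paper's actual prompt text.

SPEC_STRATEGIES = {
    "basic": "",  # no enhancement
    "context": "This post is from social media.",
    "mental_health": "As a psychologist, analyze the post.",
}
SPEC_STRATEGIES["both"] = (
    SPEC_STRATEGIES["context"] + " " + SPEC_STRATEGIES["mental_health"]
)


def build_zero_shot_prompt(text, question, constraint, strategy="basic"):
    """Assemble user text, specification part, question part, output constraint."""
    parts = [f"Post: {text}", SPEC_STRATEGIES[strategy], question, constraint]
    # Drop empty parts so the "basic" strategy adds no blank line.
    return "\n".join(p for p in parts if p)


def build_few_shot_prompt(examples, text, question, constraint, strategy="basic"):
    """Prepend labeled zero-shot prompts, then append the unlabeled query prompt."""
    shots = [
        build_zero_shot_prompt(ex_text, question, constraint, strategy)
        + f"\nAnswer: {label}"
        for ex_text, label in examples
    ]
    query = build_zero_shot_prompt(text, question, constraint, strategy) + "\nAnswer:"
    return "\n\n".join(shots + [query])
```

Because the same builder produces zero-shot prompts, few-shot demonstrations, and (paired with a label) instruction-finetuning examples, a single template can serve all three settings, which matches the summary's note that multi-dataset finetuning reuses the zero-shot/few-shot prompt format.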

Datasets and Tasks:

The study uses seven publicly available datasets containing online text data with human annotations, primarily from Reddit, but also including Twitter and SMS-like text.

Training/Testing Datasets (Reddit-based):
- Dreaddit [turcan_dreaddit_2019]: Binary stress prediction (post-level).
- DepSeverity [naseem_early_2022]: Binary depression prediction (post-level), Four-level depression prediction (post-level).
- SDCNL [haque_deep_2021]: Binary suicide ideation prediction (post-level).
- CSSRS-Suicide [gaur_knowledge-aware_2019]: Binary suicide risk prediction (user-level), Five-level suicide risk prediction (user-level).
- These datasets have human expert annotations and are split 80/20 for training/testing, ensuring no user overlap between splits.
External Evaluation Datasets:
- Red-Sam [kalinathan_data_2022, kayalvizhi2022findings]: Binary depression prediction (post-level, Reddit, not used in finetuning).
- Twt-60Users [jamil_monitoring_2017]: Binary depression prediction (post-level, Twitter).
- SAD [mauriello_sad_2021]: Binary stress prediction (post-level, SMS-like).
- These datasets are used to evaluate the generalizability of the models, particularly the finetuned ones, to different data sources and platforms.
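
The 80/20 split with no user overlap mentioned above can be approximated with a user-level split. The sketch below is one possible way to do it, not the authors' procedure: it assumes each record is a dict with a hypothetical `"user"` key, and it applies the 20% fraction to users rather than posts, so the post-level ratio is only approximately 80/20.

```python
import random


def split_by_user(records, test_fraction=0.2, seed=0):
    """Split records so that no user appears in both train and test.

    `records` is a list of dicts with a hypothetical "user" key (an assumption).
    The fraction is applied to users, so the post-level ratio is approximate.
    """
    users = sorted({r["user"] for r in records})  # sorted for reproducibility
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [r for r in records if r["user"] not in test_users]
    test = [r for r in records if r["user"] in test_users]
    return train, test
```

Splitting at the user level matters for post-level tasks because several posts by the same author in both splits would let a model score well by recognizing the author's style rather than the mental health signal.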

Results:

The primary evaluation metric is Balanced Accuracy, chosen for its robustness to class imbalance.
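
For reference, balanced accuracy in its standard definition is the mean of per-class recall. The sketch below implements that definition in plain Python; it is assumed, not confirmed by this summary, that the paper computes the metric the same way.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Each class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With nine negative samples and one positive sample, a classifier that always predicts the negative class reaches 0.9 plain accuracy but only 0.5 balanced accuracy, which is why the metric suits imbalanced mental health datasets.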

Zero-shot Prompting: Shows promising but limited performance compared to task-specific baselines (BERT, Mental-RoBERTa). FLAN-T5 and GPT-4 generally outperform the smaller models and GPT-3.5 in the zero-shot setting, with FLAN-T5 even approaching or surpassing state-of-the-art (SOTA) baselines on some tasks despite being much smaller than GPT-3.5 and GPT-4. The effectiveness of prompt enhancement strategies varies: context enhancement generally helps most models, while mental health enhancement is less consistent and sometimes harms performance for Alpaca-LoRA and FLAN-T5. Larger dialogue-focused models (LLaMA2, GPT-3.5, GPT-4) benefit more from prompt enhancements than smaller ones such as Alpaca-LoRA.
Few-shot Prompting: Provides a performance improvement over zero-shot (average 4.1% balanced accuracy increase), particularly for smaller models like Alpaca and FLAN-T5. Larger models like GPT-3.5 and GPT-4 show less significant improvement, potentially due to the challenge of integrating new examples with their vast pre-existing knowledge. Due to token limits, few-shot was tested on a subset of tasks.
Instruction Finetuning: This approach yields the largest performance boost. Mental-Alpaca and Mental-FLAN-T5, finetuned on multiple datasets and tasks, achieve substantially higher balanced accuracy than their zero-shot and few-shot counterparts (average improvements of 23.4% and 14.7% over zero-shot, respectively). Crucially, Mental-Alpaca and Mental-FLAN-T5 outperform the best zero-shot/few-shot performance of GPT-3.5 (by 10.1% and 11.6% on average) and GPT-4 (by 4.0% and 5.5% on average) on most tasks, despite being significantly smaller models. They perform on par with the state-of-the-art task-specific Mental-RoBERTa across multiple tasks simultaneously, demonstrating multi-task capability without task-specific retraining.
- Dialogue vs. Task-Solving Models: While FLAN-T5 (task-solving) is strong in zero-shot, Alpaca (dialogue-focused) shows greater improvement after finetuning, suggesting dialogue models may be better at learning from human-generated text when sufficient data is available.
- Generalizability: Finetuning on multiple datasets significantly enhances generalizability, enabling Mental-Alpaca and Mental-FLAN-T5 to perform well on external datasets from different platforms (Reddit, Twitter, SMS-like) not included in finetuning. Single-dataset finetuning provides less stable generalization.
- Data Efficiency: Finetuning requires relatively small amounts of data. Using only 1% of the training data (a few hundred samples across tasks) is often sufficient to outperform zero-shot, and performance approaches saturation after 10% of the data. Prioritizing data variation (more tasks/datasets) over quantity within a single dataset is more effective when overall data is limited.
Case Study (Reasoning): An exploratory case study using Chain-of-Thought prompting reveals that larger models like GPT-3.5 and especially GPT-4 exhibit impressive mental health reasoning capabilities, providing insightful analysis. Alpaca shows moderate ability, while FLAN-T5's reasoning is superficial. However, finetuning only on classification tasks can eliminate the reasoning ability in models like Mental-Alpaca and Mental-FLAN-T5. The study also highlights instances of incorrect or problematic reasoning from all models, including false positive predictions based on misinterpretation of context and explanations that sound logical but are based on flawed inferences, underscoring significant limitations and safety risks.
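
The data-efficiency finding above (a small fraction of data drawn from many tasks is often enough) suggests building a reduced training set by sampling the same fraction from every task and merging the results. The sketch below illustrates that idea only; the function name, the record keys (`task`, `instruction`, `output`), and the sampling scheme are assumptions, since the summary does not describe the authors' exact procedure.

```python
import random


def subsample_multi_task(task_datasets, fraction, seed=0):
    """Draw the same fraction from every task and merge into one shuffled set.

    `task_datasets` maps a task name to a list of (prompt, label) pairs.
    Record keys are hypothetical, loosely following instruction-tuning formats.
    """
    rng = random.Random(seed)
    merged = []
    for task in sorted(task_datasets):  # sorted for reproducibility
        examples = task_datasets[task]
        k = max(1, round(len(examples) * fraction))  # keep at least one per task
        for prompt, label in rng.sample(examples, k):
            merged.append({"task": task, "instruction": prompt, "output": label})
    rng.shuffle(merged)  # interleave tasks rather than training on them in blocks
    return merged
```

Sampling per task, rather than from the pooled data, keeps every task represented even at a 1% budget, in line with the recommendation to prioritize data variation over quantity within a single dataset.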

Discussion and Guidelines for Implementation:

The authors derive practical guidelines for empowering LLMs for mental health tasks:

With limited computing resources (inference or API access only), combine prompt design and few-shot prompting. Context enhancement is generally beneficial, while mental health enhancement mainly helps larger models.
With sufficient computing resources (finetuning possible), instruction finetune models on various mental health datasets. Dialogue-based models like Alpaca may learn better from human text data than task-solving models like FLAN-T5 after finetuning.
Efficient finetuning is possible with a few hundred samples across multiple datasets. Prioritize data variation over size in a single dataset for better generalization.
To enable reasoning, more curated finetuning datasets specifically designed for mental health reasoning and causality are needed.
Current LLMs struggle with complex mental health contexts, often misled by superficial text or hypothetical scenarios, leading to incorrect predictions and potentially harmful explanations.

Ethical Considerations and Deployability Gaps:

The authors strongly emphasize that the promising technical results do not equate to real-world deployability. Significant ethical concerns and gaps must be addressed, including:

Known biases (racial, gender) present in LLMs trained on human data.
Potential biases in the human-annotated datasets themselves (e.g., stereotypes, confirmation bias).
The risk of incorrect and misleading reasoning from LLMs, which could have negative consequences in a mental health context.
Privacy concerns related to handling sensitive mental health data from online platforms.

The authors stress the need for careful development, auditing, and regulation to mitigate these risks before deployment.

Limitations:

The study acknowledges limitations, including the limited range of datasets and LLMs evaluated, the non-comprehensive exploration of prompt designs, constraints on few-shot examples due to token limits, the lack of systematic evaluation of reasoning capabilities, the potential influence of Reddit-based data on pre-trained models, and the lack of fairness evaluation across demographic groups.

In conclusion, the paper demonstrates that instruction finetuning is a highly effective method for boosting LLMs' performance on multiple mental health prediction tasks simultaneously, enabling smaller finetuned models to outperform larger general-purpose models in zero/few-shot settings and match SOTA task-specific models. While LLMs show promising reasoning capabilities, significant ethical challenges, biases, and limitations in handling complex contexts must be addressed before they can be safely deployed in real-world mental health applications.
