AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction (2305.09620v3)

Published 16 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs that produce human-like responses have begun to revolutionize research practices in the social sciences. We develop a novel methodological framework that fine-tunes LLMs with repeated cross-sectional surveys to incorporate the meaning of survey questions, individual beliefs, and temporal contexts for opinion prediction. We introduce two new emerging applications of the AI-augmented survey: retrodiction (i.e., predict year-level missing responses) and unasked opinion prediction (i.e., predict entirely missing responses). Among 3,110 binarized opinions from 68,846 Americans in the General Social Survey from 1972 to 2021, our models based on Alpaca-7b excel in retrodiction (AUC = 0.86 for personal opinion prediction, $\rho$ = 0.98 for public opinion prediction). These remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. On the other hand, our fine-tuned Alpaca-7b models show modest success in unasked opinion prediction (AUC = 0.73, $\rho$ = 0.67). We discuss practical constraints and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. Our study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs can broaden survey potential, while surveys can improve the alignment of LLMs.

Summary

  • The paper presents a novel framework that fine-tunes large language models with survey data to impute missing responses, retrodict historical trends, and predict unasked opinions.
  • The methodology integrates semantic, individual belief, and period embeddings via a Deep Cross Network to capture higher-order interactions in opinion data.
  • Evaluation on the General Social Survey shows robust performance with AUC metrics up to 0.866, demonstrating the framework’s practical and empirical validity.

AI-Augmented Surveys: Fine-Tuning LLMs for Opinion Prediction in Social Science

Introduction and Motivation

The paper "AI-Augmented Surveys: Leveraging LLMs and Surveys for Opinion Prediction" (2305.09620) addresses the longstanding challenge of predicting public opinion trends and individual attitudes in the social sciences. Traditional survey research, exemplified by the General Social Survey (GSS), is limited by cost, respondent fatigue, and the inability to ask all relevant questions across all periods. Meanwhile, digital trace data (e.g., social media) offer scale but lack representativeness and ground-truth validation. The authors propose a methodological framework that fine-tunes LLMs with repeated cross-sectional survey data, enabling the prediction of missing, retrodicted, and entirely unasked opinions at both the individual and aggregate levels.

Problem Formulation: Types of Missingness in Survey Data

The paper formalizes three core prediction tasks in survey research, each corresponding to a distinct missing data scenario:

  1. Missing Data Imputation: Predicting items that respondents skipped or left unanswered within existing survey waves.
  2. Retrodiction: Predicting responses to questions in years when they were not fielded, enabling reconstruction of historical opinion trends.
  3. Unasked Opinion Prediction: Predicting responses to questions never asked in the survey, effectively extrapolating to new variables (a code sketch of the three schemes follows the figure).

Figure 1: Three types of missing data challenges in survey research, illustrating response-level, year-level, and variable-level missingness.
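
As a concrete illustration, the three schemes can be expressed as masks over a long-format response table. This is a toy sketch; the column names, sizes, and thresholds are hypothetical and do not reflect the paper's actual data layout.

```python
import numpy as np
import pandas as pd

# Toy long-format survey: one row per (respondent, question, year) response.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "respondent": rng.integers(0, 100, 1_000),
    "question":   rng.integers(0, 20, 1_000),
    "year":       rng.choice([1994, 2000, 2008, 2021], 1_000),
    "agree":      rng.integers(0, 2, 1_000),
})

# 1. Imputation: hide individual responses at random within fielded waves.
mask_impute = rng.random(len(df)) < 0.10

# 2. Retrodiction: hide all responses to one question in selected waves.
mask_retro = (df["question"] == 5) & (df["year"] < 2008)

# 3. Unasked prediction: hide one question everywhere, across all waves.
mask_unasked = df["question"] == 7
```

Each mask defines a held-out evaluation set; the model trains on the remaining rows.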

Methodological Framework: Personalized LLMs with Embedding Interactions

The core innovation is a model architecture that integrates three types of embeddings:

  • Semantic Embedding: Encodes the meaning of survey questions using sentence-level representations from pre-trained LLMs (e.g., Alpaca-7b).
  • Individual Belief Embedding: Learns latent representations for each respondent, capturing their unique belief system.
  • Period Embedding: Encodes temporal context, allowing the model to account for historical shifts in meaning and opinion structure.

These embeddings are combined via a Deep Cross Network (DCN) to capture higher-order interactions, enabling the model to predict the probability of a positive response for any (individual, question, year) tuple.

Figure 2: Overview of the methodological framework, showing the aggregation of individual-level predictions to population-level estimates and the joint optimization of semantic, belief, and period embeddings.
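
A minimal architecture sketch in TensorFlow is shown below, using the Cross layer from TensorFlow Recommenders (the library named in the implementation section). All layer sizes, layer names, and the way the three embeddings are combined are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

N_RESPONDENTS, N_WAVES = 68_846, 33  # GSS respondents and survey waves
SEM_DIM, LATENT_DIM = 4096, 32       # illustrative; 4096 matches a 7B model's hidden size

# Inputs: a precomputed question embedding from the frozen LLM, plus integer IDs.
question_emb  = tf.keras.Input(shape=(SEM_DIM,), name="question_embedding")
respondent_id = tf.keras.Input(shape=(), dtype=tf.int32, name="respondent_id")
wave_id       = tf.keras.Input(shape=(), dtype=tf.int32, name="wave_id")

# Trainable projection of the frozen semantic embedding, plus learned
# belief (per-respondent) and period (per-wave) embeddings.
semantic = tf.keras.layers.Dense(LATENT_DIM)(question_emb)
belief   = tf.keras.layers.Embedding(N_RESPONDENTS, LATENT_DIM, name="belief_embedding")(respondent_id)
period   = tf.keras.layers.Embedding(N_WAVES, LATENT_DIM, name="period_embedding")(wave_id)

x = tf.keras.layers.Concatenate()([semantic, belief, period])
x = tfrs.layers.dcn.Cross()(x)                       # explicit feature crossing (DCN-V2)
x = tf.keras.layers.Dense(64, activation="relu")(x)  # deep component
p_agree = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(positive response)

model = tf.keras.Model([question_emb, respondent_id, wave_id], p_agree)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```

Training then reduces to binary classification over observed (respondent, question, wave) triples, with held-out cells defining the three prediction tasks above.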

Data and Implementation

The model is fine-tuned on the GSS, comprising 68,846 respondents and 3,110 binarized opinion variables across 33 survey waves (1972–2021). Survey questions are binarized using a combination of SentenceBERT-based semantic similarity and manual coding. The architecture is model-agnostic, supporting both decoder-only (Alpaca-7b, GPT-J-6B) and encoder-only (RoBERTa-large) LLMs. Fine-tuning is performed using the TensorFlow Recommenders and Hugging Face APIs, with the LLM parameters frozen except for a projection layer to mitigate overfitting.
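
For the semantic inputs, obtaining fixed question vectors from a frozen encoder through the Hugging Face API looks roughly like the sketch below. It assumes RoBERTa-large and mean pooling over token states; the paper's exact checkpoint handling and pooling strategy may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
encoder.requires_grad_(False)  # keep the LLM frozen; only downstream layers train

@torch.no_grad()
def embed_question(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one fixed-size question vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vec = embed_question("Should marijuana be made legal?")  # a GSS-style item
```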

Model Evaluation and Performance

Evaluation is conducted via 10-fold cross-validation, with distinct schemes for each missingness scenario. The primary metric is AUC (Area Under the ROC Curve), with additional reporting of accuracy and F1-score. Matrix factorization serves as a benchmark for imputation and retrodiction tasks.

Figure 3: Model performance for predicting missing responses at individual and aggregate levels, including ROC curves and the relationship between observed and predicted agreement rates.
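
The two headline quantities can be computed with standard tooling. A toy sketch with invented numbers: individual-level AUC over held-out responses, and the aggregate-level correlation between observed and predicted agreement rates within (question, year) cells.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Invented held-out data: true binary responses and model probabilities.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
p_hat  = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.7, 0.9, 0.1, 0.6])
auc = roc_auc_score(y_true, p_hat)  # individual-level performance

# Aggregate level: three toy (question, year) cells of four responses each.
cell = np.repeat([0, 1, 2], 4)
obs  = np.array([y_true[cell == c].mean() for c in np.unique(cell)])
pred = np.array([p_hat[cell == c].mean() for c in np.unique(cell)])
rho, _ = pearsonr(obs, pred)  # the aggregate-level ρ reported in the paper
```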

Key Results

  • Missing Data Imputation: Alpaca-7b achieves AUC = 0.866, comparable to matrix factorization (AUC = 0.852).
  • Retrodiction: Alpaca-7b achieves AUC = 0.860, outperforming matrix factorization (AUC = 0.798). Correlation between predicted and observed aggregate opinions exceeds 0.98.
  • Unasked Opinion Prediction: Alpaca-7b achieves AUC = 0.729, with a lower aggregate correlation (ρ = 0.68), indicating the increased difficulty of this task.

The model's performance is robust across missing data mechanisms (missing completely at random, missing at random, and missing not at random) and degrades gracefully as the proportion of missing data increases.

The retrodiction capability enables the reconstruction of historical opinion trends for questions introduced late or discontinued early in the GSS. For example, the model accurately reconstructs the rise in support for same-sex marriage prior to the question's introduction in 2008, and predicts stable or shifting trends for issues such as busing and vegetarianism.

Figure 4: Counterfactual trend prediction for selected GSS questions, comparing model-based retrodictions to matrix factorization and observed data.
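
Operationally, retrodicting a trend amounts to scoring every respondent in a wave and averaging the predicted probabilities. The sketch below reuses the hypothetical Keras `model` from the architecture sketch above; the function and argument names are illustrative.

```python
import numpy as np

def retrodict_trend(model, question_vec, respondents_by_wave):
    """Estimate the share agreeing in each wave, including waves where the
    question was never fielded, by averaging individual-level predictions.
    respondents_by_wave maps a wave index to the respondent IDs surveyed then."""
    trend = {}
    for wave, ids in respondents_by_wave.items():
        q = np.tile(question_vec, (len(ids), 1))        # same question for everyone
        w = np.full(len(ids), wave, dtype=np.int32)
        p = model.predict([q, np.asarray(ids, dtype=np.int32), w], verbose=0)
        trend[wave] = float(p.mean())                   # predicted agreement rate
    return trend
```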

Heterogeneity in Predictability

Analysis of individual- and opinion-level AUCs reveals systematic heterogeneity:

  • Higher SES (education, income) and strong partisanship are associated with greater predictability.
  • Racial minorities and earlier periods (1970s) exhibit lower predictability.
  • Opinions highly correlated with political ideology are more predictable; controversial or weakly structured opinions are less so.

Figure 5: Coefficient plots showing subgroup differences in individual-level AUC across missing data scenarios.

Figure 6: Coefficient plots showing opinion-level predictors of AUC, including period, sample size, response variance, and ideological correlation.
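
The subgroup analysis behind these coefficient plots can be approximated by regressing per-respondent AUC on covariates. A toy sketch with invented data and a simplified covariate set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in frame: one row per respondent, with their
# individual-level AUC and a few illustrative covariates.
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({
    "auc": rng.uniform(0.6, 0.95, n),
    "education": rng.integers(8, 21, n),
    "partisan_strength": rng.integers(0, 4, n),
    "race": rng.choice(["white", "black", "other"], n),
})
fit = smf.ols("auc ~ education + partisan_strength + C(race)", data=frame).fit()
print(fit.params)  # coefficients of the kind plotted in Figure 5
```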

Model Architecture and Embedding Visualization

The embedding spaces learned by the model reflect meaningful structure:

  • Semantic embeddings cluster questions by topic.
  • Belief embeddings cluster individuals by latent belief systems.
  • Period embeddings capture temporal proximity and historical shifts.

Figure 7: t-SNE visualizations of semantic, belief, and period embeddings, colored by topic, individual cluster, and year, respectively.
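
A sketch of the visualization step, reusing the hypothetical `belief_embedding` layer name from the architecture sketch above and clustering only to color the points:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Trained respondent embeddings: (n_respondents, latent_dim).
weights = model.get_layer("belief_embedding").get_weights()[0]
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(weights)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(weights)

plt.scatter(xy[:, 0], xy[:, 1], s=2, c=clusters, cmap="tab10")
plt.title("t-SNE of learned belief embeddings")
plt.show()
```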

Implementation Considerations

  • Computational Requirements: Fine-tuning Alpaca-7b on the full GSS requires substantial GPU resources, but freezing LLM parameters and optimizing only projection and embedding layers reduces memory and compute demands.
  • Data Requirements: Model performance saturates with ~100 questions per respondent; performance is robust even with high rates of missingness.
  • Generalizability: The approach is model-agnostic and can be extended to other surveys and languages, though cross-cultural generalization remains an open question.
  • Ethical Considerations: The ability to predict unexpressed or unasked opinions raises privacy and autonomy concerns, especially for marginalized groups with lower predictability.

Theoretical and Practical Implications

The results demonstrate that fine-tuned LLMs can substantially augment traditional survey research by:

  • Enabling high-fidelity imputation and retrodiction, thus maximizing the utility of sparse or incomplete survey data.
  • Providing a principled method for counterfactual trend estimation, critical for historical and policy analysis.
  • Revealing the social and structural determinants of opinion predictability, with implications for theories of belief systems and cultural coherence.

However, the modest performance in unasked opinion prediction underscores the continued necessity of human-generated survey data for capturing the full heterogeneity of public opinion.

Future Directions

  • Multi-class and Ordinal Prediction: Extending the framework to handle non-binary response options via multi-class classification or ordinal regression.
  • Cross-Survey and Cross-Cultural Validation: Testing the transferability of fine-tuned models across different survey instruments and national contexts.
  • Dynamic and Adaptive Survey Design: Leveraging model uncertainty to optimize question selection and respondent sampling in real time.
  • Privacy-Preserving Modeling: Developing techniques to mitigate privacy risks and ensure ethical deployment in applied settings.

Conclusion

This work establishes a scalable, flexible, and empirically validated framework for AI-augmented survey research. By integrating LLMs with representative survey data and modeling individual, semantic, and temporal heterogeneity, the approach enables accurate prediction of both observed and unobserved opinions. The findings have significant implications for the design, analysis, and interpretation of social surveys, as well as for the broader integration of AI in the social sciences. The framework's limitations—particularly in predicting entirely unasked opinions and in representing minority groups—highlight the ongoing need for methodological innovation and ethical vigilance as AI becomes increasingly embedded in empirical social research.
