AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction (2305.09620v3)

Published 16 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs that produce human-like responses have begun to revolutionize research practices in the social sciences. We develop a novel methodological framework that fine-tunes LLMs with repeated cross-sectional surveys to incorporate the meaning of survey questions, individual beliefs, and temporal contexts for opinion prediction. We introduce two new emerging applications of the AI-augmented survey: retrodiction (i.e., predict year-level missing responses) and unasked opinion prediction (i.e., predict entirely missing responses). Among 3,110 binarized opinions from 68,846 Americans in the General Social Survey from 1972 to 2021, our models based on Alpaca-7b excel in retrodiction (AUC = 0.86 for personal opinion prediction, $\rho$ = 0.98 for public opinion prediction). These remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. On the other hand, our fine-tuned Alpaca-7b models show modest success in unasked opinion prediction (AUC = 0.73, $\rho$ = 0.67). We discuss practical constraints and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. Our study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs can broaden survey potential, while surveys can improve the alignment of LLMs.

Summary

  • The paper presents a novel framework that fine-tunes large language models with survey data to impute missing responses, retrodict historical trends, and predict unasked opinions.
  • The methodology integrates semantic, individual belief, and period embeddings via a Deep Cross Network to capture higher-order interactions in opinion data.
  • Evaluation on the General Social Survey shows robust performance with AUC metrics up to 0.866, demonstrating the framework’s practical and empirical validity.

AI-Augmented Surveys: Fine-Tuning LLMs for Opinion Prediction in Social Science

Introduction and Motivation

The paper "AI-Augmented Surveys: Leveraging LLMs and Surveys for Opinion Prediction" (2305.09620) addresses the longstanding challenge of predicting public opinion trends and individual attitudes in the social sciences. Traditional survey research, exemplified by the General Social Survey (GSS), is limited by cost, respondent fatigue, and the inability to ask all relevant questions across all periods. Meanwhile, digital trace data (e.g., social media) offer scale but lack representativeness and ground-truth validation. The authors propose a methodological framework that fine-tunes LLMs with repeated cross-sectional survey data, enabling the prediction of missing, retrodicted, and entirely unasked opinions at both the individual and aggregate levels.

Problem Formulation: Types of Missingness in Survey Data

The paper formalizes three core prediction tasks in survey research, each corresponding to a distinct missing data scenario:

  1. Missing Data Imputation: Predicting items that respondents skipped or left unanswered within existing survey waves.
  2. Retrodiction: Predicting responses to questions in years when they were not fielded, enabling reconstruction of historical opinion trends.
  3. Unasked Opinion Prediction: Predicting responses to questions never asked in the survey, effectively extrapolating to new variables (a code sketch of the three schemes follows the figure).

Figure 1: Three types of missing data challenges in survey research, illustrating response-level, year-level, and variable-level missingness.
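
As a concrete illustration, the three schemes can be expressed as masks over a long-format response table. This is a toy sketch; the column names, sizes, and thresholds are hypothetical and do not reflect the paper's actual data layout.

```python
import numpy as np
import pandas as pd

# Toy long-format survey: one row per (respondent, question, year) response.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "respondent": rng.integers(0, 100, 1_000),
    "question":   rng.integers(0, 20, 1_000),
    "year":       rng.choice([1994, 2000, 2008, 2021], 1_000),
    "agree":      rng.integers(0, 2, 1_000),
})

# 1. Imputation: hide individual responses at random within fielded waves.
mask_impute = rng.random(len(df)) < 0.10

# 2. Retrodiction: hide all responses to one question in selected waves.
mask_retro = (df["question"] == 5) & (df["year"] < 2008)

# 3. Unasked prediction: hide one question everywhere, across all waves.
mask_unasked = df["question"] == 7
```

Each mask defines a held-out evaluation set; the model trains on the remaining rows.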

Methodological Framework: Personalized LLMs with Embedding Interactions

The core innovation is a model architecture that integrates three types of embeddings:

  • Semantic Embedding: Encodes the meaning of survey questions using sentence-level representations from pre-trained LLMs (e.g., Alpaca-7b).
  • Individual Belief Embedding: Learns latent representations for each respondent, capturing their unique belief system.
  • Period Embedding: Encodes temporal context, allowing the model to account for historical shifts in meaning and opinion structure.

These embeddings are combined via a Deep Cross Network (DCN) to capture higher-order interactions, enabling the model to predict the probability of a positive response for any (individual, question, year) tuple.

Figure 2: Overview of the methodological framework, showing the aggregation of individual-level predictions to population-level estimates and the joint optimization of semantic, belief, and period embeddings.
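
A minimal architecture sketch in TensorFlow is shown below, using the Cross layer from TensorFlow Recommenders (the library named in the implementation section). All layer sizes, layer names, and the way the three embeddings are combined are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

N_RESPONDENTS, N_WAVES = 68_846, 33  # GSS respondents and survey waves
SEM_DIM, LATENT_DIM = 4096, 32       # illustrative; 4096 matches a 7B model's hidden size

# Inputs: a precomputed question embedding from the frozen LLM, plus integer IDs.
question_emb  = tf.keras.Input(shape=(SEM_DIM,), name="question_embedding")
respondent_id = tf.keras.Input(shape=(), dtype=tf.int32, name="respondent_id")
wave_id       = tf.keras.Input(shape=(), dtype=tf.int32, name="wave_id")

# Trainable projection of the frozen semantic embedding, plus learned
# belief (per-respondent) and period (per-wave) embeddings.
semantic = tf.keras.layers.Dense(LATENT_DIM)(question_emb)
belief   = tf.keras.layers.Embedding(N_RESPONDENTS, LATENT_DIM, name="belief_embedding")(respondent_id)
period   = tf.keras.layers.Embedding(N_WAVES, LATENT_DIM, name="period_embedding")(wave_id)

x = tf.keras.layers.Concatenate()([semantic, belief, period])
x = tfrs.layers.dcn.Cross()(x)                       # explicit feature crossing (DCN-V2)
x = tf.keras.layers.Dense(64, activation="relu")(x)  # deep component
p_agree = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(positive response)

model = tf.keras.Model([question_emb, respondent_id, wave_id], p_agree)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```

Training then reduces to binary classification over observed (respondent, question, wave) triples, with held-out cells defining the three prediction tasks above.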

Data and Implementation

The model is fine-tuned on the GSS, comprising 68,846 respondents and 3,110 binarized opinion variables across 33 survey waves (1972–2021). Survey questions are binarized using a combination of SentenceBERT-based semantic similarity and manual coding. The architecture is model-agnostic, supporting both decoder-only (Alpaca-7b, GPT-J-6B) and encoder-only (RoBERTa-large) LLMs. Fine-tuning is performed using the TensorFlow Recommenders and Hugging Face APIs, with the LLM parameters frozen except for a projection layer to mitigate overfitting.
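
For the semantic inputs, obtaining fixed question vectors from a frozen encoder through the Hugging Face API looks roughly like the sketch below. It assumes RoBERTa-large and mean pooling over token states; the paper's exact checkpoint handling and pooling strategy may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
encoder.requires_grad_(False)  # keep the LLM frozen; only downstream layers train

@torch.no_grad()
def embed_question(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one fixed-size question vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vec = embed_question("Should marijuana be made legal?")  # a GSS-style item
```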

Model Evaluation and Performance

Evaluation is conducted via 10-fold cross-validation, with distinct schemes for each missingness scenario. The primary metric is AUC (Area Under the ROC Curve), with additional reporting of accuracy and F1-score. Matrix factorization serves as a benchmark for imputation and retrodiction tasks.

Figure 3: Model performance for predicting missing responses at individual and aggregate levels, including ROC curves and the relationship between observed and predicted agreement rates.
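
The two headline quantities can be computed with standard tooling. A toy sketch with invented numbers: individual-level AUC over held-out responses, and the aggregate-level correlation between observed and predicted agreement rates within (question, year) cells.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Invented held-out data: true binary responses and model probabilities.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])
p_hat  = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.7, 0.9, 0.1, 0.6])
auc = roc_auc_score(y_true, p_hat)  # individual-level performance

# Aggregate level: three toy (question, year) cells of four responses each.
cell = np.repeat([0, 1, 2], 4)
obs  = np.array([y_true[cell == c].mean() for c in np.unique(cell)])
pred = np.array([p_hat[cell == c].mean() for c in np.unique(cell)])
rho, _ = pearsonr(obs, pred)  # the aggregate-level ρ reported in the paper
```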

Key Results

  • Missing Data Imputation: Alpaca-7b achieves AUC = 0.866, comparable to matrix factorization (AUC = 0.852).
  • Retrodiction: Alpaca-7b achieves AUC = 0.860, outperforming matrix factorization (AUC = 0.798). Correlation between predicted and observed aggregate opinions exceeds 0.98.
  • Unasked Opinion Prediction: Alpaca-7b achieves AUC = 0.729, with a lower aggregate correlation (ρ = 0.68), indicating the increased difficulty of this task.

The model's performance is robust across missing data mechanisms (missing completely at random, missing at random, and missing not at random) and degrades gracefully as the proportion of missing data increases.

The retrodiction capability enables the reconstruction of historical opinion trends for questions introduced late or discontinued early in the GSS. For example, the model accurately reconstructs the rise in support for same-sex marriage prior to the question's introduction in 2008, and predicts stable or shifting trends for issues such as busing and vegetarianism.

Figure 4: Counterfactual trend prediction for selected GSS questions, comparing model-based retrodictions to matrix factorization and observed data.
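
Operationally, retrodicting a trend amounts to scoring every respondent in a wave and averaging the predicted probabilities. The sketch below reuses the hypothetical Keras `model` from the architecture sketch above; the function and argument names are illustrative.

```python
import numpy as np

def retrodict_trend(model, question_vec, respondents_by_wave):
    """Estimate the share agreeing in each wave, including waves where the
    question was never fielded, by averaging individual-level predictions.
    respondents_by_wave maps a wave index to the respondent IDs surveyed then."""
    trend = {}
    for wave, ids in respondents_by_wave.items():
        q = np.tile(question_vec, (len(ids), 1))        # same question for everyone
        w = np.full(len(ids), wave, dtype=np.int32)
        p = model.predict([q, np.asarray(ids, dtype=np.int32), w], verbose=0)
        trend[wave] = float(p.mean())                   # predicted agreement rate
    return trend
```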

Heterogeneity in Predictability

Analysis of individual- and opinion-level AUCs reveals systematic heterogeneity:

  • Higher SES (education, income) and strong partisanship are associated with greater predictability.
  • Racial minorities and earlier periods (1970s) exhibit lower predictability.
  • Opinions highly correlated with political ideology are more predictable; controversial or weakly structured opinions are less so.

Figure 5: Coefficient plots showing subgroup differences in individual-level AUC across missing data scenarios.

Figure 6: Coefficient plots showing opinion-level predictors of AUC, including period, sample size, response variance, and ideological correlation.
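
The subgroup analysis behind these coefficient plots can be approximated by regressing per-respondent AUC on covariates. A toy sketch with invented data and a simplified covariate set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in frame: one row per respondent, with their
# individual-level AUC and a few illustrative covariates.
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({
    "auc": rng.uniform(0.6, 0.95, n),
    "education": rng.integers(8, 21, n),
    "partisan_strength": rng.integers(0, 4, n),
    "race": rng.choice(["white", "black", "other"], n),
})
fit = smf.ols("auc ~ education + partisan_strength + C(race)", data=frame).fit()
print(fit.params)  # coefficients of the kind plotted in Figure 5
```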

Model Architecture and Embedding Visualization

The embedding spaces learned by the model reflect meaningful structure:

  • Semantic embeddings cluster questions by topic.
  • Belief embeddings cluster individuals by latent belief systems.
  • Period embeddings capture temporal proximity and historical shifts.

Figure 7: t-SNE visualizations of semantic, belief, and period embeddings, colored by topic, individual cluster, and year, respectively.
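
A sketch of the visualization step, reusing the hypothetical `belief_embedding` layer name from the architecture sketch above and clustering only to color the points:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Trained respondent embeddings: (n_respondents, latent_dim).
weights = model.get_layer("belief_embedding").get_weights()[0]
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(weights)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(weights)

plt.scatter(xy[:, 0], xy[:, 1], s=2, c=clusters, cmap="tab10")
plt.title("t-SNE of learned belief embeddings")
plt.show()
```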

Implementation Considerations

  • Computational Requirements: Fine-tuning Alpaca-7b on the full GSS requires substantial GPU resources, but freezing LLM parameters and optimizing only projection and embedding layers reduces memory and compute demands.
  • Data Requirements: Model performance saturates with ~100 questions per respondent; performance is robust even with high rates of missingness.
  • Generalizability: The approach is model-agnostic and can be extended to other surveys and languages, though cross-cultural generalization remains an open question.
  • Ethical Considerations: The ability to predict unexpressed or unasked opinions raises privacy and autonomy concerns, especially for marginalized groups with lower predictability.

Theoretical and Practical Implications

The results demonstrate that fine-tuned LLMs can substantially augment traditional survey research by:

  • Enabling high-fidelity imputation and retrodiction, thus maximizing the utility of sparse or incomplete survey data.
  • Providing a principled method for counterfactual trend estimation, critical for historical and policy analysis.
  • Revealing the social and structural determinants of opinion predictability, with implications for theories of belief systems and cultural coherence.

However, the modest performance in unasked opinion prediction underscores the continued necessity of human-generated survey data for capturing the full heterogeneity of public opinion.

Future Directions

  • Multi-class and Ordinal Prediction: Extending the framework to handle non-binary response options via multi-class classification or ordinal regression.
  • Cross-Survey and Cross-Cultural Validation: Testing the transferability of fine-tuned models across different survey instruments and national contexts.
  • Dynamic and Adaptive Survey Design: Leveraging model uncertainty to optimize question selection and respondent sampling in real time.
  • Privacy-Preserving Modeling: Developing techniques to mitigate privacy risks and ensure ethical deployment in applied settings.

Conclusion

This work establishes a scalable, flexible, and empirically validated framework for AI-augmented survey research. By integrating LLMs with representative survey data and modeling individual, semantic, and temporal heterogeneity, the approach enables accurate prediction of both observed and unobserved opinions. The findings have significant implications for the design, analysis, and interpretation of social surveys, as well as for the broader integration of AI in the social sciences. The framework's limitations—particularly in predicting entirely unasked opinions and in representing minority groups—highlight the ongoing need for methodological innovation and ethical vigilance as AI becomes increasingly embedded in empirical social research.
