Large Language Model Supply Chain: A Research Agenda (2404.12736v3)

Published 19 Apr 2024 in cs.SE

Abstract: The rapid advancement of LLMs has revolutionized artificial intelligence, introducing unprecedented capabilities in natural language processing and multimodal content generation. However, the increasing complexity and scale of these models have given rise to a multifaceted supply chain that presents unique challenges across infrastructure, foundation models, and downstream applications. This paper provides the first comprehensive research agenda of the LLM supply chain, offering a structured approach to identify critical challenges and opportunities through the dual lenses of software engineering (SE) and security & privacy (S&P). We begin by establishing a clear definition of the LLM supply chain, encompassing its components and dependencies. We then analyze each layer of the supply chain, presenting a vision for robust and secure LLM development, reviewing the current state of practices and technologies, and identifying key challenges and research opportunities. This work aims to bridge the existing research gap in systematically understanding the multifaceted issues within the LLM supply chain, offering valuable insights to guide future efforts in this rapidly evolving domain.
