Prompting a Pretrained Transformer Can Be a Universal Approximator (2402.14753v1)

Published 22 Feb 2024 in cs.LG, cs.AI, and math.FA

Abstract: Despite the widespread adoption of prompting, prompt tuning, and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of a pretrained model by prompting or prefix-tuning it; formally, whether prompting and prefix-tuning a pretrained model can universally approximate sequence-to-sequence functions. This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, the attention mechanism is uniquely suited to universal approximation with prefix-tuning: a single attention head is sufficient to approximate any continuous function. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.

Summary

  • The paper shows that prefix-tuning a single attention head suffices to approximate any smooth function on the hypersphere to arbitrary precision.
  • It derives theoretical bounds linking the necessary prefix length to the desired approximation error and the sequence length.
  • The findings inform the design of transformer architectures and prompting strategies that support efficient, reliable tuning of pretrained models.

Exploring the Depths of Prefix-Tuning: A Theoretical Perspective on Transformer Universality

Introduction

Recent advances in LLMs and generative AI have benefited significantly from efficient fine-tuning of transformer models. While practices such as prompting, prefix-tuning, and soft prompting have become commonplace, their theoretical underpinnings, particularly their ability to universally approximate sequence-to-sequence functions, remain largely unexplored. This work analyzes whether prompting and prefix-tuning can serve as universal approximators, that is, whether they can steer the output of a pretrained transformer toward any target sequence-to-sequence function to arbitrary precision.

Theoretical Framework

To understand how far prefixing can reshape the sequence-to-sequence transformations a transformer computes, we evaluate it formally through the lens of universal approximation. Building on classical neural-network approximation theorems, we examine the attention mechanism's ability to approximate continuous functions on hyperspheres. In particular, we show that with prefix-tuning, a single attention head can approximate any smooth continuous function, underlining an inherent universality of the attention mechanism. Furthermore, we derive bounds on the prefix length needed to achieve a desired approximation error, giving a quantitative handle on the approximation capacity of prefix-tuning.
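
To fix notation, the setting analyzed here is standard prefix attention: prefix vectors are prepended to the input before the frozen key and value projections are applied. The display below is a generic rendering of that computation rather than the paper's exact formulation; the symbols $W_Q, W_K, W_V$ and $P$ are illustrative.

$$
\operatorname{head}(x_t; P) \;=\; \operatorname{softmax}\!\left(\frac{x_t W_Q\,\bigl[\,P W_K\,;\;X W_K\,\bigr]^{\top}}{\sqrt{d_k}}\right)\bigl[\,P W_V\,;\;X W_V\,\bigr],
$$

where $X$ is the input sequence, $x_t$ a query token, $P$ the trainable prefix, and $[\,\cdot\,;\,\cdot\,]$ denotes row-wise concatenation. Only $P$ is tuned; the pretrained projections stay frozen, so any change in behavior must come from how the prefix redirects attention.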

Results and Observations

Our analysis reveals several noteworthy outcomes:

  • A single attention head, when properly prefixed, suffices to approximate any continuous function on a hypersphere, revealing an inherent universality of the attention mechanism (a minimal sketch of this mechanism follows this list).
  • The transformer depth required for sequence-to-sequence approximation scales linearly with the sequence length and is independent of the desired accuracy, in contrast to the common expectation that higher accuracy demands greater depth.
  • We show that the prompt or prefix length scales unfavorably with the target function's complexity and the desired approximation error. This suggests practical limitations of prefix-tuning and prompting, especially for complex functions or high-accuracy requirements.
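
The NumPy snippet below is a minimal sketch of the mechanism these results rest on: a single frozen attention head whose keys and values are extended by a trainable prefix, so that different prefixes make the same pretrained head compute different functions of its input. It is an illustration under assumed, hypothetical dimensions and random weights, not the paper's construction.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prefixed_attention_head(X, P, W_q, W_k, W_v):
    """Single attention head with a trainable prefix P prepended to keys/values.

    X : (n, d) input token embeddings; W_q, W_k, W_v are frozen pretrained weights.
    P : (m, d) trainable prefix vectors -- the only parameters prefix-tuning adjusts.
    """
    Q = X @ W_q                    # queries come from the input tokens only
    K = np.vstack([P, X]) @ W_k    # prefix rows are prepended to the keys...
    V = np.vstack([P, X]) @ W_v    # ...and to the values
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V                   # prefix reshapes the output without touching the weights

# Toy illustration (hypothetical sizes): the same frozen head computes
# visibly different functions of X under two different prefixes.
rng = np.random.default_rng(0)
d, n, m = 8, 5, 3
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((n, d))
out_a = prefixed_attention_head(X, rng.standard_normal((m, d)), W_q, W_k, W_v)
out_b = prefixed_attention_head(X, rng.standard_normal((m, d)), W_q, W_k, W_v)
print(np.abs(out_a - out_b).max())   # nonzero gap: different prefix, different behavior
```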

Practical Implications

These theoretical insights carry implications for both the design and application of transformer models:

  • The work suggests that incorporating specific attention heads during pretraining can ensure the resulting model acts as a token-wise universal approximator under prefix-tuning.
  • The findings may guide the development of transformer architectures optimized for efficient prefix-tuning and prompting, enabling greater adaptability with minimal training overhead.
  • Knowing how the required prefix length grows with the desired accuracy helps gauge whether a prompting strategy is practical and computationally feasible for a given task.

Future Directions

Despite its rigor, our analysis applies to a specific class of constructed transformer models, which may differ from transformers pretrained on real-world data. This opens an avenue for future work: investigating the approximation capabilities of realistically pretrained transformers under prefix-tuning. Moreover, deriving inverse bounds and probing the practical limits of prefix-tuning and prompting in real-world applications are natural next steps.

Conclusion

Characterizing prefix-tuning and prompting as universal approximators adds a significant piece to the theoretical understanding of transformers and informs the design of more robust, adaptable, and efficient AI systems. These results invite further empirical and theoretical work on the approximation power of prefix-tuned pretrained models.