Prompting a Pretrained Transformer Can Be a Universal Approximator (2402.14753v1)
Abstract: Despite the widespread adoption of prompting, prompt tuning, and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of a pretrained model by prompting or prefix-tuning it. Formally, can prompting or prefix-tuning a pretrained model universally approximate sequence-to-sequence functions? This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, the attention mechanism is uniquely suited for universal approximation with prefix-tuning: a single attention head is sufficient to approximate any continuous function. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
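To make the setting concrete, below is a minimal sketch (not the paper's construction) of the prefix-tuning setup the abstract studies: a single frozen attention head whose behavior is steered only by prepended prefix vectors `P`, with the head's own weights left untouched. All names, dimensions, and the random initialization are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One attention head; rows of X are token embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

def prefix_attention_head(X, P, W_q, W_k, W_v):
    """Same head with a prefix P prepended to the input sequence.
    The head's weights stay frozen; all adaptation lives in P.
    Only the outputs at the original token positions are read off."""
    Z = np.vstack([P, X])        # prepend prefix tokens
    out = attention_head(Z, W_q, W_k, W_v)
    return out[P.shape[0]:]      # discard the prefix positions

# Toy usage: a random frozen head, a sequence of 5 tokens in R^8,
# and a 3-token prefix that a prefix-tuner would optimize.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
X = rng.standard_normal((5, d))
P = rng.standard_normal((3, d))  # the only "tunable" parameters
print(prefix_attention_head(X, P, W_q, W_k, W_v).shape)  # (5, 8)
```

The paper's universal-approximation claims concern how expressive the map from `P` to the head's input-output behavior can be when the pretrained weights are fixed; the sketch only fixes notation for that setup.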